Unreliable Failure Detectors for Reliable Distributed Systems


March 1996 | TUSHAR DEEPAK CHANDRA AND SAM TOUEG
The paper introduces the concept of unreliable failure detectors and studies how they can be used to solve consensus in asynchronous systems with crash failures. The authors characterize these detectors by two properties: completeness (crashed processes are eventually suspected) and accuracy (limits on the wrongful suspicion of correct processes). They demonstrate that consensus can be solved even with detectors that make an infinite number of mistakes, and they identify which detectors can solve consensus despite any number of crashes and which require a majority of correct processes. Consensus and Atomic Broadcast are shown to be reducible to each other in asynchronous systems with crash failures, so the results apply to both problems. The paper also introduces the failure detector $\diamondsuit \mathcal{W}$, which can make an infinite number of mistakes yet still provides enough information to solve consensus when a majority of processes are correct; a companion paper shows that it is the weakest failure detector for solving consensus. The authors discuss the implementation of such detectors and their practical implications, emphasizing the importance of the asynchronous model of computation in fault-tolerant distributed systems.
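To make the idea of a detector that errs but eventually settles more concrete, here is a minimal Python sketch of a timeout-based detector that retracts a false suspicion and then waits longer for that peer. It is an illustration under stated assumptions, not the paper's algorithm: the class `TimeoutDetector`, the function `simulate`, the peer names, and the integer logical clock are all hypothetical, and the run assumes that heartbeat delays are eventually bounded, which a purely asynchronous system does not guarantee.

```python
from dataclasses import dataclass, field


@dataclass
class TimeoutDetector:
    """Suspects a peer whose heartbeat is overdue; backs off after a mistake."""

    peers: list
    initial_timeout: int = 2
    timeout: dict = field(default_factory=dict)
    last_heard: dict = field(default_factory=dict)
    suspected: set = field(default_factory=set)
    mistakes: int = 0

    def __post_init__(self):
        for p in self.peers:
            self.timeout[p] = self.initial_timeout
            self.last_heard[p] = 0

    def on_heartbeat(self, peer, now):
        """Record a heartbeat; if the peer was suspected, that was a mistake."""
        self.last_heard[peer] = now
        if peer in self.suspected:
            self.suspected.discard(peer)   # retract the false suspicion
            self.timeout[peer] += 1        # be more patient with this peer
            self.mistakes += 1

    def tick(self, now):
        """Suspect every peer whose last heartbeat is older than its timeout."""
        for p in self.peers:
            if now - self.last_heard[p] > self.timeout[p]:
                self.suspected.add(p)
        return set(self.suspected)


def simulate(steps=50):
    """Hypothetical run: one fast peer, one slow peer, one peer that crashes."""
    detector = TimeoutDetector(["fast", "slow", "crashed"])
    for now in range(1, steps + 1):
        detector.on_heartbeat("fast", now)         # heartbeat every tick
        if now % 5 == 0:
            detector.on_heartbeat("slow", now)     # correct, but slow
        if now <= 3:
            detector.on_heartbeat("crashed", now)  # silent after tick 3
        detector.tick(now)
    return detector


if __name__ == "__main__":
    d = simulate()
    print("suspected:", d.suspected)
    print("mistakes:", d.mistakes)
    print("timeouts:", d.timeout)
```

In this run the detector wrongly suspects the slow peer twice, lengthens that peer's timeout each time, and then stops making mistakes, while the crashed peer stays suspected permanently. This mirrors the two properties informally: the crashed peer is eventually suspected for good (completeness), and the correct peers are eventually no longer suspected (a form of eventual accuracy).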