[slides] Unreliable failure detectors for reliable distributed systems

This paper introduces the concept of unreliable failure detectors and studies how they can be used to solve Consensus in asynchronous systems with crash failures. The authors characterize unreliable failure detectors in terms of two properties—completeness and accuracy. They show that Consensus can be solved even with unreliable failure detectors that make an infinite number of mistakes, and determine which ones can be used to solve Consensus despite any number of crashes, and which ones require a majority of correct processes. They prove that Consensus and Atomic Broadcast are reducible to each other in asynchronous systems with crash failures; thus, the above results also apply to Atomic Broadcast. A companion paper shows that one of the failure detectors introduced here is the weakest failure detector for solving Consensus. The paper discusses the challenges of solving Consensus in asynchronous systems, where deterministic solutions are impossible due to the difficulty of determining whether a process has crashed or is simply slow. To address this, the authors propose using unreliable failure detectors, which can make mistakes but still provide enough information to solve Consensus. They define failure detectors in terms of abstract properties rather than specific implementations, allowing for modular design and correctness proofs without relying on low-level network parameters. The authors introduce eight classes of failure detectors based on completeness and accuracy properties. They show that certain failure detectors can be used to solve Consensus in systems with any number of process failures, while others require a majority of correct processes. They also show that Consensus is equivalent to Atomic Broadcast in asynchronous systems with crash failures, and that the weakest failure detector for solving Consensus is the weakest failure detector for solving Atomic Broadcast. The paper concludes that unreliable failure detectors provide a natural and simple extension of the asynchronous model of computation, allowing for the deterministic solution of Consensus and Atomic Broadcast. This approach retains the advantages of asynchrony without inheriting its disadvantages. The authors also show that one of their algorithms is quite efficient, achieving Consensus within two "asynchronous rounds" in most runs.This paper introduces the concept of unreliable failure detectors and studies how they can be used to solve Consensus in asynchronous systems with crash failures. The authors characterize unreliable failure detectors in terms of two properties—completeness and accuracy. They show that Consensus can be solved even with unreliable failure detectors that make an infinite number of mistakes, and determine which ones can be used to solve Consensus despite any number of crashes, and which ones require a majority of correct processes. They prove that Consensus and Atomic Broadcast are reducible to each other in asynchronous systems with crash failures; thus, the above results also apply to Atomic Broadcast. A companion paper shows that one of the failure detectors introduced here is the weakest failure detector for solving Consensus. The paper discusses the challenges of solving Consensus in asynchronous systems, where deterministic solutions are impossible due to the difficulty of determining whether a process has crashed or is simply slow. To address this, the authors propose using unreliable failure detectors, which can make mistakes but still provide enough information to solve Consensus. They define failure detectors in terms of abstract properties rather than specific implementations, allowing for modular design and correctness proofs without relying on low-level network parameters. The authors introduce eight classes of failure detectors based on completeness and accuracy properties. They show that certain failure detectors can be used to solve Consensus in systems with any number of process failures, while others require a majority of correct processes. They also show that Consensus is equivalent to Atomic Broadcast in asynchronous systems with crash failures, and that the weakest failure detector for solving Consensus is the weakest failure detector for solving Atomic Broadcast. The paper concludes that unreliable failure detectors provide a natural and simple extension of the asynchronous model of computation, allowing for the deterministic solution of Consensus and Atomic Broadcast. This approach retains the advantages of asynchrony without inheriting its disadvantages. The authors also show that one of their algorithms is quite efficient, achieving Consensus within two "asynchronous rounds" in most runs.

Unreliable Failure Detectors for Reliable Distributed Systems

March 1996 | TUSHAR DEEPAK CHANDRA AND SAM TOUEG