Reliable Communication in the Presence of Failures

February 1987 | KENNETH P. BIRMAN and THOMAS A. JOSEPH
This paper presents communication primitives for distributed computations in systems where failures may occur. It focuses on halting failures, in which a process stops without taking any incorrect action. Each computation is represented as a set of events with a partial order. The paper argues that event orderings should be handled at the communication layer, so that a process can deduce the event orderings observed by other processes; this simplifies higher-level code and reduces inconsistent actions.

The paper introduces fault-tolerant process groups, which are collections of cooperating processes that use the communication protocols. A key protocol is causal broadcast (CBCAST), which enforces causal delivery orderings and so keeps processes consistent with one another. The paper also discusses group broadcast (GBCAST), which maintains process group views, and atomic broadcast (ABCAST), which ensures global delivery orderings. Logical clocks and labels are used to enforce causal relationships between events.

These protocols address a central challenge in distributed systems: staying consistent when processes fail and recover. Because they ensure consistent event orderings, processes can act reliably even when failures occur.

The paper describes an implementation of the protocols in a local network, including a garbage collection mechanism that manages message buffers. It also covers their use in a hierarchical distributed system, where communication is structured into clusters connected by long-haul links.

The paper concludes that these protocols enable reliable communication in distributed systems, ensuring consistency and correctness even in the presence of failures. They are designed to handle process failures, recoveries, and dynamic changes to group properties. The approach simplifies higher-level algorithms and ensures that processes observe consistent event orderings.
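
To make the idea of causal delivery ordering concrete, here is a minimal Python sketch. It is not the paper's CBCAST protocol: it assumes a fixed set of processes, no failures, and vector timestamps, a technique used in many later causal-broadcast designs, where the summary above mentions only logical clocks and labels. The names `Message`, `Process`, `broadcast`, and `receive` are hypothetical and chosen for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Message:
    sender: int    # index of the sending process
    stamp: list    # copy of the sender's vector timestamp at send time
    payload: str


@dataclass
class Process:
    pid: int       # this process's index, 0 <= pid < n
    n: int         # total number of processes (assumed fixed)
    clock: list = field(default_factory=list)      # broadcasts delivered, per sender
    pending: list = field(default_factory=list)    # received but not yet deliverable
    delivered: list = field(default_factory=list)  # payloads in delivery order

    def __post_init__(self):
        self.clock = [0] * self.n

    def broadcast(self, payload):
        # Count the send, deliver it locally, and stamp the outgoing copy.
        self.clock[self.pid] += 1
        self.delivered.append(payload)
        return Message(self.pid, list(self.clock), payload)

    def _deliverable(self, m):
        # m must be the next broadcast from its sender, and everything the
        # sender had delivered before sending m must already be delivered here.
        for k in range(self.n):
            if k == m.sender:
                if m.stamp[k] != self.clock[k] + 1:
                    return False
            elif m.stamp[k] > self.clock[k]:
                return False
        return True

    def receive(self, m):
        # Buffer the message, then deliver as many pending messages as possible.
        self.pending.append(m)
        progressed = True
        while progressed:
            progressed = False
            for q in list(self.pending):
                if self._deliverable(q):
                    self.pending.remove(q)
                    self.clock[q.sender] += 1
                    self.delivered.append(q.payload)
                    progressed = True
```

In this sketch, if process 1 broadcasts a reply after delivering a message from process 0, a third process that receives the reply first holds it in `pending` until the original message arrives. Every process therefore delivers the two messages in their causal order, which is the property the summary attributes to CBCAST.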