Fault tolerance is essential in computing systems to ensure reliability and availability despite component failures. A fault-tolerant system either exhibits well-defined, predictable failure behavior or masks component failures from its users so that service continues. As more critical functions come to depend on computing systems, the demand for fault tolerance grows. Designing such systems is difficult, both because managing failures is inherently complex and because the terminology of the field is ambiguous. This article introduces a structured approach to understanding fault tolerance in distributed systems, focusing on architectural concepts, failure classification, and masking techniques.
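The article first presents the basic architectural concepts of services, servers, and the "depends" relation. A server implements a service, and a server depends on another server when its correct behavior relies on the correct behavior of that other server. Failures are classified as omission failures (no response), timing failures (a correct response delivered too early or too late), and response failures (an incorrect response), with special cases such as the crash failure, in which a server stops responding altogether. A server's failure semantics is the set of failure behaviors it is likely to exhibit, and it determines what the server's users must be prepared to handle.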
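The classification above can be made concrete with a small sketch. It is only an illustration under stated assumptions: the names `Failure`, `Observation`, `classify`, and `looks_like_crash` are hypothetical and do not come from the article, and a response is reduced to an integer value plus an elapsed time checked against an assumed acceptable interval.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List, Optional


class Failure(Enum):
    NONE = "none"
    OMISSION = "omission"   # no response at all
    RESPONSE = "response"   # a response with the wrong value
    TIMING = "timing"       # a correct value, delivered too early or too late


@dataclass
class Observation:
    value: Optional[int]    # None models "no response arrived" (assumption)
    elapsed: float          # seconds between request and response


def classify(obs: Observation, expected: int,
             earliest: float, latest: float) -> Failure:
    """Classify one observed response against a hypothetical specification."""
    if obs.value is None:
        return Failure.OMISSION
    if obs.value != expected:
        return Failure.RESPONSE
    if not (earliest <= obs.elapsed <= latest):
        return Failure.TIMING
    return Failure.NONE


def looks_like_crash(history: List[Failure]) -> bool:
    """True if, after the first omission, every later request is also omitted."""
    seen_omission = False
    for failure in history:
        if failure is Failure.OMISSION:
            seen_omission = True
        elif seen_omission:
            return False
    return seen_omission
```

For example, `classify(Observation(4, 2.5), expected=4, earliest=0.0, latest=1.0)` yields `Failure.TIMING`, because the value is right but arrives after the assumed deadline.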
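In hierarchical failure masking, a higher-level server masks the failures of the lower-level servers it depends on, and a failure that cannot be masked at one abstraction level propagates to the next level up. In group failure masking, a group of redundant servers implements the same service so that the failure of individual members is hidden from users. The choice of failure semantics affects both the design and the cost of a system. Servers with stronger (more restrictive) failure semantics are more complex and expensive to build, while servers with weaker semantics shift the burden to their users, whose failure handling and masking then become more costly and slower.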
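A minimal sketch of group failure masking illustrates how the members' failure semantics drive the cost of masking. The function names and the modeling of a server as a callable that returns `None` on omission are assumptions made for this example, not the article's notation. If members only crash or omit responses, any single reply can be trusted, so k+1 members suffice to mask k failures. If members may return wrong values, a majority vote is needed, which takes 2k+1 members to outvote k faulty ones.

```python
from collections import Counter
from typing import Callable, Optional, Sequence

# Hypothetical server model: a callable that returns None when it omits a reply.
Server = Callable[[int], Optional[int]]


def mask_crash(group: Sequence[Server], request: int) -> int:
    """Return the first reply; adequate when members have crash/omission semantics."""
    for server in group:
        reply = server(request)
        if reply is not None:
            return reply
    # Nothing left to mask with: the failure propagates to the caller's level.
    raise RuntimeError("group failure: every member omitted a response")


def mask_by_vote(group: Sequence[Server], request: int) -> int:
    """Return the strict-majority reply; needed when members may answer wrongly."""
    replies = [r for r in (server(request) for server in group) if r is not None]
    if replies:
        value, count = Counter(replies).most_common(1)[0]
        if count > len(group) // 2:
            return value
    raise RuntimeError("group failure: no majority among member replies")


def ok(x: int) -> int:
    return x * 2          # a correct member


def crashed(x: int) -> Optional[int]:
    return None           # a member that omits every reply


def wrong(x: int) -> int:
    return -1             # a member that returns an incorrect value
```

With these stand-ins, `mask_crash([crashed, ok], 3)` and `mask_by_vote([ok, wrong, ok], 3)` both return the correct reply. The raised exception is where hierarchical masking takes over: the caller one level up either masks the group failure itself or passes it on.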
The article examines various hardware and software architectural issues, including replaceable units, failure semantics, and masking techniques. Examples of fault-tolerant systems, such as Tandem, VAX Cluster, IBM XRF, Stratus, and Sequoia, illustrate different approaches to achieving fault tolerance. These systems use redundancy, duplication, and masking to ensure reliability and availability, even in the face of hardware or software failures. The article concludes that balancing failure detection, recovery, and masking redundancy at different abstraction levels is crucial for optimal cost, performance, and dependability.