This paper presents a methodology for designing fault-tolerant computing systems based on the concept of fail-stop processors. A fail-stop processor automatically halts in response to an internal failure and does not perform incorrect state transformations. The authors describe how such processors can be implemented and use axiomatic program verification techniques to develop provably correct programs for them. They illustrate the methodology with a process control system.

Fail-stop processors are characterized by their simple failure-mode operating characteristics. Each has volatile and stable storage. When the processor fails, it halts and the contents of volatile storage are lost; these are the only visible effects of a failure. Hardware limitations make it challenging to implement fail-stop processors, but systems can be designed that approximate fail-stop behavior with high probability. The authors describe a method that uses solutions to the Byzantine Generals Problem to achieve this.

The paper also outlines a framework for programming fail-stop processors, built around recovery protocols that let a program restart after a failure and still complete a correct state transformation. It presents axioms for fault-tolerant actions and discusses how to verify their correctness using axiomatic methods. The authors address termination and response-time constraints in the presence of failures, and propose strategies for designing systems that tolerate up to k failures while meeting real-time requirements. The paper concludes by applying the methodology to a process control system, demonstrating how fault-tolerant actions can ensure correct operation despite failures.
The methodology is shown to be effective in verifying existing fault-tolerant protocols and in developing new ones.
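
To make the ideas above concrete, here is a minimal Python sketch of a simulated fail-stop processor and a fault-tolerant action. It is only an illustration of the behavior the summary describes, not the paper's notation or implementation. The names `FailStopProcessor`, `ProcessorFailure`, `fault_tolerant_action`, `fail_at`, and `max_restarts` are hypothetical, and failures are injected at chosen step numbers purely for demonstration.

```python
class ProcessorFailure(Exception):
    """Simulated halt of a fail-stop processor (hypothetical name)."""


class FailStopProcessor:
    """Toy model: stable storage survives a failure, volatile storage does not."""

    def __init__(self, fail_at=()):
        self.stable = {}               # survives failures
        self.volatile = {}             # lost on every failure
        self._fail_at = set(fail_at)   # step numbers at which a failure is injected
        self._step = 0

    def step(self):
        """Call before each state transformation; halts if a failure is scheduled."""
        self._step += 1
        if self._step in self._fail_at:
            self.volatile.clear()      # the only visible effect besides halting
            raise ProcessorFailure(f"halted at step {self._step}")


def fault_tolerant_action(proc, action, recovery, max_restarts):
    """Run `action`; after each failure, restart with `recovery`.

    The recovery protocol may itself be interrupted, in which case it is
    restarted. `max_restarts` is an assumed bound standing in for "up to k
    failures". Returns the number of restarts that were needed.
    """
    body = action
    restarts = 0
    while True:
        try:
            body(proc)
            return restarts
        except ProcessorFailure:
            restarts += 1
            if restarts > max_restarts:
                raise
            body = recovery


def add_readings(proc):
    """Restartable action: inputs are read from stable storage only."""
    proc.step()
    proc.volatile["sum"] = sum(proc.stable["readings"])
    proc.step()
    proc.stable["total"] = proc.volatile["sum"]


proc = FailStopProcessor(fail_at={2})   # inject one failure at the second step
proc.stable["readings"] = [3, 4, 5]
restarts = fault_tolerant_action(proc, add_readings, add_readings, max_restarts=3)
print(proc.stable["total"], restarts)
```

In this sketch the action serves as its own recovery protocol because it depends only on stable storage, so rerunning it from the beginning after a failure yields the same result.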

Fail-Stop Processors: An Approach to Designing Fault-Tolerant Computing Systems

August 1983 | RICHARD D. SCHLICHTING and FRED B. SCHNEIDER