[slides] System structure for software fault tolerance

The paper presents a method for structuring complex computing systems using recovery blocks, conversations, and fault-tolerant interfaces to enable dependable error detection and recovery. The goal is to address errors caused by design inadequacies, particularly in software, rather than hardware failures alone. The method therefore relies on redundancy in design rather than simple replication. It is based on the concept of 'standby sparing': a recovery block consists of a main component, an acceptance test that verifies the correctness of its operation, and one or more alternates that serve as spares. If an error occurs, the system state is automatically reset to the state before the main component was used, and an alternate is tried so that the system can continue functioning. The scheme rests on the idea that software errors result from design errors and that automated diagnosis is not necessary. The paper discusses the use of recovery blocks in various scenarios, including interactions between processes and multi-level systems. It also discusses the use of recursive caches to save and restore the system state, and the importance of structured programming for reliability. The paper concludes that the recovery block scheme provides a general framework for handling errors in software systems in a systematic and reliable manner.
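The recovery block behaviour summarized above can be illustrated with a minimal Python sketch. This is not the paper's notation or implementation: the names `recovery_block`, `RecoveryBlockError`, `faulty_sort`, `simple_sort`, and `make_test` are hypothetical, and a full deep copy of the state stands in for the recursive cache, which is a simplifying assumption.

```python
import copy


class RecoveryBlockError(Exception):
    """Raised when every alternate fails the acceptance test."""


def recovery_block(state, acceptance_test, alternates):
    """Try each alternate in turn on a fresh copy of the saved state.

    Returns the first resulting state that passes the acceptance test.
    Assumption: a deep copy is used as the recovery point; the paper's
    recursive cache is only approximated here.
    """
    checkpoint = copy.deepcopy(state)  # recovery point taken on entry
    for alternate in alternates:
        working = copy.deepcopy(checkpoint)  # reset to the state on entry
        try:
            alternate(working)
        except Exception:
            continue  # an exception inside an alternate is treated as an error
        if acceptance_test(working):
            return working
    raise RecoveryBlockError("all alternates failed the acceptance test")


def faulty_sort(s):
    # Hypothetical main component with a design error: drops the largest item.
    s["data"] = sorted(s["data"])[:-1]


def simple_sort(s):
    # Hypothetical spare: an independently written alternate.
    s["data"] = sorted(s["data"])


def make_test(original):
    # Acceptance test: same number of items, and in non-decreasing order.
    def ok(s):
        d = s["data"]
        return len(d) == len(original) and all(a <= b for a, b in zip(d, d[1:]))
    return ok


state = {"data": [3, 1, 2]}
result = recovery_block(state, make_test(state["data"]), [faulty_sort, simple_sort])
```

Here the main component fails the acceptance test, the state is reset, and the spare produces the accepted result. Note that the acceptance test checks plausibility of the outcome rather than diagnosing what went wrong in the failed component.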

SYSTEM STRUCTURE FOR SOFTWARE FAULT TOLERANCE

B. Randell