A Survey of Rollback-Recovery Protocols in Message-Passing Systems

E. N. (Mootaz) Elnozahy, Lorenzo Alvisi, Yi-Min Wang, and David B. Johnson

This survey covers rollback-recovery techniques in message-passing systems that do not require special language constructs. It classifies these protocols into *checkpoint-based* and *log-based* categories. *Checkpoint-based* protocols rely solely on checkpointing to restore system state; the checkpointing can be coordinated, uncoordinated, or communication-induced. *Log-based* protocols combine checkpointing with logging of nondeterministic events, which are encoded in tuples called *determinants*. Log-based protocols can be pessimistic, optimistic, or causal, depending on how determinants are logged.

The survey highlights research issues and presents solutions for rollback recovery, compares the performance of different protocols with respect to desirable properties, and discusses practical implementation issues. It also addresses the challenge of rollback propagation: in message-passing systems, inter-process dependencies created during failure-free operation can trigger the domino effect, in which one rollback forces others and much of the completed work is lost.

The survey covers the following topics:

- System model and terminology
- Consistent system states and interactions with the outside world
- In-transit messages and their handling in reliable and unreliable communication channels
- Logging protocols and the piecewise deterministic (PWD) assumption
- Stable storage and garbage collection
- Checkpoint-based rollback recovery, including uncoordinated, coordinated, and communication-induced checkpointing
- Log-based rollback recovery, including pessimistic, optimistic, and causal logging

Together, these sections give an overview of the fundamental concepts, implementation issues, and performance metrics for rollback-recovery protocols in distributed systems.
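
As a rough illustration of rollback propagation and the domino effect with uncoordinated checkpoints, the following minimal Python sketch rolls processes back until no orphan message remains (a message whose receipt is recorded in the receiver's checkpoint but whose send is not recorded in the sender's). The `Message` type, the interval numbering, and the function names are assumptions made for this example and are not taken from the survey. Interval `i` is assumed to be the execution between checkpoint `i` and checkpoint `i + 1`, so checkpoint `c` reflects every event in intervals below `c`.

```python
from typing import Dict, List, NamedTuple


class Message(NamedTuple):
    """Hypothetical record of one message (names are illustrative)."""
    sender: str
    send_interval: int   # interval in which the sender sent it
    receiver: str
    recv_interval: int   # interval in which the receiver delivered it


def is_orphan(msg: Message, line: Dict[str, int]) -> bool:
    """True if the receiver's checkpoint reflects the receipt
    but the sender's checkpoint does not reflect the send."""
    received = msg.recv_interval < line[msg.receiver]
    sent = msg.send_interval < line[msg.sender]
    return received and not sent


def recovery_line(start: Dict[str, int], messages: List[Message]) -> Dict[str, int]:
    """Roll receivers of orphan messages back until the line is consistent."""
    line = dict(start)
    changed = True
    while changed:
        changed = False
        for msg in messages:
            if is_orphan(msg, line):
                # Undo the receipt: restart the receiver from the checkpoint
                # that opens the interval in which the message arrived.
                line[msg.receiver] = msg.recv_interval
                changed = True
    return line


# Two processes whose checkpoints interleave with crossing messages.
messages = [
    Message("P1", 1, "P0", 1),  # a
    Message("P0", 1, "P1", 0),  # b
    Message("P1", 0, "P0", 0),  # c
]

# P1 restarts from checkpoint 1; P0 holds checkpoint 2.
print(recovery_line({"P0": 2, "P1": 1}, messages))
```

In this made-up scenario each rollback exposes another orphan message, so both processes end up at checkpoint 0, their initial state. With only the first message present, the rollback would stop at checkpoint 1 on both processes.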