A Survey of Rollback-Recovery Protocols in Message-Passing Systems

E.N. (Mootaz) Elnozahy, Lorenzo Alvisi, Yi-Min Wang, David B. Johnson
This survey discusses rollback-recovery protocols in message-passing systems, focusing on techniques that do not require special language constructs. It classifies the protocols as checkpoint-based or log-based. Checkpoint-based protocols rely solely on checkpoints to restore system state, while log-based protocols combine checkpointing with logging of nondeterministic events, encoded in tuples called determinants. Log-based protocols are pessimistic, optimistic, or causal, depending on how determinants are logged. The survey highlights research issues in rollback recovery and presents current solutions, comparing the protocols with respect to a set of desirable properties and discussing practical implementation issues.
The survey covers the definitions, fundamental concepts, and implementation issues of rollback-recovery protocols in distributed systems. It excludes the use of rollback recovery in related fields such as hardware-level instruction retry, distributed shared memory, real-time systems, and debugging. It also excludes rollback recovery in systems with Byzantine failures or other non-fail-stop models. The focus is on transparent techniques, in which the system takes checkpoints and recovers from failures automatically, without intervention from the application or the programmer.
Message-passing systems complicate rollback recovery because messages induce inter-process dependencies during failure-free operation. Upon a failure, these dependencies may force processes that did not fail to roll back as well, which is called rollback propagation. The domino effect may occur when each process takes its checkpoints independently: rollback propagation can then extend back to the initial state of the computation, losing all work performed before the failure. Coordinated checkpointing prevents the domino effect by having processes coordinate their checkpoints so that they form a consistent global state. Communication-induced checkpointing forces processes to take checkpoints based on information piggybacked on application messages.
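Rollback propagation and the domino effect described above can be illustrated with a small simulation. This is a minimal sketch under simplifying assumptions that are not taken from the survey: events carry global integer timestamps, every process restarts from a checkpoint, and the names `recovery_line`, `checkpoints`, and `messages` are hypothetical. A set of checkpoints is treated as inconsistent when it records the receipt of a message whose send it does not record (an orphan message), and the receiver is rolled back until no such message remains.

```python
from bisect import bisect_left


def recovery_line(checkpoints, messages):
    """Return the most recent consistent set of checkpoints.

    checkpoints: dict mapping process name -> sorted list of checkpoint
                 times, where time 0 is the initial state (assumed model).
    messages:    list of (sender, sent_at, receiver, received_at) tuples
                 with positive times that differ from checkpoint times.
    """
    # Start optimistically from every process's latest checkpoint.
    line = {proc: times[-1] for proc, times in checkpoints.items()}
    changed = True
    while changed:
        changed = False
        for sender, sent_at, receiver, received_at in messages:
            # Orphan message: the receiver's checkpoint records the receipt,
            # but the sender's checkpoint does not record the send.
            if sent_at > line[sender] and received_at <= line[receiver]:
                times = checkpoints[receiver]
                # Roll the receiver back to its latest checkpoint taken
                # strictly before the receipt, which undoes the receipt.
                line[receiver] = times[bisect_left(times, received_at) - 1]
                changed = True
    return line


# Independent checkpoints interleaved with crossing messages: each rollback
# creates a new orphan message, so both processes fall back to time 0.
independent = {"P": [0, 30, 70], "Q": [0, 50, 90]}
crossing = [("P", 80, "Q", 85), ("Q", 60, "P", 65),
            ("P", 40, "Q", 45), ("Q", 10, "P", 20)]
domino = recovery_line(independent, crossing)

# Checkpoints taken at a coordinated point with no message crossing them:
# the latest checkpoints already form a consistent state.
coordinated = {"P": [0, 50], "Q": [0, 50]}
traffic = [("P", 10, "Q", 20), ("Q", 60, "P", 65)]
stable_line = recovery_line(coordinated, traffic)
```

In the first scenario the recovery line is the initial state of both processes, which is the domino effect. In the second, both processes restart from their checkpoints at time 50 and only the work done after that point is lost.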
Log-based rollback recovery relies on the piecewise deterministic (PWD) assumption, under which a process can deterministically recreate its pre-failure state even if that state has not been checkpointed. It comes in three flavors: pessimistic, optimistic, and causal. Pessimistic logging blocks the application until each determinant is stored on stable storage, optimistic logging does not block and writes determinants asynchronously, and causal logging aims to combine the low failure-free overhead of optimistic logging with most of the advantages of pessimistic logging. The approaches differ in their requirements for garbage collection and in how they handle interactions with the outside world.
The remainder of the survey covers the system model and terminology, checkpoint-based and log-based protocols, implementation issues, and conclusions. It discusses the use of stable storage, garbage collection, and the impact of communication reliability on checkpointing. It also explores coordinated and communication-induced checkpointing, including the domino effect, non-blocking coordination, and minimal checkpoint coordination. The survey concludes with an overview of log-based rollback recovery, emphasizing its advantages in systems that interact frequently with the outside world.
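The difference between pessimistic and optimistic logging of determinants can be sketched as follows. This is an illustrative toy, not the survey's protocol: the `Determinant` fields, the `Logger` class, and the `run` helper are all hypothetical, stable storage is modeled as an in-memory list that survives a simulated crash, and causal logging is left out.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Determinant:
    # Assumed layout: enough to identify one message delivery.
    sender: str
    send_seq: int
    receiver: str
    recv_seq: int


@dataclass
class Logger:
    pessimistic: bool
    stable: list = field(default_factory=list)    # survives a crash
    volatile: list = field(default_factory=list)  # lost on a crash

    def log(self, det):
        if self.pessimistic:
            # Stands in for a blocking write: the caller continues only
            # once the determinant is on stable storage.
            self.stable.append(det)
        else:
            # Optimistic: buffer in memory and return immediately.
            self.volatile.append(det)

    def flush(self):
        # Asynchronous write-out of buffered determinants (optimistic mode).
        self.stable.extend(self.volatile)
        self.volatile.clear()

    def crash(self):
        # A crash wipes volatile memory; only stable storage remains.
        self.volatile.clear()
        return list(self.stable)


def run(pessimistic, flush_after=None):
    """Deliver three messages to P, then crash; return surviving determinants."""
    logger = Logger(pessimistic)
    for seq in range(1, 4):
        logger.log(Determinant("Q", seq, "P", seq))
        if seq == flush_after:
            logger.flush()
    return logger.crash()


survived_pessimistic = run(pessimistic=True)
survived_optimistic = run(pessimistic=False, flush_after=2)
```

Under these assumptions the pessimistic run keeps all three determinants, so every delivery can be replayed after the crash. The optimistic run loses the determinant that was still buffered, and a real protocol would then have to roll back any process that depends on the lost event.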