Section: Scientific Foundations
High availability
“A distributed system is one that stops you getting any work done when a machine you've never even heard about crashes.” (Leslie Lamport)
The availability [73] of a system measures the ratio of service accomplishment conforming to its specifications, with respect to elapsed time. A system fails when it does not behave in a manner consistent with its specifications. An error is the consequence of a fault when the faulty part of the system is activated, and it may lead to a system failure. In order to provide highly available systems, fault tolerance techniques [79] based on redundancy can be implemented. Abstractions such as group membership, atomic multicast, and consensus have been defined for fault-tolerant distributed systems.
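As a rough illustration of the availability ratio defined above, the following sketch computes it from measured time in service. The function names are hypothetical. The second function uses the common steady-state approximation MTTF / (MTTF + MTTR), which is an assumption added here and not a formula taken from [73].

```python
def availability(correct_service_seconds: float, elapsed_seconds: float) -> float:
    """Ratio of time spent delivering correct service to total elapsed time."""
    if elapsed_seconds <= 0:
        raise ValueError("elapsed time must be positive")
    if not 0 <= correct_service_seconds <= elapsed_seconds:
        raise ValueError("service time must lie between 0 and elapsed time")
    return correct_service_seconds / elapsed_seconds


def steady_state_availability(mttf: float, mttr: float) -> float:
    """Assumed approximation: mean time to failure over the mean
    failure-plus-repair cycle length."""
    if mttf <= 0 or mttr < 0:
        raise ValueError("mttf must be positive and mttr non-negative")
    return mttf / (mttf + mttr)
```

Under this approximation, shortening repair time improves availability just as lengthening the time between failures does, which is one reason the error and fault treatment techniques in this section matter.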
Error detection is the first step in any fault tolerance strategy. Error treatment aims to prevent a detected error from leading to a system failure.
Fault treatment consists in preventing the fault from being activated again. Two classes of techniques can be used for fault treatment: repair, which consists in eliminating or replacing the faulty module; and reconfiguration, which consists in transferring the load of the faulty element to valid components.
Error treatment takes one of two forms: error masking or error recovery. Error masking relies on hardware or software redundancy to allow the system to deliver its service despite the error. Error recovery consists in restoring a correct system state from an erroneous state. In forward error recovery techniques, the erroneous state is transformed into a safe state. Backward error recovery consists in periodically saving the system state (each saved state is called a checkpoint) and rolling back to the last saved state if an error is detected.
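Backward error recovery as described above can be sketched in a few lines. This is a minimal, in-memory illustration only: the class `CheckpointedCounter`, the helper `run`, and the `checkpoint_every` parameter are hypothetical names, and a real system would save its checkpoints to stable storage, not to a Python object.

```python
import copy


class CheckpointedCounter:
    """Hypothetical service whose state can be saved and restored."""

    def __init__(self):
        self.state = {"processed": 0, "total": 0}
        # The initial state serves as the first checkpoint.
        self._checkpoint = copy.deepcopy(self.state)

    def checkpoint(self):
        """Save a copy of the current state."""
        self._checkpoint = copy.deepcopy(self.state)

    def rollback(self):
        """Restore the last saved state."""
        self.state = copy.deepcopy(self._checkpoint)

    def process(self, value):
        # The state is updated in two steps, so an error detected between
        # them leaves the state erroneous (counter bumped, total not updated).
        self.state["processed"] += 1
        if not isinstance(value, int):
            raise ValueError(f"cannot process {value!r}")
        self.state["total"] += value


def run(values, checkpoint_every=2):
    """Process values, checkpointing periodically and rolling back on error."""
    service = CheckpointedCounter()
    for index, value in enumerate(values, start=1):
        try:
            service.process(value)
        except ValueError:
            service.rollback()  # error detected: return to the last checkpoint
            continue
        if index % checkpoint_every == 0:
            service.checkpoint()
    return service.state
```

Note that rolling back discards all work done since the last checkpoint, including correct work, so the checkpoint interval trades saving overhead against the amount of work lost on recovery.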
Stable storage guarantees three properties in the presence of failures: (1) integrity: data stored in stable storage is not altered by failures; (2) accessibility: data stored in stable storage remains accessible despite failures; (3) atomicity: updating data stored in stable storage is an all-or-nothing operation. If a failure occurs while a group of data items in stable storage is being updated, either all items remain in their initial state or all of them take their new values.
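The atomicity property can be approximated on an ordinary file system with the write-then-rename idiom. The sketch below is an illustration under stated assumptions, not an implementation of stable storage: the function names are hypothetical, the group of data is assumed to fit in one JSON document, and the rename is assumed to be atomic, as it is on POSIX file systems. Integrity and accessibility additionally require redundancy (for example, replicated disks), which is not shown.

```python
import json
import os
import tempfile


def atomic_write_json(path, data):
    """Replace the file at `path` with `data`, all or nothing."""
    directory = os.path.dirname(os.path.abspath(path))
    # Write the new version to a temporary file in the same directory,
    # so that the final rename stays within one file system.
    fd, tmp_path = tempfile.mkstemp(dir=directory, prefix=".tmp-")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as tmp:
            json.dump(data, tmp)
            tmp.flush()
            os.fsync(tmp.fileno())  # force the new contents to disk
        # Readers see either the old file or the new one, never a mix.
        os.replace(tmp_path, path)
    except BaseException:
        # The update failed: discard the partial copy, keep the old file.
        try:
            os.unlink(tmp_path)
        except FileNotFoundError:
            pass
        raise


def read_json(path):
    """Read back the document stored at `path`."""
    with open(path, encoding="utf-8") as handle:
        return json.load(handle)
```

If the update fails before the rename, the original file is untouched, which mirrors the requirement that all data remain in their initial state.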