Team grand-large


Section: New Results

Volatility and Reliability Processing

Fault-Injection and Dependability Benchmarking for Grid Computing Middleware [48], [29], [47], [42]

In a network of several thousand computers, faults are unavoidable. Being able to test the behavior of a distributed program in an environment where faults (such as the crash of a process) can be controlled is therefore an important step in deploying reliable programs.

We developed FAIL-FCI [29] (for Fault Injection Language and FAIL Cluster Implementation, respectively), a software tool that makes it possible to express complex fault scenarios in a simple way, while relieving the user from writing low-level code. In particular, we show that not only can we apply a fault load to existing distributed applications (as done in most current papers that address fault-tolerance issues), but we can also inject qualitative faults, i.e. specific faults at specific points in the execution of the application under test. Finally, although this was not the primary purpose of the tool, we can also inject specific workload patterns in order to stress-test the application under test. The whole process is driven by a simple unified description language that is fully independent of the language of the application, so that no code change or recompilation is needed on the application side.

We also investigated the possibility of injecting software faults into distributed Java applications. Our scheme extends the FAIL-FCI software [47] and does not require any modification of the source code of the application under test, while retaining the ability to write high-level fault scenarios. As a proof of concept, we used our tool to test FreePastry, an existing Java implementation of a Distributed Hash Table (DHT), against node failures.
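The idea of keeping the fault scenario separate from the application under test can be illustrated with a toy simulation. The sketch below is not FAIL syntax and does not use the FAIL-FCI API; the names `Rule`, `worker` and `run_with_faults` are hypothetical, and the "processes" are simple Python generators rather than real distributed processes. The scenario is a list of rules of the form "crash process X once event E has been observed N times", applied by an external scheduler so that the application code itself is left unchanged.

```python
from dataclasses import dataclass
from typing import Dict, Iterator, List


@dataclass
class Rule:
    """Hypothetical scenario rule: crash `target` once `trigger` was seen `count` times."""
    trigger: str
    count: int
    target: str


def worker(name: str, steps: int) -> Iterator[str]:
    """Toy application process: it only emits events and knows nothing about faults."""
    for _ in range(steps):
        yield f"{name}:step"
    yield f"{name}:done"


def run_with_faults(procs: Dict[str, Iterator[str]], rules: List[Rule]) -> List[str]:
    """Round-robin scheduler that applies the crash rules from outside the application."""
    seen: Dict[str, int] = {}
    log: List[str] = []
    alive = dict(procs)
    while alive:
        for name in list(alive):
            if name not in alive:  # crashed earlier in this round
                continue
            try:
                event = next(alive[name])
            except StopIteration:
                del alive[name]
                continue
            log.append(event)
            seen[event] = seen.get(event, 0) + 1
            for rule in rules:
                if (rule.trigger == event and seen[event] == rule.count
                        and rule.target in alive):
                    alive[rule.target].close()  # simulated process crash
                    del alive[rule.target]
                    log.append(f"{rule.target}:crashed")
    return log


if __name__ == "__main__":
    scenario = [Rule(trigger="a:step", count=2, target="b")]
    print(run_with_faults({"a": worker("a", 3), "b": worker("b", 3)}, scenario))
```

Changing the scenario list changes which fault is injected and when, without touching `worker`.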

In the context of the CoreGRID Network of Excellence, we presented in [42] an overview of the state of the art, followed by a presentation of the FAIL-FCI system from INRIA, a tool for fault injection in large distributed systems. We then presented DBGS, a Dependability Benchmark for Grid Services, together with some experimental results.

Self-stabilization [46], [22], [14]

We generalized in [46] the classic dining philosophers problem to allow critical section entry conflicts between non-neighbor processes. We described a deterministic self-stabilizing solution to the new problem. We extended our solution to handle a similarly generalized drinking philosophers problem. As another extension, we described the variant that has finite failure locality. This extension allows our algorithm to tolerate process crashes.

We presented in [22] a generic distributed algorithm for solving silent tasks, such as shortest-path computation, depth-first-search tree construction, and best reliable transmitters, in directed networks where communication may be only unidirectional. Our solution is written for the asynchronous message-passing communication model, and tolerates multiple kinds of failures (transient and intermittent). First, our algorithm is self-stabilizing, so that it recovers correct behavior in finite time starting from an arbitrary global state caused by a transient fault. Second, it tolerates fair message loss, finite message duplication, and arbitrary message reordering, during both the stabilizing phase and the stabilized phase. This second property is of particular interest since, in the context of unidirectional networks, there exists no self-stabilizing reliable data-link protocol. A formal proof establishes the correctness of the algorithm for the considered problem, and subsumes previous proofs for solutions in the simpler reliable shared-memory communication model.
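To make the combination of self-stabilization and message loss more concrete, the following sketch simulates a shortest-path computation on a small directed network. It is a simplified, round-based illustration and not the algorithm of [22]: the function name `stabilize`, the `cache` of last received values, and the deterministic loss pattern (a message on link number `i` is delivered only in rounds where `(round + i) % period == 0`) are all assumptions chosen to keep the example reproducible. Every link still delivers infinitely often, which is the essence of fair loss.

```python
from typing import Dict, List, Tuple

Edge = Tuple[str, str, int]  # (sender, receiver, weight); links are one-way


def stabilize(edges: List[Edge], root: str, dist: Dict[str, int],
              cache: Dict[Tuple[str, str], int], rounds: int,
              period: int = 3) -> Dict[str, int]:
    """Run `rounds` rounds from an arbitrary (possibly corrupted) state.

    `dist` holds each node's current estimate; `cache` holds, for each link,
    the last value received on it. Both may start with arbitrary contents.
    Every non-root node is assumed to have at least one incoming link.
    """
    dist = dict(dist)
    cache = dict(cache)
    for rnd in range(rounds):
        # Each node recomputes its estimate from what it last heard on its in-links.
        for v in dist:
            if v == root:
                dist[v] = 0
            else:
                dist[v] = min(cache[(u, x)] + w for (u, x, w) in edges if x == v)
        # Each node sends its estimate on every out-link; only some messages arrive.
        for i, (u, v, _w) in enumerate(edges):
            if (rnd + i) % period == 0:
                cache[(u, v)] = dist[u]
    return dist


if __name__ == "__main__":
    edges = [("r", "a", 1), ("a", "b", 1), ("r", "b", 5), ("b", "c", 1), ("c", "a", 1)]
    corrupted_dist = {"r": 9, "a": 0, "b": 0, "c": 0}
    corrupted_cache = {("r", "a"): 4, ("a", "b"): 0, ("r", "b"): 0,
                       ("b", "c"): 0, ("c", "a"): 0}
    print(stabilize(edges, "r", corrupted_dist, corrupted_cache, rounds=30))
```

In this toy setting, the estimates settle on the true distances from `r` despite the corrupted start and the lost messages, and further rounds leave them unchanged, which matches the "silent" behavior described above.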

We reported in [14] the first self-stabilizing Border Gateway Protocol (BGP). BGP is the standard inter-domain routing protocol in the Internet. Self-stabilization is a technique to tolerate arbitrary transient faults. Routing instability in the Internet can result from errors in the configuration of routing data structures or routing policies, transient physical and data-link problems, software bugs, and memory corruption. This instability can increase network latency, slow down the convergence of the routing data structures, and even cause network partitioning. Most previous studies concentrated on routing policies to achieve the convergence of BGP, while oscillations due to transient faults were ignored. The purpose of self-stabilizing BGP is to solve the routing instability problem when this instability results from transient failures. The self-stabilizing BGP presented here provides a way to detect and automatically recover from this type of fault. Our protocol is combined with an existing protocol to make it resilient to policy conflicts as well.

Byzantine Tolerance [52], [32]

We presented in [52] Byzantine-robust solutions to the topology discovery problem. Our programs allow each process to learn the complete topology of the network (up to the neighborhoods of the faulty nodes). They tolerate up to a fixed number of faults. The network topology is arbitrary. The processes know neither the diameter nor the size of the network. The execution model is asynchronous. The processes do not use cryptographic primitives such as digital signatures.

Self-stabilizing protocols can tolerate any type and any number of transient faults. However, in general, self-stabilizing protocols provide no guarantee about their behavior against permanent faults. We propose in [32] a self-stabilizing link-coloring protocol resilient to (permanent) Byzantine faults in arbitrary networks. The protocol assumes the central daemon, and uses $2\Delta - 1$ colors, where $\Delta$ is the maximum degree in the network. This protocol guarantees that any link $(u, v)$ between nonfaulty processes $u$ and $v$ is assigned a color within $2\Delta + 2$ rounds and that its color remains unchanged thereafter. Our protocol is Byzantine-insensitive in the sense that the subsystem of correct processes remains operating properly in spite of unbounded Byzantine faults.
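The size of the palette can be motivated by a simple counting argument: a link between two processes shares an endpoint with at most $(\Delta - 1) + (\Delta - 1) = 2\Delta - 2$ other links, so with $2\Delta - 1$ colors at least one color is always free. The sketch below illustrates only this counting argument with a sequential greedy coloring; it is not the self-stabilizing, Byzantine-resilient protocol of [32], and the function name `greedy_link_coloring` is hypothetical.

```python
from typing import Dict, List, Set, Tuple

Link = Tuple[str, str]


def greedy_link_coloring(links: List[Link]) -> Dict[Link, int]:
    """Color each link with the smallest color unused at both of its endpoints.

    Assumes a non-empty list of distinct links. Uses colors 0 .. 2*delta - 2,
    i.e. a palette of 2*delta - 1 colors, where delta is the maximum degree.
    """
    degree: Dict[str, int] = {}
    for u, v in links:
        degree[u] = degree.get(u, 0) + 1
        degree[v] = degree.get(v, 0) + 1
    delta = max(degree.values())
    palette = range(2 * delta - 1)

    used: Dict[str, Set[int]] = {p: set() for p in degree}
    coloring: Dict[Link, int] = {}
    for u, v in links:
        # At most 2*delta - 2 colors are taken around (u, v), so one is free.
        color = next(c for c in palette if c not in used[u] and c not in used[v])
        coloring[(u, v)] = color
        used[u].add(color)
        used[v].add(color)
    return coloring


if __name__ == "__main__":
    links = [("a", "b"), ("a", "c"), ("a", "d"), ("b", "c"), ("c", "d")]
    print(greedy_link_coloring(links))
```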

Sensor Networks [33], [51]

In large-scale multihop wireless networks, flat architectures are not scalable. To overcome this major drawback, clusterization is introduced to support self-organization and to enable hierarchical routing. In multihop wireless networks, robustness is a main issue because of the dynamicity of such networks. Several algorithms have been designed for the clusterization process but, as far as we know, very few studies check the robustness of their clusterization protocols. In [33], we show that a clusterization algorithm that appears to have good robustness properties is self-stabilizing. We propose several enhancements to reduce the stabilization time and to improve stability. The use of a Directed Acyclic Graph ensures that the self-stabilizing properties always hold regardless of the underlying topology. These additional criteria are evaluated by simulation.

We presented a complexity analysis for a family of self-stabilizing vertex coloring algorithms in the context of sensor networks. First, we derived theoretical results on the stabilization time when the system is synchronous. Then, we provided simulations for various schedulings and topologies. We considered both the uniform case (where all nodes are indistinguishable and execute the same code) and the non-uniform case (where nodes make use of a globally unique identifier). Overall, our results show that the actual stabilization time is much smaller than the upper bound provided by previous studies. Similarly, the height of the induced DAG is much lower than the previously announced bound, which grows linearly with the size of the color domain. Finally, it appears that the symmetry-breaking tricks traditionally used to expedite stabilization are in fact harmful when used in networks that are not tightly synchronized.
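As a rough illustration of what measuring stabilization time means for vertex coloring, the sketch below simulates one simple rule under a central scheduler (a single node moves at a time): a node whose color clashes with a neighbor picks the smallest color of a palette of $\Delta + 1$ colors that no neighbor uses. This is an assumed textbook-style rule, not necessarily a member of the family of algorithms analyzed above, and the function name `stabilize_coloring` and the deterministic scheduling order are hypothetical choices.

```python
from typing import Dict, List, Tuple


def stabilize_coloring(adj: Dict[str, List[str]],
                       color: Dict[str, int]) -> Tuple[Dict[str, int], int]:
    """Return a conflict-free coloring and the number of moves needed to reach it.

    `adj` is a symmetric adjacency list; `color` is an arbitrary initial state.
    Each move removes every conflict at the moving node and creates none, so
    the number of moves is at most the number of links.
    """
    color = dict(color)
    delta = max(len(neighbors) for neighbors in adj.values())
    moves = 0
    while True:
        conflicted = [v for v in sorted(adj)
                      if any(color[v] == color[u] for u in adj[v])]
        if not conflicted:
            return color, moves
        v = conflicted[0]  # central scheduler: first conflicted node moves
        taken = {color[u] for u in adj[v]}
        # At most delta colors are taken, so a palette of delta + 1 always has a free one.
        color[v] = min(c for c in range(delta + 1) if c not in taken)
        moves += 1


if __name__ == "__main__":
    ring = {"a": ["b", "d"], "b": ["a", "c"], "c": ["b", "d"], "d": ["c", "a"]}
    print(stabilize_coloring(ring, {"a": 0, "b": 0, "c": 0, "d": 0}))
```

Replacing the deterministic choice of `conflicted[0]` by other scheduling policies is one way to explore, on small examples, how the scheduler affects the number of moves.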

Space lower bounds for graph exploration [25]

We consider the task of exploring graphs with anonymous nodes by a team of non-cooperative robots modeled as finite automata. These robots have no a priori knowledge of the topology of the graph, or of its size. Each edge has to be traversed by at least one robot. We first show that, for any set of $q$ non-cooperative $K$-state robots, there exists a graph of size $O(qK)$ that no robot of this set can explore. This improves the $O(K^{O(q)})$ bound by Rollik (1980). Our main result is an application of this improvement. It concerns exploration with stop, in which one robot has to explore and stop after completing exploration. For this task, the robot is provided with a pebble that it can use to mark nodes. We prove that exploration with stop requires $\Omega(\log n)$ bits for the family of graphs with at most $n$ nodes. On the other hand, we prove that there exists an exploration with stop algorithm using a robot with $O(D \log \Delta)$ bits of memory to explore all graphs of diameter at most $D$ and degree at most $\Delta$.

