CEPAGE is an INRIA ProjectTeam joint with University of Bordeaux and CNRS (LaBRI, UMR 5800).
The development of interconnection networks has led to the emergence of new types of computing platforms. These platforms are characterized by heterogeneity of both processing and communication resources, geographical dispersion, and instability in terms of the number and performance of participating resources. These characteristics restrict the nature of the applications that can perform well on these platforms. Due to middleware and application deployment times, applications must be long-running and involve large amounts of data; also, only loosely-coupled applications may currently be executed on unstable platforms.
The new algorithmic challenges associated with these platforms have been approached from two different directions. On the one hand, the parallel algorithms community has largely concentrated on the problems associated with heterogeneity and large amounts of data. On the other hand, the distributed systems community has focused on scalability and fault-tolerance issues. The success of file sharing applications demonstrates the capacity of the resulting algorithms to manage huge volumes of data and users on large unstable platforms. Algorithms developed within this context are completely distributed and based on peer-to-peer (P2P for short) communication.
The goal of our project is to establish a link between these two directions, by gathering researchers from the distributed algorithms and data structures, parallel algorithms, and randomized algorithms communities. More precisely, the objective of our project is to extend the field of applications that can be executed on large scale distributed platforms. Indeed, whereas protocols designed for P2P file exchange are actually distributed, computationally intensive applications executed on large scale platforms (BOINC
Projects must meet three basic technological requirements to ensure they benefit from grid computing:
Projects should require millions of CPU hours of computation to proceed. However, humanitarian projects with smaller CPU-hour requirements may also apply.
The computer software algorithms required to accomplish the computations should be such that they can be subdivided into many smaller independent computations.
If very large amounts of data are required, there should also be a way to partition the data into sufficiently small units corresponding to the computations.
Given these constraints, applications using large data sets should be such that they can be arbitrarily split into small pieces of data (such as Seti@home
These constraints are related both to security and to algorithmic issues. Security is of course an important issue, since executing non-certified code on non-certified data on a large scale, open, distributed platform is clearly unacceptable. Nevertheless, we believe that external techniques, such as sandboxing and certification of data and code through hash-code mechanisms, should be used to solve these problems. Therefore, the focus of our project is on algorithmic issues and, in what follows, we assume a cooperative environment of well-intentioned users, and we assume that security and cooperation can be enforced by external mechanisms. Our goal is to demonstrate that gains in performance and the extension of the application field justify these extra costs, but that, just as operating systems do for multi-user environments, security and cooperation issues should neither affect the design of efficient algorithms nor reduce the application field.
Firstly, we aim at building strong foundations both for distributed algorithms (graph exploration, black-hole search, ...) and for distributed data structures (routing, efficient querying, compact labeling, ...) to understand how to explore large scale networks in the presence of failures and how to disseminate data so as to answer specific queries quickly. Secondly, we aim at building simple (based on local estimations, without centralized knowledge) and realistic models to represent resource performance accurately and to build a realistic view of the topology of the network (based on network coordinates, geometric spanners, hyperbolic spaces). Then, we aim at proving that these models are tractable by providing low complexity distributed and randomized approximation algorithms for a set of basic scheduling problems (independent tasks scheduling, broadcasting, data dissemination, ...) and the associated overlay networks. Finally, our goal is to prove the validity of our approach through software dedicated to several applications (molecular dynamics simulations, continuous integration) as well as more general tools related to the model we propose (AlNEM for automatic topology discovery, SimGrid for simulations at large scale).
We will concentrate on the design of new services for computationally intensive applications, consisting of mostly independent tasks sharing data, with applications to distributed storage, molecular dynamics and distributed continuous integration, which will be described in more detail in Section .
Most of the research (including ours) currently carried out on these topics relies on a centralized knowledge of the whole (topology and performances) execution platform, whereas recent evolutions in computer networks technology yield a tremendous change in the scale of these networks. The solutions designed for scheduling and managing compact data structures must be adapted to these systems, characterized by a high dynamism of their entities (participants can join and leave at will), a potential instability of the large scale networks (on which concurrent applications are running), and the increasing probability of failure.
P2P systems have achieved stability and fault-tolerance, as witnessed by their wide and intensive usage, by changing the view of the networks: all communication occurs on a logical network (fixed even though resources change over time), thus abstracting the actual performance of the underlying physical network. Nevertheless, disconnecting physical and logical networks leads to low performance and a waste of resources. Moreover, due to their original use (file exchange), those systems are well suited to exact search using Distributed Hash Tables (DHTs) and are based on fixed regular virtual topologies (hypercubes, De Bruijn graphs...). In the context of the applications we consider, more complex queries will be required (finding the set of edges used for content distribution, finding a set of replicas covering the whole database) and, in order to reach efficiency, unstructured virtual topologies must be considered.
In this context, the main scientific challenges of our project are:
Models:
At a low level, to understand the underlying physical topology and to obtain models that are both realistic and instantiable. This requires expertise in graph theory (all the members of the project) and platform modelling (Olivier Beaumont, Nicolas Bonichon, Lionel Eyraud and Ralf Klasing). The obtained results will be used to focus the algorithms designed in Sections and .
At a higher level, to derive models of the dynamism of targeted platforms, both in terms of participating resources and resource performances (Olivier Beaumont, Philippe Duchon). Our goal is to derive suitable tools to analyze and prove algorithm performances in dynamic conditions rather than to propose stochastic modeling of evolutions (Section ).
Overlays and distributed algorithms:
To understand how to augment the logical topology in order to achieve the good properties of P2P systems. This requires knowledge of P2P systems and small-world networks (Olivier Beaumont, Nicolas Bonichon, Philippe Duchon, Nicolas Hanusse, Cyril Gavoille). The obtained results will be used for developing the algorithms designed in Sections and .
To build overlays dedicated to specific applications and services that achieve good performance (Olivier Beaumont, Nicolas Bonichon, Philippe Duchon, Lionel Eyraud, Ralf Klasing). The set of applications and services we target will be described in more detail in Sections and .
To understand how to dynamically adapt scheduling algorithms (in particular collective communication schemes) to changes in network performance and topology, using randomized algorithms (Olivier Beaumont, Nicolas Bonichon, Nicolas Hanusse, Philippe Duchon, Ralf Klasing) (Section ).
To study the computational power of mobile agent systems under various assumptions, on a few classical distributed computing problems (exploration, the mapping problem, exploration of the network in spite of harmful hosts). The goal is to enlarge the knowledge on the foundations of mobile agent computing. This will be done by developing new efficient algorithms for mobile agent systems and by proving impossibility results. This will also allow us to compare the different models (David Ilcinkas, Ralf Klasing, Evangelos Bampas) (Section ).
Compact and distributed data structures:
To understand how to dynamically adapt compact data structures to changes in network performance and topology (Nicolas Hanusse, Cyril Gavoille) (Section )
To design sophisticated labeling schemes in order to answer complex predicates using local labels only (Nicolas Hanusse, Cyril Gavoille) (Section )
We will detail in Section how the various areas of expertise in the team will be employed for the considered applications.
We therefore tackle several problems related to two priorities that INRIA identified in its strategic plan (2008–2012): "Modeling, Simulation and Optimization of Complex Dynamic Systems" and "Information, Computation and Communication Everywhere".
The recent evolutions in computer networks technology, as well as their diversification, yield a tremendous change in the use of these networks: applications and systems can now be designed at a much larger scale than before. This scaling evolution concerns the amount of data, the number of computers, the number of users, and the geographical diversity of these users. This race towards large scale computing has two major implications. First, new opportunities are offered to applications, in particular as far as scientific computing, databases, and file sharing are concerned. Second, a large number of parallel or distributed algorithms developed for average size systems cannot be run on large scale systems without a significant degradation of their performance. In fact, one must probably relax the constraints that the system should satisfy in order to run at a larger scale. In particular, the coherence protocols designed for distributed applications are too demanding in terms of both message and time complexity, and must therefore be adapted for running at a larger scale. Moreover, most distributed systems deployed nowadays are characterized by a high dynamism of their entities (participants can join and leave at will), a potential instability of the large scale networks (on which concurrent applications are running), and an increasing individual probability of failure. Therefore, as the size of the system increases, it becomes necessary that it adapts automatically to the changes of its components, requiring self-organization of the system to deal with the arrival and departure of participants, data, or resources.
As a consequence, it becomes crucial to be able to understand and model the behavior of large scale systems, to efficiently exploit these infrastructures, in particular w.r.t. designing dedicated algorithms handling a large amount of users and/or data.
In the case of parallel computation solutions, some strategies have been developed in order to cope with the intrinsic difficulty induced by resource heterogeneity. It has been proved that changing the metric (from makespan minimization to throughput maximization) simplifies most scheduling problems, both for collective communications and parallel processing. This restricts the use of target platforms to simple and regular applications, but due to the time needed to develop and deploy applications on large scale distributed platforms, the risk of failures, and the intrinsic dynamism of resources, it is unrealistic to consider tightly coupled applications involving many tight synchronizations. Nevertheless, (1) it is unclear how the current models can be adapted to large scale systems, and (2) the current methodology requires the use of (at least partially) centralized subroutines that cannot be run on large scale systems. In particular, these subroutines assume the ability to gather all the information regarding the network at a single node (topology, resource performance, etc.). This assumption is unrealistic in a general purpose large size platform, in which the nodes are unstable, and whose resource characteristics can vary abruptly over time. Moreover, the solutions proposed for small to average size, stable, and dedicated environments do not satisfy the minimal requirements for self-organization and fault-tolerance, two properties that are unavoidable in a large scale context. Therefore, there is a strong need to design efficient and decentralized algorithms. In particular, this requires defining new metrics adapted to large scale dynamic platforms in order to analyze the performance of the proposed algorithms.
As already noted, P2P file sharing applications have been successfully deployed on large scale dynamic platforms. Nevertheless, since our goal is the design of efficient algorithms in terms of actual performance and resource consumption, we need to concentrate on specific P2P environments. Indeed, P2P protocols are mostly designed for file sharing applications, and are not optimized for scientific applications, nor are they adapted to sophisticated database applications. This is mainly due to the original goals of file sharing applications, where anonymity is crucial, only exact queries are used, and all large file communications are made at the IP level.
Unfortunately, the context strongly differs for the applications we consider in our project, and some of the constraints appear to be in contradiction with performance and resource consumption optimization. For instance, in these systems, due to anonymity, the number of neighboring nodes in the overlay network (i.e. the number of IP addresses known to each peer) is kept relatively low, much lower than what the memory constraints on the nodes actually impose. Such a constraint induces longer routes between peers, and is therefore in contradiction with performance. In those systems, with the main exception of the LAND overlay, the overlay network (induced by the connections of each peer) is kept as far as possible separate from the underlying physical network. This property is essential in order to cope with malicious attacks, i.e. to ensure that even if a geographic site is attacked and disconnected from the rest of the network, the overall network will remain connected. Again, since actual communications occur between peers connected in the overlay network, communications between two close nodes (in the physical network) may well involve many wide area messages, and therefore such a constraint is in contradiction with performance optimization. Fortunately, in the case of file sharing applications, only queries are transmitted using the overlay network, and the communication of large files is made at IP level. On the other hand, in the case of more complex communication schemes, such as broadcast or multicast, the communication of large files is done using the overlay network, due to the lack of support, at IP level, for those complex operations. In this case, in order to achieve good results, it is crucial that virtual and physical topologies be as close as possible.
Our aim is to target large scale platforms. From parallel processing, we keep the idea that resource heterogeneity dramatically complicates scheduling problems, which forces us to restrict ourselves to simple applications. The dynamism of both the topology and the performance reinforces this constraint. We will also adopt the throughput maximization objective, though it needs to be adapted to more dynamic platforms and resources.
From previous work on P2P systems, we keep the idea that there is no centralized large server and that all participating nodes play a symmetric role (according to their performance in terms of memory, processing power, incoming and outgoing bandwidths, etc.), which imposes the design of self-adapting protocols, where any kind of central control should be avoided as much as possible.
Since dynamism constitutes the main difficulty in the design of algorithms on large scale dynamic platforms, we will consider several levels of dynamism:
Stable: In order to establish the complexity induced by dynamism, we will first consider fully heterogeneous (in terms of both processing and communication resources) but fully stable platforms (where both topology and performance are constant over time).
Semi-stable: In order to establish the complexity induced by fault-tolerance, we will then consider fully heterogeneous platforms where resource performance varies over time, but topology is fixed.
Unstable: Finally, we will target systems facing the arrival and departure of participants, data, or resources.
The year 2009 brought particular achievements for some members of CEPAGE:
Cyril Gavoille was named a member of the IUF (Institut Universitaire de France);
Philippe Duchon was recruited as a professor at Bordeaux 1 University;
Ralf Klasing and Nicolas Hanusse defended their HDR (Habilitation à Diriger des Recherches).
Modeling the platform dynamics in a satisfying manner, in order to design and analyze efficient algorithms, is a major challenge. In a semi-stable platform, the performance of individual nodes (be they computing or communication resources) will fluctuate; in a fully dynamic platform, which is our ultimate target, the set of available nodes will also change over time, and algorithms must take these changes into account if they are to be efficient.
There are basically two ways one can model such evolution: one can use a stochastic process, or some kind of adversary model.
In a stochastic model, the platform evolution is governed by some specific probability distribution. One obvious advantage of such a model is that it can be simulated and, in many well-studied cases, analyzed in detail. The two main disadvantages are that it can be hard to determine how much of the resulting algorithm performance comes from the specifics of the evolution process, and that estimating how realistic a given model is remains difficult, as none of the current project participants are metrology experts.
In an adversary model, it is assumed that these unpredictable changes are under the control of an adversary whose goal is to interfere with the algorithm's efficiency. Major assumptions on the system's behavior can be included in the form of restrictions on what this adversary can do (such as maintaining a given level of connectivity). Such models are typically more general than stochastic models, in that many stochastic models can be seen as a probabilistic specialization of a non-deterministic model (at least for bounded time intervals, and up to negligible probabilities of adopting "forbidden" behaviors).
Since we aim at proving guaranteed performance for our algorithms, we want to concentrate on suitably restricted adversary models. The main challenge in this direction is thus to describe sets of restricted behaviors that both capture realistic situations and make it possible to prove such guarantees.
On the other hand, in order to establish complexity and approximation results, we also need to rely on a precise theoretical model of the targeted platforms.
At a lower level, several models have been proposed to describe interference between several simultaneous communications. In the 1-port model, a node cannot simultaneously send to (and/or receive from) more than one node. Most of the “steady state” scheduling results have been obtained using this model. On the other hand, some authors propose to model incoming and outgoing communications from a node using fictitious incoming and outgoing links, whose bandwidths are fixed. The main advantage of this model, although it might be slightly less accurate, is that it does not require strong synchronization and that many scheduling problems can be expressed as multicommodity flow problems, for which efficient decentralized algorithms are known. Another important issue is to model the bandwidth actually allocated to each communication when several communications compete for the same long-distance link.
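To illustrate the 1-port constraint, consider broadcasting a message: since every informed node may send to at most one other node per round, the informed set can at best double at each round. The following short simulation (purely illustrative Python, not part of any project software; the function name is ours) makes this explicit:

```python
import math

def one_port_broadcast_rounds(n):
    """Simulate a binomial-tree broadcast under the 1-port model:
    in each round, every informed node sends to at most one
    uninformed node, so the informed set at most doubles."""
    informed = 1
    rounds = 0
    while informed < n:
        informed = min(n, 2 * informed)
        rounds += 1
    return rounds
```

The simulated round count matches the well-known closed form, the ceiling of log2(n), which is optimal for broadcast in this model.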
At a higher level, proving good approximation ratios on general graphs may be too difficult, and it has been observed that actual platforms often exhibit a simple structure. For instance, many real life networks satisfy small-world properties, and it has been proved, for instance, that greedy routing protocols on small-world networks achieve good performance. It is therefore of interest to prove that logical (given by the interactions between hosts) and physical platforms (given by the network links) exhibit some structure, in order to derive efficient algorithms.
In order to analyze the performance of the proposed algorithms, we first need to define a metric adapted to the targeted platform. In particular, since resource performance and topology may change over time, the metric should also be defined from the optimal performance of the platform at any time step. For instance, if throughput maximization is concerned, the objective is to provide, for the proposed algorithm, an approximation ratio with respect to the optimal throughput over the whole simulation time, or at least with respect to min_{t ∈ SimulationTime} OptThroughput(t).
For instance, Awerbuch and Leighton developed a very nice distributed algorithm for computing multicommodity flows. Their algorithm consists in associating queues and potentials with each commodity at each node, for all incoming or outgoing edges. These queues store the flow that has not reached its destination yet. Using a very simple and very natural framework, flow goes from high potential areas (the sources) to low potential areas (the sinks). This algorithm is fully decentralized, since nodes make their decisions depending on their own state (the size of their queues), the state of their neighbors (the size of their queues), and the capacity of neighboring links.
The remarkable property of this algorithm is that if, at any time step, the network is able to ship (1 + ε)·d_i flow units for each commodity i, then the algorithm ships at least d_i units of flow per commodity at steady state. The proof of this property is based on the overall potential of all the queues in the network, which remains bounded over time.
It is worth noting that this algorithm is quasi-optimal for the metric we defined above, since the overall throughput can be made arbitrarily close to min_{t ∈ SimulationTime} OptThroughput(t).
In this context, the approximation result is given under an adversary model, where the adversary can change both the topology and the performance of communication resources between any two steps, provided that the network remains able to ship (1 + ε)·d_i flow units for each commodity i.
Most scheduling problems are NP-complete, and inapproximability results exist in online settings, especially when resources are heterogeneous. Therefore, we need to rely on simplified communication models (see next section) to prove theoretical results. In this context, resource augmentation techniques are very useful. They consist in identifying a weak parameter (a parameter whose value can be slightly increased without breaking any strong modeling constraint) and then comparing the solution produced by a polynomial time algorithm (with this relaxed constraint) with the optimal solution of the NP-complete problem (without resource augmentation). This technique is both pertinent in a difficult setting and useful in practice.
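Resource augmentation can be made concrete on independent-task scheduling (a small illustrative sketch under our own assumptions, here with machine speed as the augmented parameter): greedy list scheduling may exceed the optimal makespan, but the same greedy algorithm run on machines twice as fast is guaranteed to beat the unaugmented optimum.

```python
from itertools import product

def greedy_makespan(jobs, machines, speed=1.0):
    """List scheduling: assign each job to the currently least-loaded
    machine; all machines run at the given speed."""
    loads = [0.0] * machines
    for j in jobs:
        i = loads.index(min(loads))
        loads[i] += j / speed
    return max(loads)

def optimal_makespan(jobs, machines):
    """Brute-force optimal assignment (exponential; tiny instances only)."""
    best = float('inf')
    for assign in product(range(machines), repeat=len(jobs)):
        loads = [0.0] * machines
        for j, m in zip(jobs, assign):
            loads[m] += j
        best = min(best, max(loads))
    return best
```

For instance, with jobs of sizes 3, 3, 2, 2, 2 on two machines, greedy in that order yields makespan 7 against an optimum of 6, while greedy with speed-2 machines finishes well before the unaugmented optimum.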
In the context of large scale dynamic platforms, it is unrealistic to determine precisely the actual topology and the contention of the underlying network at application level. Indeed, existing tools such as AlNEM are very much based on quasi-exhaustive determination of interferences, and it takes several days to determine the actual topology of a platform made up of a few tens of nodes. Given the dynamism of the platforms we target, we need to rely on less sophisticated models, whose parameters can be evaluated at runtime.
Therefore, we propose to model each node by an incoming and an outgoing bandwidth and to neglect interference that appears at the heart of the network (Internet), in order to concentrate on local constraints. We are currently implementing a script, based on Iperf, to determine the achieved bitrates for one-to-one, one-to-many and many-to-one transfers, given the number of TCP connections and the maximal size of the TCP windows. The next step will be to build a communication protocol that enforces a prescribed sharing of the network resources. In particular, if in the optimal solution a node P_0 must send data at rate x_i^out to node P_i and receive data at rate y_j^in from node P_j, the goal is to achieve the prescribed bitrates, provided that all capacity constraints are satisfied at each node. Our aim is to implement, using Java RMI, a protocol able both to evaluate the parameters of our model (incoming and outgoing bandwidths) and to ensure a prescribed sharing of communication resources.
Under this communication model, it is possible to obtain pathological results. For instance, if we consider a master-slave setting (corresponding to the distribution of independent tasks on a volunteer computing platform such as BOINC), the number of slaves connected to the master may be unbounded. In fact, simultaneously opening a large number of TCP connections may lead to a bad sharing of communication resources. Therefore, we propose to add a bound on the number of connections that can be handled simultaneously by a given node. Estimating this bound is an important issue in obtaining realistic communication models.
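Under this bounded-connection variant of the incoming/outgoing-bandwidth model, the best aggregate rate a master can achieve follows a simple rule, sketched below (an illustrative computation under our modeling assumptions, not project code): serve the slaves with the largest incoming bandwidths, up to the connection bound, capped by the master's outgoing bandwidth.

```python
def best_aggregate_rate(master_out, slave_in_rates, max_connections):
    """Maximum aggregate rate from a master under the bounded
    multiport model: at most `max_connections` simultaneous
    connections, each limited by its slave's incoming bandwidth,
    and their sum limited by the master's outgoing bandwidth."""
    chosen = sorted(slave_in_rates, reverse=True)[:max_connections]
    return min(master_out, sum(chosen))
```

For example, a master with outgoing bandwidth 10 serving slaves of incoming bandwidth 4 each achieves rate 8 with two connections, but saturates its own outgoing link at 10 once four connections are allowed.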
Once low level modeling has been obtained, it is crucial to be able to test the proposed algorithms. To do this, we will first rely on simulation rather than direct experimentation. Indeed, in order to be able to compare heuristics, it is necessary to execute those heuristics on the same platform. In particular, all changes in the topology or in the resource performance should occur at the same time during the execution of the different heuristics. In order to be able to replicate the same scenario several times, we need to rely on simulations. Moreover, the metric we have tentatively defined for providing approximation results in the case of dynamic platforms requires computing the optimal solution at each time step, which can be done offline if all traces for the different resources are stored. Using simulation rather than experiments can be justified if the simulator itself has been proved valid. Moreover, the modeling of communications, processing, and their interactions may be much more complex in the simulator than in the model used to provide a theoretical approximation ratio. In particular, sophisticated TCP models for bandwidth sharing have been implemented in SimGrid.
At a higher level, the derivation of realistic models for large scale platforms is out of the scope of our project. Therefore, in order to obtain traces and models, we will collaborate with the MESCAL, GANG and ASAP projects. We already worked on these topics with the members of GANG in the ACI Pair-à-Pair (the ACI Pair-à-Pair finished in 2006, but the ANR Aladdin Programme Blanc acts as a follow-up, with the members of the GANG and CEPAGE projects). On the other hand, we also need to rely on an efficient simulator in order to test our algorithms. We have not yet chosen the discrete event simulator we will use for simulations. One attractive possibility would be to adapt SimGrid, developed in the MESCAL project, to large scale dynamic environments. Indeed, a parallel version of SimGrid, based on activations, is currently under development in the framework of the USS-SimGrid ANR Arpège project (with the MESCAL, ALGORILLE and ASAP teams). This version will be able to deal with platforms containing more than 10^5 resources. SimGrid has been developed by Henri Casanova (U.C. San Diego) and Arnaud Legrand during his PhD (under the co-supervision of O. Beaumont).
Finally, we propose several applications that will be described in detail in Section . These applications cover a large set of fields (molecular dynamics, distributed storage, continuous integration, distributed databases...). All these applications will be developed and tested with an academic or industrial partner. In all these collaborations, our goal is to prove that the services that we propose in Section can be integrated as steering tools in already developed software. Our goal is to assert the practical interest of the services we develop and then to integrate and to distribute them as a library for large scale computing.
In order to test our algorithms, we propose to implement these services using Java RMI. The main advantages of Java RMI in our context are its ease of use and portability. Multithreading is also a crucial feature for scheduling concurrent communications, and it does not interfere with the ad-hoc routing protocols developed in the project.
A prototype has already been developed in the project as a steering tool for molecular dynamics simulations (see Section ). All the applications will first be tested on small scale platforms (using desktop workstations in the laboratory). Then, in order to test their scalability, we propose to implement them either on the GRID 5000 platform or on the partners' platforms.
The optimization schemes for content distribution processes or for handling standard queries require a good knowledge of the physical topology or performance (latencies, throughput, ...) of the network. Assuming that some rough estimate of the physical topology is given, former theoretical results described in Section show how to preprocess the network so that local computations are performed efficiently. Due to the dynamism of large distributed platforms, some requirements on the coding of local data structures and the updating mechanism are needed. This last process is done through the maintenance of light virtual networks, so-called overlay networks (see Section ). In our approach, we focus on:
Compression.
The emergence of huge distributed networks does not allow the topology of the network to be totally known to each node without any compression scheme. There are at least two reasons for this:
In order to guarantee that local computations are done efficiently, that is, avoiding external memory requests, it may be of interest that the coding of the underlying topology can be stored within fast memory space.
The dynamism of the network implies many basic message communications to update the knowledge of each node. The smaller the message size is, the better the performance.
The compression of any topology description should not lead to an extra cost for standard requests: distance between nodes, adjacency tests, ... Roughly speaking, a decoding process should not be necessary.
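A classical example of such a compression scheme meeting the no-decoding requirement is interval-based ancestry labeling for trees (a textbook sketch given here for illustration; the identifiers are ours): each node receives a pair of DFS timestamps, and ancestry between any two nodes can then be tested from the two labels alone, without touching the topology.

```python
def interval_labels(tree, root):
    """Assign each node a label (pre, post) from a DFS traversal.
    `tree` maps a node to its list of children. Ancestry can then
    be answered from the two labels alone, with no decoding and
    no access to the topology."""
    labels = {}
    counter = [0]
    def dfs(u):
        labels[u] = [counter[0], None]
        counter[0] += 1
        for v in tree.get(u, []):
            dfs(v)
        labels[u][1] = counter[0]
        counter[0] += 1
    dfs(root)
    return {u: tuple(l) for u, l in labels.items()}

def is_ancestor(label_u, label_v):
    """True iff u is an ancestor of v (or u == v): v's DFS interval
    is nested inside u's."""
    return label_u[0] <= label_v[0] and label_v[1] <= label_u[1]
```

Each label uses O(log n) bits, and the query is a constant-time comparison, exactly the kind of standard request that should incur no extra decoding cost.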
Routing tables.
Routing queries and broadcasting information on large scale platforms are tasks involving many basic message communications. The maximum performance objective imposes that basic messages are routed along paths of cost as low as possible. On the other hand, local routing decisions must be fast and the algorithms and data structures involved must support a certain amount of dynamism in the platform.
Local computations.
Although the size of the data structures is less constrained than in P2P systems (where it is limited for security reasons), even in our collaborative framework it is unrealistic for each node to manage a complete view of the platform with the full resource characteristics. Thus, a node has to manage data structures concerning only a fraction of the whole system. In fact, a partial view of the network will be sufficient for many tasks: for instance, in order to compute the distance between two nodes (distance labeling).
Overlay and small world networks.
The processes we consider can be highly dynamic. The preprocessing usually assumed takes polynomial time. Hence, when a new process arrives, it must be dealt with in an online fashion, i.e., we do not want to recompute everything, and the (partial) recomputation has to be simple.
In order to meet these requirements, overlay networks are normally implemented. These are light virtual networks, i.e., they are sparse, and a local change of the physical network only leads to a small change of the corresponding virtual network. As a result, small address books are sufficient at each node.
A specific class of overlay networks are small-world networks. These are efficient overlay networks for (greedy) routing tasks, assuming that distance requests can be performed easily.
Mobile Agent Computing.
Mobile Agent Computing has been proposed as a powerful paradigm to study distributed systems. Our purpose is to study the computational power of mobile agent systems under various assumptions. Indeed, many models exist, but little is known about their computational power. One major parameter describing a mobile agent model is the ability of the agents to interact.
The most natural mobile agent computing problem is the exploration or mapping problem, in which one or several mobile agents have to explore or map their environment. The rendezvous problem consists, for two agents, in meeting at some unspecified node of the network. Two other fundamental problems deal with security, which is often the main concern of actual mobile agent systems. The first one consists in exploring the network in spite of harmful hosts that destroy incoming agents. An additional goal in this context is to locate the harmful host(s) to prevent further agent losses. We already mentioned the second problem related to security, which consists, for the agents, in capturing an intruder.
The goal is to enlarge the knowledge on the foundations of mobile agent computing. This will be done by developing new efficient algorithms for mobile agent systems and by proving impossibility results. This will also allow us to compare the different models.
Of course, the main difficulty is to adapt the maintenance of local data structures to the dynamism of the network.
As mentioned in Section , solutions provided by the parallel algorithms community are dedicated to stable platforms whose resource performances can be gathered at a single node that is responsible for computing the optimal solution. On the other hand, P2P systems are fully distributed, but the set of queries available in these systems is much too poor for computationally intensive applications. Therefore, current solutions for large scale distributed platforms such as BOINC
Requests and Task scheduling on large scale platforms;
New services for processing on large scale platforms.
Another interesting scheduling problem is the case of applications sharing (large) files stored in replicated distributed databases. We deal here with a particular instance of the scheduling problem mentioned in Section . This instance involves applications that require the manipulation of large files, which are initially distributed across the platform.
It may well be the case that some files are replicated. In the target application, all tasks depend upon the whole set of files. The target platform is composed of many distant nodes, with different computing capabilities, and which are linked through an overlay network (to be built). To each node is associated a (local) data repository. Initially, the files are stored in one or several of these repositories. We assume that a file may be duplicated, and thus simultaneously stored on several data repositories, thereby potentially speeding up the next request to access them. There may be restrictions on the possibility of duplicating the files (typically, each repository is not large enough to hold a copy of all the files). The techniques developed in Section will be used to dynamically maintain efficient data structures for handling files.
Our aim is to design a prototype for both maintaining data structures and distributing files and tasks over the network.
This framework occurs for instance in the case of Monte Carlo applications, where the parameters of new simulations depend on the average behavior of the simulations previously performed. The general principle is the following: several simulations (independent tasks) are launched simultaneously with different initial parameters, and then the average behavior of these simulations is computed. Other simulations are then performed with new parameters computed from this average behavior. These parameters are tuned to ensure a much faster convergence of the method. Running such an application on a semi-stable platform is a particular instance of the scheduling problem mentioned in Section .
We will focus on a particular algorithm picked from Molecular Dynamics: the calculation of the Potential of Mean Force (PMF) using the technique of Adaptive Biasing Force (ABF). This work is done in collaboration with Juan Elezgaray, IECB, Bordeaux. Here is a quick presentation of this context. Estimating the time needed for a molecule to go through a cellular membrane is an important issue in biology and medicine. Typically, the diffusion time is far too long to be computed with atomistic molecular simulations (the average time to be simulated is of the order of 1 s and the integration step cannot be chosen larger than 10^{-15} s, due to the nature of physical interactions). Classical parallel approaches, based on domain decomposition methods, lead to very poor results due to the number of barriers. Another method to estimate this time is to calculate the PMF of the system, which is in this context the average force the molecule is subject to at a given position within or around the membrane. Recently, Darve et al. presented a new method, called ABF, to compute the PMF. The idea is to run a small number of simulations to estimate the PMF, and then add to the system a force that cancels the estimated PMF. With this new force, new simulations are performed, starting from different configurations of the system (distributed over the computing platform) computed during the previous simulations, and so on. Iterating this process, the algorithm converges quite quickly to a good estimate of the PMF, with a uniform sampling along the axis of diffusion. This application has been implemented and integrated into the well-known molecular dynamics software NAMD .
Our aim is to propose a distributed implementation of the ABF method using NAMD. It is worth noting that NAMD is designed to run on high-end parallel platforms or clusters, but not to run efficiently on unstable and distributed platforms. The different problems to be solved in order to design this application are the following:
Since we need to start a simulation from a valid configuration (which can represent several MBytes) with a particular position of the molecule in the membrane, and these configurations are spread among the participating nodes, we need to be able to find and download such a configuration. Therefore, the first task is to find an overlay such that those requests can be handled efficiently. This requires expertise in overlay networks, compact data structures and graph theory. Olivier Beaumont, Nicolas Bonichon, Philippe Duchon, Nicolas Hanusse, Cyril Gavoille and Ralf Klasing will work on this part.
In our context, each participating node may offer some space for storing configurations, some bandwidth, and some computing power to run simulations. The question arising here is how to distribute the simulations to nodes such that the computing power of all nodes is fully used. Since nodes may join and leave the network at any time, redistributions of configurations and tasks between nodes will also be necessary (but all tasks only contribute to updating the PMF, so some tasks may fail without changing the overall result). The techniques designed for content distribution will be used to spread and redistribute the set of configurations over the set of participating nodes. This requires expertise in task scheduling and distributed storage. Olivier Beaumont, Nicolas Bonichon, Philippe Duchon and Lionel Eyraud-Dubois will work on this part.
A prototype of a steering tool for NAMD has been developed in the project; it may be used to validate our approach and has been tested on GRID'5000 with up to 200 processors. This prototype supports the dynamicity of the platform: contributing processors can join and leave. The management of configuration locations is now performed using a distributed hash table, obtained by integrating the Bamboo library into the prototype. We still have to solve numerical instability issues.
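How a distributed hash table locates the node holding a configuration can be sketched with consistent hashing: node identifiers and keys are hashed onto a ring, and a key is owned by the first node clockwise from its hash. This is the general principle behind DHTs; Bamboo itself uses Pastry-style prefix routing, and all names below are made up for illustration.

```python
# Consistent-hashing sketch of DHT key placement.
import hashlib
from bisect import bisect_right

def h(s):
    """160-bit position on the ring."""
    return int(hashlib.sha1(s.encode()).hexdigest(), 16)

class Ring:
    def __init__(self, nodes):
        self.points = sorted((h(name), name) for name in nodes)

    def owner(self, key):
        """First node clockwise from the key's hash."""
        i = bisect_right(self.points, (h(key), ""))
        return self.points[i % len(self.points)][1]  # wrap past the last node

ring = Ring(["node-A", "node-B", "node-C"])
print(ring.owner("configuration-42"))  # deterministic: same key, same owner
# When a node joins or leaves, only the keys of its ring neighbors move,
# which is what makes the scheme suitable for a dynamic platform.
```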
Continuous Integration is a development method in which developers commit their work in a version control system (such as CVS or Subversion) very frequently (typically several times per day) and the project is automatically rebuilt. One of the advantages of this technique is that merge problems are detected and corrected early.
The build process not only generates the binaries, it also runs automated tests, generates documentation, checks the code coverage of tests and analyzes code style...
The whole process can take several hours for large projects. Therefore, the efficiency of this development method relies on the speed of the feedback. There is a real need to speed up the build process, and thus to distribute it. This is one of the goals of the continuous integration server Xooctory.
In order to obtain an efficient distribution of the build, the build process can be decomposed into nearly independent sub processes, executed on different nodes. Nevertheless, to be completed, a sub process must be run on a node that holds the appropriate version of the tools (compiler, code auditing software, ...), the appropriate version of the libraries, and the appropriate version of source code. Of course, if the target node does not have all these items, it can download them from another node, but these communications may be more expensive than the execution of the sub processes.
This raises several challenging problems:
Build a distributed data structure that can efficiently provide:
one of the nodes that stores a certain set S of files;
one of the nodes that stores a maximum subset S' of a set S of files;
one of the nodes that can quickly obtain a certain set S of files (i.e., a node that can efficiently download the files of S that it does not already hold).
Design distribution strategies of the build that take advantage of the processing and communication capabilities of the nodes.
We are collaborating with Xavier Hanin and Jayasoft in order to solve distribution problems in the context of distributed continuous integration. Our goal is to incorporate some of the services developed in Cepage to obtain a large scale distributed version of the continuous integration server xooctory.
Ludovic Courtes (delegated to CEPAGE as INRIA SED Engineer) is currently working on a distributed version of the integration server Xooctory. We expect to distribute this new version at the end of 2010.
We recently focused on two problems:
Data cube queries represent an important class of On-Line Analytical Processing (OLAP) queries in decision support systems. They consist in a precomputation of the different group-bys of a database (aggregation for every combination of GROUP BY attributes), which is a very time-consuming task. For instance, databases of a few megabytes may lead to the construction of a datacube requiring terabytes of memory; parallel computation has been proposed, but for a static and well-identified platform . This application is typically an interesting example for which distributed computation and storage can be useful in a heterogeneous and dynamic setting. We have just started a collaboration with Sofian Maabout (Assistant Professor in Bordeaux) and Noel Novelli (Assistant Professor at Marseille University), who is a specialist of datacube computation. Our goal is to rely on the set of services defined in Section to compute and maintain huge datacubes. For the moment, we have developed a centralized tool that sums up a whole datacube up to dimension 20 and outperforms the usual data cube reduction schemes.
Some work is required to make our tool available to a wide public. We plan to do it within the next two years.
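The cost described above can be made concrete with a toy cube: a datacube materializes one aggregate table per subset of the GROUP BY attributes, i.e. 2^d group-bys for d dimensions. This pure-stdlib sketch uses made-up column names and a SUM aggregate.

```python
# Toy data cube: one SUM(sales) group-by per subset of the dimensions.
from itertools import combinations
from collections import defaultdict

rows = [
    {"city": "Bordeaux", "year": 2007, "product": "A", "sales": 10},
    {"city": "Bordeaux", "year": 2008, "product": "B", "sales": 5},
    {"city": "Paris",    "year": 2007, "product": "A", "sales": 7},
]
dims = ("city", "year", "product")

cube = {}
for k in range(len(dims) + 1):
    for group in combinations(dims, k):       # one group-by per subset
        agg = defaultdict(int)
        for r in rows:
            key = tuple(r[d] for d in group)
            agg[key] += r["sales"]            # SUM(sales) aggregate
        cube[group] = dict(agg)

print(len(cube))        # → 8 group-bys for 3 dimensions (2^3)
print(cube[("city",)])  # → {('Bordeaux',): 15, ('Paris',): 7}
```

With 20 dimensions there are 2^20 ≈ one million group-bys, which is exactly why view selection (choosing which part of the cube to precompute) matters.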
In the framework of the Alcatel-Lucent Bell collaboration, we are developing a simulator of routing algorithms. This development is performed by the engineer F. Majorczyk at the Bordeaux site, in collaboration with the INRIA Sophia-Antipolis site.
The main objective is to give a complete experimental study of the compact routing scheme given by Abraham, Gavoille, Malkhi, Nisan and Thorup in 2008. This algorithm guarantees, for every weighted n-node network, routing tables of size Õ(√n), while the stretch factor is at most 3, i.e., the length of the routes induced by the scheme is never more than three times the optimal length (the distance). The bounds on the stretch and on the memory are both optimal. Moreover, the scheme is "name-independent", that is, the routing decision at the source router is made on the basis of the original name of the destination node. No information can be implicitly encoded in the node names, like coordinates in a grid network. This extra feature is important in practice since, in many contexts, nodes cannot be renamed according to some global state of the network, in particular whenever the network is growing and dynamic. This study, if successful, would be the first to report experiments on a name-independent routing scheme. Our simulator implements several graph generators, and the target algorithm currently works on 3000 nodes. We plan to extend the experiments up to 10,000 nodes, and, in parallel, to give a message-efficient distributed algorithm and implementation of this algorithm.
Θ_k-graphs are geometric graphs that appear in the context of graph navigation. The shortest-path metric of these graphs is known to approximate the Euclidean complete graph up to a factor depending on the cone number k and the dimension of the space.
TD-Delaunay graphs, a.k.a. triangular-distance Delaunay triangulations, introduced by Chew, have been shown to be plane 2-spanners of the 2D Euclidean complete graph, i.e., the distance in the TD-Delaunay graph between any two points is no more than twice the distance in the plane.
Orthogonal surfaces are geometric objects defined from independent sets of points of the Euclidean space. Orthogonal surfaces are well studied in combinatorics (orders, integer programming) and in algebra. From orthogonal surfaces, geometric graphs, called geodesic embeddings can be built.
We have introduced a specific subgraph of the Θ_6-graph defined in the 2D Euclidean space, namely the half-Θ_6-graph, composed of the even-cone edges of the Θ_6-graph. Our main contribution is to show that these graphs are exactly the TD-Delaunay graphs, and are strongly connected to the geodesic embeddings of orthogonal surfaces of coplanar points in the 3D Euclidean space.
Using these new bridges between these three fields, we establish:
Every Θ_6-graph is the union of two spanning TD-Delaunay graphs. In particular, Θ_6-graphs are 2-spanners of the Euclidean graph. It was not known that Θ_6-graphs are t-spanners for some constant t, and Θ_7-graphs were only known to be t-spanners for a larger constant t.
Every plane triangulation is TD-Delaunay realizable, i.e., every combinatorial plane graph whose interior faces are all triangles is the TD-Delaunay graph of some point set in the plane. Such a realizability property does not hold for classical Delaunay triangulations.
In collaboration with Ljubomir Perković, we have also worked on the question of bounded-degree planar spanners: what is the minimum d such that, for any point set, there exists a planar spanner of degree at most d? We have proposed an algorithm that computes a 6-spanner of degree at most 6. The best previously known bound on the maximum degree of a planar spanner was 14, with a stretch factor of 3.53.
There are several techniques to manage sublinear-size routing tables (in the number of nodes of the platform) while guaranteeing almost shortest paths (cf. for a survey of routing techniques).
Some techniques provide routes of length at most 1 + ε times the length of the shortest one (which is the definition of a stretch factor of 1 + ε) while maintaining a polylogarithmic number of entries per routing table , , . However, these techniques are not universal in the sense that they apply only to some classes of underlying topologies. Universal schemes exist: typically they achieve Õ(√n)-entry local routing tables for a stretch factor of 3 in the worst case , . Some experiments have shown that such methods, although universal, work very well in practice, on average, on realistic scale-free or existing topologies .
While the fundamental question is to determine the best stretch-space tradeoff for universal schemes, the challenge for platform routing would be to design specific schemes supporting reasonable dynamic changes in the topology or in the metric, at least for a limited class of relevant topologies. In this direction, network topologies have been constructed (in polynomial time) for which nodes can be labeled once and for all such that, however the link weights vary in time, shortest-path routing tables with compactness k can be designed, i.e., for each routing table the set of destinations using the same first outgoing edge can be grouped in at most k ranges of consecutive labels.
One other aspect of the problem would be to model a realistic typical platform topology. Natural parameters (or characteristics) for this are low dimensionality: Euclidean or near-Euclidean networks of low dimension, low growth dimension or, more generally, low doubling dimension.
In 2007, we improved compact routing schemes for planar networks, and more generally for networks excluding a fixed minor . This latter family of networks includes (but is not restricted to) networks embeddable on surfaces of bounded genus and networks of bounded treewidth. The stretch factor of our scheme is constant and the size of each routing table is only polylogarithmic (independently of the degree of the nodes), and the scheme does not require renaming (or a new addressing) of the nodes: it is name-independent. More importantly, the scheme can be constructed efficiently in polynomial time, and the complexities do not hide large constants as one may encounter in Graph Minor Theory. This construction has been achieved by the design of new sparse covers for planar graphs, solving a problem open since STOC '93.
In 2007, we also gave an invited lecture on compact routing schemes at a workshop on PeertoPeer, Routing in Complex Graphs, and Network Coding in Thomson Labs in Paris.
In 2008, we proposed a minimum-stretch compact name-independent routing scheme . This scheme is the basis of the compact routing simulator we are developing in the Alcatel-Lucent Bell project.
In order to optimize applications the platform topology itself must be discovered, and thus represented in memory with some data structures. The size of the representation is an important parameter, for instance, in order to optimize the throughput during the exploration phase of the platform.
Classical data structures for representing a graph (matrix or list) can be significantly improved when the targeted graph falls into some specific class or obeys some property: the graph has bounded genus (embeddable on a surface of fixed genus), bounded treewidth (or is c-decomposable), or is embeddable into a bounded number of pages , . Typically, planar topologies with n nodes (thus embeddable in the plane with no edge crossings) can be efficiently coded in linear time with at most 5n + o(n) bits, supporting adjacency queries in constant time. This improves the classical adjacency list by a non-negligible log n factor on the size (the size is about 6 n log n bits for an edge list), and also on the query time , , .
In 2008, we gave a compact encoding scheme for pagenumber-k graphs .
The basic routing scheme and the overlay networks must also allow us to route queries other than the routing driven by applications. Typically, divide-and-conquer parallel algorithms require computing many nearest common ancestor (NCA) queries in some tree decomposition. In a large scale platform, if the current tree structure is fully or partially distributed, then the physical location of the NCA in the platform must be optimized. More precisely, the NCA computation must be performed from distributed pieces of information, and then addressed via the routing overlay network (cf. for distributed NCA algorithms).
Recently, a theory of localized data structures has been developed (initiated by ; see for a survey). One associates with each node a label such that some given function (or predicate) of the nodes can be extracted from two or more labels. These labels are usually joined to the addresses or inserted into a global database index.
In relation with the project, queries involving the flow computation between any source-sink pair of a capacitated network are of great interest . Dynamic labeling schemes are also available for tree models , , and need further work for their adaptation to more general topologies.
Finally, localized data structures have applications to platforms implementing large XML database files. Roughly speaking, pieces of a large XML file are distributed over some platform, and some queries (typically some SELECT ... FROM extractions) involve many tree ancestor queries , the XML file structure being a tree. In this framework, distributed label-based data structures avoid storing a huge classical index database.
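Such ancestor queries are the classic use case for interval-based ancestry labels: each node is labeled with its DFS entry and exit times, and u is an ancestor of v exactly when u's interval contains v's, so two small labels answer the query with no access to the tree. The tree below is a made-up XML skeleton.

```python
# DFS-interval ancestry labels: label(v) = (entry, exit) times.
# is_ancestor is answered from the two labels alone.

def label_tree(tree, root):
    """tree: dict node -> list of children."""
    labels, clock = {}, 0
    def dfs(u):
        nonlocal clock
        start = clock
        clock += 1
        for child in tree.get(u, []):
            dfs(child)
        labels[u] = (start, clock)   # u's interval contains its subtree
        clock += 1
    dfs(root)
    return labels

def is_ancestor(lab_u, lab_v):
    """True iff u's DFS interval contains v's."""
    return lab_u[0] <= lab_v[0] and lab_v[1] <= lab_u[1]

tree = {"doc": ["head", "body"], "body": ["p1", "p2"]}
lab = label_tree(tree, "doc")
print(is_ancestor(lab["doc"], lab["p1"]), is_ancestor(lab["head"], lab["p1"]))
# → True False
```

Each label costs O(log n) bits; the schemes cited in the text refine this constant and extend it to k-ancestry and distance queries.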
In 2007, we proved that it is possible to assign to each node of an n-node planar network a label of 2 log n + O(log log n) bits so that adjacency between two nodes can be retrieved from their labels . Classical representations of planar graphs in the distributed setting were based on the Three Schnyder Trees decomposition, leading to 3 log n + O(log^{*} n)-bit labels (FOCS '01). An intriguing question is whether a c log n-bit representation exists for planar graphs with c < 2.
For trees, we can solve k-ancestry and distance-k queries with shorter labels , . Previous solutions achieve log n + O(k^{2} log log n)-bit labels [Alstrup-Bille-Rauhe 2005], whereas we have proved that log n + O(k log log n)-bit labels suffice. For interval graphs, we have given an optimal distance labeling scheme , and we proposed a localized and compact data structure for comparability graphs .
In , , , we also analyzed the locality of the construction of sparse spanners. In , we proposed an efficient first-order model checking using short labels.
Finally, we have started a collaboration with Andrew Twigg (Thomson Labs) and Bruno Courcelle (LaBRI) about connectivity in semi-dynamic planar networks (see preliminary results). In this model, one must precompute some localized data structure (given as a label associated with each node) for a planar graph G, so that connectivity between any two nodes in G∖X, where X is any subset of nodes or edges, can be determined from the labels of the two nodes and the labels of the nodes (or the endpoints of the edges) of X. This field looks promising since it captures a kind of dynamicity of the network, and we hope to generalize this model and our results.
Distributed Greedy Coloring is an interesting and intuitive variation of the standard Coloring problem. Given an order among the colors, a coloring is said to be greedy if there does not exist a vertex whose color can be replaced by a color of lower position in the fixed order without violating the property that neighboring vertices must receive different colors. In , we consider the problems of Greedy Coloring and Largest-First Coloring (a variant of greedy coloring with strengthened constraints) in the Linial model of distributed computation, providing lower and upper bounds and a comparison to the (Δ + 1)-Coloring and Maximal Independent Set problems, with Δ being the maximum vertex degree in G.
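The greedy property can be made concrete with a sequential sketch (the distributed algorithms above must achieve the same outcome without a global vertex order): a coloring is greedy when no vertex can be recolored with a smaller color, which is exactly what a sequential first-fit pass over any vertex order produces.

```python
# Sequential first-fit coloring and a checker for the greedy property.

def greedy_color(adj, order):
    """First-fit: each vertex takes the smallest color unused by neighbors."""
    color = {}
    for v in order:
        used = {color[u] for u in adj[v] if u in color}
        c = 0
        while c in used:
            c += 1
        color[v] = c
    return color

def is_greedy(adj, color):
    """Every vertex already uses the smallest color its neighborhood allows."""
    return all(
        all(any(color[u] == c for u in adj[v]) for c in range(color[v]))
        for v in adj
    )

adj = {0: [1, 2], 1: [0, 2], 2: [0, 1], 3: [0]}  # a triangle plus a pendant
col = greedy_color(adj, [0, 1, 2, 3])
print(col, is_greedy(adj, col))  # → {0: 0, 1: 1, 2: 2, 3: 1} True
```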
We also proposed a new algorithm that allows the administrator or user of a DBMS to choose which part of the data cube to optimize. This problem is called in the literature the view selection problem. The goal is to choose the best part of the whole data cube to precompute. Our contribution is to consider that the main constraint is the time to answer individual queries, whereas usually the memory constraint is considered .
The next step consists in turning our approach into a parallel and distributed algorithm. We are currently experimenting with a parallel algorithm that has a theoretical guarantee of performance. More precisely, given a constant f, the query time is at most f times the optimal query time (defined whenever the result has already been precomputed).
It turns out that our solution can be adapted to the problem of quickly finding the maximal frequent itemsets within transaction tables . A transaction consists of a list of items. For a given frequency, we aim at computing the maximal itemsets that are frequent in a list of transactions. To our knowledge, there is no parallel algorithm with a performance guarantee that computes the maximal frequent itemsets. Our solution for the view selection problem should be evaluated on real instances.
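The problem can be stated with a brute-force reference implementation: enumerate frequent itemsets level by level (Apriori-style), then keep those with no frequent superset. The exponential blow-up in the number of items is exactly why a parallel algorithm with a performance guarantee is interesting; the data is invented.

```python
# Brute-force maximal frequent itemsets (reference, not the algorithm
# discussed in the text).
from itertools import combinations

def frequent_itemsets(transactions, minsup):
    """All itemsets contained in at least minsup transactions."""
    items = sorted({i for t in transactions for i in t})
    frequent = []
    for k in range(1, len(items) + 1):
        level = [s for s in combinations(items, k)
                 if sum(set(s) <= t for t in transactions) >= minsup]
        if not level:          # anti-monotonicity: no larger set can qualify
            break
        frequent += level
    return frequent

def maximal(frequent):
    """Keep only itemsets with no frequent strict superset."""
    fs = [set(s) for s in frequent]
    return [s for s in fs if not any(s < t for t in fs)]

transactions = [{"a", "b", "c"}, {"a", "b"}, {"a", "c"}, {"b", "c"}]
print(maximal(frequent_itemsets(transactions, 2)))
# the three pairs {a,b}, {a,c}, {b,c}; the triple occurs only once
```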
An overlay network is a virtual network whose nodes correspond either to processors or to resources of the network. Virtual links may depend on the application; for instance, different overlay networks can be designed for routing and broadcasting.
These overlay networks should support insertion and deletion of users/resources, and thus they inherently have a high dynamism.
We should distinguish structured and unstructured overlay networks:
In the first case, one aims at designing a network in which queries can be answered efficiently: greedy routing should work well (without backtracking), and spreading a piece of information should take very little time and few messages. The natural topologies of these networks are graphs of small diameter and bounded degree (the De Bruijn graph, for instance). However, the dynamic maintenance of a precise structure is difficult, and any perturbation of the topology leaves no guarantee for the desired tasks.
In the case of unstructured networks, there is no strict topology control. For the information retrieval task, the only attempt to bound the total number of messages consists of optimizing a flooding by taking into account statistics stored at each peer: number of requests that found an item traversing a given link, ...
In both approaches, the physical topology is not involved. To our knowledge, there exists only one attempt in this direction. The work of Abraham and Malhki deals with the design of routing tables for stable platforms.
We are interested in designing overlay topologies that take into account the physical topology.
Another direction is promising: if we relax the condition of designing an overlay network with a precise topology and only require some topological properties, we might construct very efficient overlay networks. Two directions can be considered: random graphs and small-world networks.
Random graphs are promising for broadcast and have been proposed for the update of replicated databases, in order to minimize the total number of messages and the time complexity , . The underlying topology is the complete graph, but the communication graph (pairs of nodes that effectively interact) is much sparser. At each pulse of its local clock, each node tries to send or receive any new piece of information. The advantage of this approach is fault-tolerance. However, this epidemic spreading leads to a waste of messages, since a node can receive the same update many times. We are interested in fixing this drawback, and we think that it should be possible.
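The wasted messages are easy to exhibit in a tiny simulation of epidemic spreading. This sketch uses a push-only variant (each informed node calls one uniformly random node per round); the push-only choice, the seed and the sizes are illustrative assumptions, not the protocols cited in the text.

```python
# Push-only epidemic rumor spreading, counting duplicate deliveries.
import random

def push_gossip(n, seed=1):
    random.seed(seed)
    informed = {0}                 # node 0 holds the initial update
    rounds = wasted = 0
    while len(informed) < n:
        rounds += 1
        for u in list(informed):
            v = random.randrange(n)
            if v in informed:
                wasted += 1        # duplicate delivery: a wasted message
            else:
                informed.add(v)
    return rounds, wasted

rounds, wasted = push_gossip(1000)
print(rounds, wasted)  # full coverage in O(log n) rounds, many wasted calls
```

The number of wasted calls grows roughly like n log n in the last phase, which is precisely the overhead the paragraph above proposes to reduce.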
For several queries, recent solutions use small-world networks. This approach is inspired by experiments in social sciences . It suggests that adding a few (non-uniform) random and uncoordinated virtual long links to every node drastically shrinks the diameter of the network. Moreover, paths with a small number of hops can be found , , .
Solutions based on network augmentation (i.e., adding virtual links to a base network) have proved to be very promising for large scale networks. This technique is referred to as turning a network into a small-world network, also called the small-worldization process. Indeed, it allows many arbitrary networks to be transformed into networks in which search operations can be performed in a greedy fashion and very quickly (typically in time polylogarithmic in the size of the network). This property implies that some information can be easily (or locally) accessed, like the distance between nodes. More formally, a network is f-navigable if greedy routing yields routing paths of O(f) hops. Recently, many authors have aimed at finding which networks can be turned into log^{O(1)}-navigable networks.
Our goal is to study more precisely the algorithmic performance of these new small-world networks (w.r.t. time, memory, pertinence, fault-tolerance, self-stabilization, ...) and to propose new networks of this kind, i.e., to construct the augmentation of the base network as well as to conceive the corresponding navigation algorithm. Like classical algorithms for routing and navigation (which are essentially based on greedy algorithms), the proposed solutions have to take into account that no entity has a global knowledge of the network. A first result in this direction is promising: in , we proposed an economical distributed algorithm to turn a bounded-growth network into a small-world network. Moreover, the practical challenge will be to adapt such constructions to dynamic networks, at least under the models that are identified as relevant.
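Augmentation and greedy navigation can be sketched on the simplest base network, a ring: every node receives one long-range link whose length is drawn from a harmonic (Kleinberg-style) distribution, and greedy routing forwards to whichever neighbor is closest to the target. The ring, the seed and the constants are illustrative assumptions, not the bounded-growth construction cited above.

```python
# Kleinberg-style small-worldization of a ring + greedy navigation.
import random

def ring_dist(a, b, n):
    return min((a - b) % n, (b - a) % n)

def augment(n, seed=7):
    """One long-range link per node: Pr[offset d] proportional to 1/dist."""
    random.seed(seed)
    offsets = list(range(1, n))
    weights = [1.0 / min(d, n - d) for d in offsets]   # harmonic distribution
    return {u: (u + random.choices(offsets, weights)[0]) % n
            for u in range(n)}

def greedy_route(n, longlink, s, t):
    """Forward to the neighbor (ring or long link) closest to the target."""
    hops = 0
    while s != t:
        nbrs = [(s - 1) % n, (s + 1) % n, longlink[s]]
        s = min(nbrs, key=lambda v: ring_dist(v, t, n))
        hops += 1
    return hops

n = 512
ll = augment(n)
print(greedy_route(n, ll, 0, n // 2))  # at most n/2 hops, typically far fewer
```

Since a ring neighbor always strictly decreases the distance to the target, greedy routing terminates; the long links are what bring the hop count down to polylogarithmic on average.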
Can the small-worldization process be supported in dynamic platforms? Up to now, the literature on small-world networks only deals with the routing task. We are convinced that small-world topologies are also relevant for other tasks: quick broadcast, search in the presence of faulty nodes, etc. In general, we think that maintaining a small-world topology can be much more realistic than maintaining a rigidly structured overlay network, and much more efficient for several tasks than unstructured overlay networks.
In 2007, we had two contributions dealing with overlay networks: (1) in , there is a formal description of an algorithm turning any network into an n^{1/3}-navigable network. This article is particularly interesting since it is the first one that considers an arbitrary input network in the small-worldization process; (2) in , , we prove that local knowledge is not enough to search quickly for a target node in scale-free networks. Recent studies showed that many real networks are scale-free: the distribution of node degrees follows a power law of the form k^{-α} with α ∈ [2, 3], that is, the number of nodes of degree k is proportional to k^{-α}. More precisely, we formally prove that in usual scale-free models, it takes Ω(n^{1/2}) steps to reach the target.
In 2008, we gave a small stretch polylogarithmic network navigability scheme using compact metrics .
In the effort to understand the algorithmic limitations of computing by a swarm of robots, research has focused on the minimal capabilities that allow a problem to be solved. The weakest of the commonly used models is Asynch, where the autonomous mobile robots, endowed with visibility sensors (but otherwise unable to communicate), operate in Look-Compute-Move cycles performed asynchronously by each robot. The robots are often assumed (or required) to be oblivious: they keep no memory of observations and computations made in previous cycles. In the paper , we consider the setting where the robots are dispersed in an anonymous and unlabeled graph, and they must perform the very basic task of exploration: within finite time every node must be visited by at least one robot and the robots must enter a quiescent state. The complexity measure of a solution is the number of robots used to perform the task. We study the case when the graph is an arbitrary tree and establish some unexpected results. We first prove that there are n-node trees where Ω(n) robots are necessary; this holds even if the maximum degree is 4. On the other hand, we show that if the maximum degree is 3, it is possible to explore with only O(log n / log log n) robots. The proof of this result is constructive. Finally, we prove that the size of the team is asymptotically optimal: we show that there are trees of maximum degree 3 whose exploration requires Ω(log n / log log n) robots.
We also consider the problem of periodic graph exploration, in which a mobile entity with constant memory, an agent, has to visit all n nodes of an arbitrary undirected graph G in a periodic manner. Graphs are supposed to be anonymous, that is, nodes are unlabeled. However, while visiting a node, the robot has to distinguish between the edges incident to it. For each node v, the endpoints of the edges incident to v are uniquely identified by different integer labels called port numbers. We are interested in minimizing the length of the exploration period. This problem is unsolvable if the local port numbers are set arbitrarily [L. Budach: Automata and labyrinths, Math. Nachrichten 86(1): 195-282 (1978)]. However, surprisingly small periods can be achieved when the local port numbers are assigned carefully. Dobrev et al. [S. Dobrev, J. Jansson, K. Sadakane, W.K. Sung: Finding Short Right-Hand-on-the-Wall Walks in Graphs, 12th Colloquium on Structural Information and Communication Complexity SIROCCO, LNCS 3499, 127-139, 2005] described an algorithm for assigning port numbers, and an oblivious agent (i.e., an agent with no memory) using it, such that the agent explores all graphs of size n within period 10n. Providing the agent with a constant number of memory bits, the optimal length of the period was proved to be no more than 3.75n (using a different assignment of the port numbers). In this paper, we improve both these bounds. More precisely, we show a period of length at most 4⅓n for oblivious agents, and a period of length at most 3.5n for agents with constant memory. Moreover, we give the first non-trivial lower bound, 2.8n, on the period length in the oblivious case.
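As an illustration of this setting, the sketch below (our own toy encoding of port numbers, not taken from the paper) runs the basic oblivious rule, exit through the port that follows the entry port, on a port-numbered triangle:

```python
def basic_walk(ports, start, steps):
    """Oblivious 'right-hand-on-the-wall' agent: on entering a node
    through port p, it leaves through port (p + 1) mod deg.
    `ports[v][p] = (u, q)` means port p of v leads to port q of u."""
    v, p_in = start
    visited = [v]
    for _ in range(steps):
        deg = len(ports[v])
        v, p_in = ports[v][(p_in + 1) % deg]
        visited.append(v)
    return visited

# Port-numbered triangle: port 0 of node i leads to port 1 of node i+1.
ports = {i: {0: ((i + 1) % 3, 1), 1: ((i - 1) % 3, 0)} for i in range(3)}
walk = basic_walk(ports, (0, 1), 9)
assert set(walk) == {0, 1, 2}               # every node is visited
assert walk[:3] == walk[3:6] == walk[6:9]   # periodic, period 3 here
```

On this particular port assignment the agent cycles through all nodes with period n = 3; the results above bound how large the period can get over all graphs.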
The rotor-router model, also called the Propp machine, was first considered as a deterministic alternative to the random walk. It is known that the route adopted in an undirected graph G = (V, E), where |V| = n and |E| = m, by an agent controlled by the rotor-router mechanism eventually forms an Euler tour on the arcs obtained by replacing each edge of G by two arcs with opposite directions. The process of ushering the agent into an Euler tour is referred to as the lock-in problem. In recent work [V. Yanovski, I.A. Wagner, A.M. Bruckstein: A Distributed Ant Algorithm for Efficiently Patrolling a Network, Algorithmica 37: 165-186 (2003)], Yanovski et al. proved that, independently of the initial configuration of the rotor-router mechanism in G, the agent locks in within time bounded by 2mD, where D is the diameter of G. We examine the dependence of the lock-in time on the initial configuration of the rotor-router mechanism. The case study is performed in the form of a game between a player intending to lock in the agent in an Euler tour as quickly as possible and an adversary with the opposite objective. First, we observe that in certain (easy) cases the lock-in can be achieved in time O(m). On the other hand, we show that if the adversary is solely responsible for the assignment of ports and pointers, a lock-in time of Ω(m·D) can be enforced in any graph with m edges and diameter D. Furthermore, we show that if the player provides its own port numbering after the initial setup of pointers by the adversary, the complexity of the lock-in problem is bounded by O(m·min{log m, D}). We also propose a class of graphs in which the lock-in requires time Ω(m·log m). In the remaining two cases we show that the lock-in requires time Ω(m·D) in graphs with worst-case topology. In addition, however, we present non-trivial classes of graphs with a large diameter in which the lock-in time is O(m).
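The lock-in phenomenon is easy to observe in simulation. The following sketch (our own toy code, not taken from the paper) runs the rotor-router on the doubled triangle and checks that the walk has settled into an Euler tour of the 2m arcs:

```python
def rotor_router(adj, start, steps):
    """Rotor-router (Propp machine): on each visit, the agent leaves
    along the arc the node's rotor points to, then the rotor advances
    cyclically to the next outgoing arc."""
    pointer = {v: 0 for v in adj}   # rotor position at each node
    route = []                      # sequence of traversed arcs (u, v)
    u = start
    for _ in range(steps):
        v = adj[u][pointer[u]]
        pointer[u] = (pointer[u] + 1) % len(adj[u])
        route.append((u, v))
        u = v
    return route

# Triangle: m = 3 edges, D = 1, so lock-in occurs within 2mD = 6 steps;
# afterwards every window of 2m = 6 consecutive arcs is an Euler tour
# of the 6 arcs of the doubled graph.
adj = {0: [1, 2], 1: [2, 0], 2: [0, 1]}
route = rotor_router(adj, 0, 30)
assert len(set(route[-6:])) == 6   # each arc traversed exactly once
```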
if at some step the values of k of the pointers are arbitrarily changed, then a new Eulerian cycle is established within O(k·m) steps;
if at some step k edges are added to the graph, then a new Eulerian cycle is established within O(k·m) steps;
if at some step an edge is deleted from the graph, then a new Eulerian cycle is established within O(γ·m) steps, where γ is the smallest number of edges in a cycle of G containing the deleted edge.
Our proofs are based on the relation between Eulerian cycles and spanning trees known as the "BEST" theorem (after de Bruijn, van Aardenne-Ehrenfest, Smith and Tutte).
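The BEST theorem itself can be checked on a small instance. The sketch below (illustrative code, not from the paper) counts the Eulerian circuits of the doubled triangle as the matrix-tree minor times the product of (out-degree − 1)! factors:

```python
from fractions import Fraction
from math import factorial

def count_eulerian_circuits_best(arcs):
    """BEST theorem: ec(G) = t_w(G) * prod_v (deg_out(v) - 1)!, where
    t_w(G) is the number of arborescences rooted at w, obtained as a
    cofactor of the Laplacian L = D_out - A (matrix-tree theorem)."""
    nodes = sorted({u for u, _ in arcs} | {v for _, v in arcs})
    idx = {v: i for i, v in enumerate(nodes)}
    n = len(nodes)
    L = [[Fraction(0)] * n for _ in range(n)]
    deg_out = {v: 0 for v in nodes}
    for u, v in arcs:
        deg_out[u] += 1
        L[idx[u]][idx[u]] += 1
        L[idx[u]][idx[v]] -= 1
    # Determinant of L with row/column 0 removed (Gaussian elimination).
    M = [row[1:] for row in L[1:]]
    det = Fraction(1)
    for i in range(n - 1):
        pivot = next((r for r in range(i, n - 1) if M[r][i] != 0), None)
        if pivot is None:
            return 0
        if pivot != i:
            M[i], M[pivot] = M[pivot], M[i]
            det = -det
        det *= M[i][i]
        for r in range(i + 1, n - 1):
            f = M[r][i] / M[i][i]
            M[r] = [a - f * b for a, b in zip(M[r], M[i])]
    ec = det
    for v in nodes:
        ec *= factorial(deg_out[v] - 1)
    return int(ec)

# Doubled triangle: each edge of K3 replaced by two opposite arcs.
arcs = [(0, 1), (1, 0), (1, 2), (2, 1), (0, 2), (2, 0)]
print(count_eulerian_circuits_best(arcs))  # → 3
```

For the doubled triangle, the minor equals 3 and every (deg − 1)! factor is 1, matching the three Eulerian circuits one finds by hand.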
Within the wider context of the project, we have published two book chapters, on data gathering and on energy consumption in wireless networks. We have also considered the problems of modeling wireless networks, energy efficiency in wireless networks, efficient realization of specific classes of permutation networks, and broadcasting in radio networks.
Even if the range of applications for large scale platforms is currently limited, the targeted platforms are clearly not suited to tightly coupled codes, and we need to concentrate on simple scheduling problems in the context of large scale distributed unstable platforms. Indeed, most scheduling problems are already NP-complete, with bad approximation ratios, on static homogeneous platforms even when communication costs are not taken into account.
Recently, many algorithms have been derived, under several communication models, for master-slave tasking and Divisible Load Scheduling (DLS).
In this case, we aim at executing a large bag of independent, same-size tasks. First we assume that there is a single master, which initially holds (the data needed for) all the tasks. The problem is to determine an architecture for the execution. Which processors should the master enroll in the computation? How many tasks should be sent to each participating processor? In turn, each processor involved in the execution must decide which fraction of the tasks must be computed locally, and which fraction should be sent to which neighbor (these neighbors must be determined too).
Parallelizing the computation by spreading the execution across many processors may well be limited by the induced communication volume. Rather than aiming at makespan minimization, a more relevant objective is the optimization of the throughput in steady-state mode. There are three main reasons for focusing on steady-state operation. The first is simplicity, as steady-state scheduling is in fact a relaxation of the makespan minimization problem in which the initialization and clean-up phases are ignored. One only needs to determine, for each participating resource, which fraction of time is spent computing for which application, and which fraction of time is spent communicating with which neighbor; the actual schedule then arises naturally from these quantities.
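For a single-master star network the steady-state throughput admits a simple closed form. The sketch below is a toy illustration under assumed parameters (master bandwidth B, per-client link bandwidths, client compute speeds), not the project's actual algorithms:

```python
def steady_state_throughput(B, links, speeds):
    """Steady-state task throughput of a star network: client i absorbs
    tasks at a rate limited both by its link bandwidth links[i] and by
    its compute speed speeds[i] (tasks/s); the master's total outgoing
    rate is capped by its bandwidth B."""
    return min(B, sum(min(b, s) for b, s in zip(links, speeds)))

# Three clients; the second is communication-bound, the third compute-bound:
# contributions are min(2,3) + min(1,4) + min(3,0.5) = 3.5 < B.
rho = steady_state_throughput(B=5.0, links=[2.0, 1.0, 3.0], speeds=[3.0, 4.0, 0.5])
assert rho == 3.5
```

Once the per-client rates are fixed, a periodic schedule achieving this throughput follows from the rates, which is the relaxation at work in steady-state scheduling.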
We have considered task scheduling for parallel multifrontal methods, which corresponds to mapping a set of tasks whose dependencies are described by a tree. We have also proposed several distributed scheduling algorithms for the case where several applications are simultaneously mapped onto a heterogeneous platform.
Another important and still open issue in Divisible Load Scheduling deals with return communications. Under the classical model, the communication time for returning results from the slaves to the master node is assumed to be negligible, which strongly limits the application field. In particular, the complexity of the problem with return messages is still open. This question has been studied in cooperation with Abhay Ghatpande, from Waseda University. In particular, we have proposed two heuristics for scheduling return messages with different computational costs.
In this context, we have participated in the writing of two book chapters, about different possible models of communications and about steady-state scheduling.
We have revisited several classical scheduling problems (broadcasting, independent tasks scheduling) under more realistic communication models, whose parameters can be instantiated at runtime. We have proved that the use of resource augmentation techniques makes it possible to derive quasi-optimal algorithms even when the underlying scheduling problems are strongly NP-complete.
We have considered the problem of allocating a large number of independent, equal-sized tasks to a heterogeneous large scale computing platform. We model the platform using a set of servers (masters) that initially hold (or generate) the tasks to be processed by a set of clients (slaves). All resources have different speeds of communication and computation, and we model contentions using the bounded multiport model. This model corresponds well to modern networking technologies but, for the sake of realism, another parameter needs to be introduced in order to bound the number of simultaneous connections that can be opened at a server node. We prove that, unfortunately, this additional parameter makes the problem of maximizing the overall throughput NP-complete. On the other hand, we also propose a polynomial-time algorithm, based on a slight resource augmentation, to solve this problem. More specifically, we prove that, if d_j denotes the maximal number of connections that can be opened at server node S_j, then the throughput achieved using this algorithm with d_j + 1 simultaneous connections is at least the optimal one with d_j simultaneous connections. This algorithm also provides a good approximation for the dual problem of minimizing the maximal number of connections that need to be opened in order to achieve a given throughput, and it can be turned into a standard approximation algorithm (i.e., without resource augmentation).
We have also considered an extension of the above problem (MTBD) representing the more realistic situation where clients can arrive or leave the system at any time. This extension is called online MTBD. First, we studied the complexity of the problem and obtained a negative result: no totally online algorithm is able to guarantee a desired approximation factor, even if the algorithm uses resource augmentation. On the other hand, if new connections are allowed each time a client arrives or leaves the system, we propose an algorithm that provides the optimal throughput using resource augmentation and allowing only one new connection per server (each time a client arrives or leaves the system).
In many distributed applications on large distributed systems, nodes may offer some local resources and request some remote resources. For instance, in a distributed storage environment, nodes may offer some space to store remote files and request some space to duplicate some of their files remotely. In the context of broadcasting, the offer may be seen as the outgoing bandwidth and the request as the incoming bandwidth. In the context of load balancing, overloaded nodes may request to get rid of some tasks whereas underloaded nodes may offer to process them. In this context, we propose a distributed algorithm, called the dating service, which is meant to randomly match demands and supplies of some resource among many nodes into couples. In a given round it produces a matching between demands and supplies which is of linear size (compared to the optimal one), even if the available resources of individual nodes are very heterogeneous, and which is chosen uniformly at random among all matchings of this size.
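A centralized toy stand-in for one round of this service (the real algorithm is distributed and handles heterogeneous resource amounts; here each node offers or requests one unit, and all names are illustrative) looks as follows:

```python
import random

def dating_service_round(requests, offers, rng):
    """Toy, centralized sketch of one round of the dating service:
    pair requesting nodes with offering nodes uniformly at random.
    The resulting matching has size min(len(requests), len(offers)),
    i.e. it is of linear size compared to the optimal one."""
    r, o = list(requests), list(offers)
    rng.shuffle(r)
    rng.shuffle(o)
    return list(zip(r, o))

rng = random.Random(42)
pairs = dating_service_round(["n1", "n4", "n7"], ["n2", "n5"], rng)
assert len(pairs) == 2                  # min(3 requests, 2 offers)
assert len({a for a, _ in pairs}) == 2  # no node is matched twice
assert len({b for _, b in pairs}) == 2
```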
We believe that this basic operation can be of great interest in many practical applications and could be used as a building block for writing efficient software on large distributed unstable platforms. We plan to demonstrate its practical efficiency for content distribution, management of large databases and distributed storage applications described in Section .
We also have ongoing work on using this dating service for the maintenance of a randomized overlay network against arbitrary arrivals and departures of nodes, and are trying to remove the requirement for the algorithm to work in a succession of rounds.
In this context, we would like to propose a distributed algorithm to dynamically build clusters of nodes able to process large tasks. These sets of nodes should satisfy constraints on the overall available memory and processing power, together with constraints on the maximal latency between nodes and the minimal bandwidth between two participating nodes.
We believe that such a distributed service would make it possible to address a much larger application field. We plan to demonstrate its practical efficiency first on the molecular dynamics application (based on NAMD) described in more detail in Section .
We present a modeling of this problem, called the bin-covering problem with distance constraint, and we propose a distributed approximation algorithm in the case where the elements lie in a space of dimension 1. We also describe a generic two-phase algorithm, based on resource augmentation, whose approximation ratio is 1/3. We further propose a distributed version of this algorithm when the metric space is ℝ^D (for a small value of D) and the l_∞ norm is used to define distances. This algorithm takes O(4^{D} log^{2} n) rounds and O(4^{D} n log n) messages, both in expectation and with high probability, where n is the total number of hosts.
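The one-dimensional case can be pictured with a hypothetical greedy sweep (illustrative only, not the paper's algorithm, whose approximation ratio is analysed above): scan hosts by position and close a group as soon as its total capacity covers the bin, restarting whenever the group would violate the distance constraint.

```python
def cover_1d(points, threshold, max_diam):
    """Hypothetical greedy sweep for 1-D bin covering with a distance
    constraint: `points` is a list of (position, capacity) hosts; a
    group is valid if its capacities sum to at least `threshold` and
    its positions span at most `max_diam`."""
    groups, current, start = [], [], None
    for pos, cap in sorted(points):
        if current and pos - start > max_diam:
            current, start = [], None   # group too spread out: restart
        if not current:
            start = pos
        current.append((pos, cap))
        if sum(c for _, c in current) >= threshold:
            groups.append(current)
            current, start = [], None
    return groups

hosts = [(0.0, 1), (0.5, 1), (5.0, 1), (5.5, 1), (9.0, 1)]
groups = cover_1d(hosts, threshold=2, max_diam=1.0)
assert len(groups) == 2   # {0.0, 0.5} and {5.0, 5.5}; 9.0 is left over
```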
In many applications on large scale distributed platforms, the application data files are distributed over the platform, and the volatility in the availability of resources makes it impossible to rely on a centralized system to locate data.
In this context, complex queries, such as finding a node holding a given set of files, or holding a file whose index is close to a given value, or a set of (close) nodes covering a given set of files, should be treated in a distributed manner. The queries provided by existing P2P systems are much too limited to handle such requests.
We plan to demonstrate the usefulness and efficiency of such requests on the molecular dynamics application and on the continuous integration application described in Section . Again, we strongly believe that these operations can be considered as useful building blocks for most large scale distributed applications that cannot be executed in a client-server model, and that providing a library with such mechanisms would be of great interest.
A sound approach is to structure them in such a way that they reflect the structure of the application. Peers represent objects of the application, so that neighbours in the peer-to-peer network are objects having similar characteristics from the application's point of view. Such structured peer-to-peer overlay networks provide a natural support for range and complex queries. We have proposed to use complex structures such as a Voronoï tessellation, where each peer is associated with a cell in the space. Moreover, since the cost of computing and maintaining these structures is usually extremely high for dimensions larger than 2, we have proposed to weaken the Voronoï structure to deal with higher-dimensional spaces.
We are currently adapting the techniques proposed in these papers to the molecular dynamics application in collaboration with Juan Elezgaray from IECB.
The title of this study is "Dynamic Compact Routing Scheme". The aim of this project is to develop new routing schemes achieving better performance than current BGP protocols. The problems faced by the inter-domain routing protocol of the Internet are numerous:
The underlying network is dynamic: many observations of bad configurations show the instability of BGP;
BGP does not scale well: the convergence time toward a legal configuration is too long, and the size of routing tables is proportional to the number of nodes of the network (the network size is multiplied by 1.25 each year);
The impact of the policies is so important that packets can oscillate between two Autonomous Systems.
In this collaboration, we mainly focus on the scalability properties that a new routing protocol should guarantee. The main measures are the size of the local routing tables, and the time (or message complexity) needed to update or generate such tables. The main challenge is the design of schemes achieving space sublinear in n, the number of AS routers, per router. The target networks are AS-like networks with more than 100,000 nodes. This project, in collaboration with the MASCOTTE INRIA project in Nice Sophia-Antipolis, makes use of simulations developed at both sites.
Cyril Banino (Yahoo!, Trondheim, Norway) did his Master's degree at the University of Bordeaux in 2002 under the supervision of Olivier Beaumont, and his PhD in Trondheim (N.T.N.U.). During his PhD, he worked with Olivier Beaumont on decentralized algorithms for independent tasks scheduling. This collaboration has resulted in several research visits (for a total of 5 weeks since 2003) and several joint papers (IEEE TPDS, Europar'06, IPDPS'03). He was recently appointed at Yahoo! (Trondheim), and we have started an informal collaboration with Yahoo! Research, which led to a joint publication. We now plan to establish a formal collaboration on document storage in large distributed databases, request scheduling, and independent tasks distribution across large distributed platforms.
We started an informal collaboration with Xavier Hanin (4SH), who developed Xooctory and who initiated the project Ivy, which is now a project of the Apache Software Foundation. This collaboration is supported by INRIA, which delegated Ludovic Courtès (INRIA SED engineer) to work in Cepage for one year from July 2009 on a distributed version of Xooctory.
We are testing and implementing in Xooctory different scheduling algorithms that distribute the build process. We now plan to analyse the graph of dependencies of tasks in the build process in order to propose dedicated scheduling algorithms.
Alpage, led by Olivier Beaumont, focuses on the design of algorithms for large scale platforms. In particular, we will tackle the following problems:
Large scale distributed platforms modeling
Overlay network design
Scheduling for regular parallel applications
Scheduling for applications sharing large files.
The project involves the following INRIA and CNRS teams: Cepage, Graal, Mescal, Algorille, ASAP, LRI and LIX
The scientific objectives of ALADDIN are to solve what are identified as the most challenging problems in the theory of interaction networks. The ALADDIN project is thus an opportunity to create a full continuum from fundamental research to applications in coordination with both INRIA projects CEPAGE and GANG.
The objective of the ADT INRIA Aladdin
The objective of USS SimGrid is to create a simulation framework that answers (i) the need for simulation scalability arising in the HPC community, and (ii) the need for simulation accuracy arising in distributed computing. The Cepage team will be involved in the development of tools to provide realistic model instantiations.
The project involves the following INRIA and CNRS teams: AlGorille, ASAP, Cepage, Graal, MESCAL, SysCom, CC IN2P3.
The goal of this ANR is the study of identifying codes in evolving graphs. Ralf Klasing is the overall leader of the project.
Travel grant, 2006-2008, on "Models and Algorithms for Scale-Free Structures", in collaboration with the Department of Computer Science, King's College London, and the Department of Computer Science, the University of Liverpool. Funded by the EPSRC. Main investigators on the UK side: Colin Cooper (King's College London) and Michele Zito (University of Liverpool). Ralf Klasing is the principal investigator on the French side.
European COST Action "COST 293, Graal", 2004-2008. The main objective of this COST action is to elaborate global and solid advances in the design of communication networks by letting experts and researchers with a strong mathematical background meet peers specialized in communication networks, and share their mutual experience by forming a multidisciplinary scientific cooperation community. This action has more than 25 academic and 4 industrial partners from 18 European countries.
COST 295 is an action of the European COST program (European Cooperation in the Field of Scientific and Technical Research) within the Telecommunications, Information Science and Technology (TIST) domain. The acronym of the COST 295 Action is DYNAMO, which stands for "Dynamic Communication Networks". The Action is motivated by the need to supply a convincing theoretical framework for the analysis and control of all modern large networks induced by the interactions between decentralized and evolving computing entities, characterized by their inherently dynamic nature.
The goal of ComplexHPC is to coordinate European groups working on the use of heterogeneous and hierarchical systems for HPC, as well as the development of collaborative activities among the involved research groups, in order to tackle the problem at every level (from cores to large-scale environments) and to provide new integrated solutions for large-scale computing on future platforms.
Ralf Klasing is a member of the Editorial Board of Theoretical Computer Science, Networks, Parallel Processing Letters, Algorithmic Operations Research, Fundamenta Informaticae, and Computing and Informatics.
Cyril Gavoille is a member of the Steering Committee (as treasurer) of the PODC '10 conference.
OPODIS '09 (Dec., Corsica, France) International Conference on Principles of Distributed Systems (C. Gavoille)
JDIR '09 (Feb. 2-4, Belfort, UTBM, France) Journées Doctorales en Informatique et Réseaux (C. Gavoille)
SIROCCO 2009 (16th International Colloquium on Structural Information and Communication Complexity), Piran, Slovenia, May 2009 (D. Ilcinkas)
AlgoTel 2009 (11èmes Rencontres Francophones sur les Aspects Algorithmiques des Télécommunications), Carry-le-Rouet, France, June 2009 (D. Ilcinkas)
PODC 2010, Twenty-Ninth Annual ACM SIGACT-SIGOPS Symposium on Principles of Distributed Computing, Zurich, Switzerland, July 25-28, 2010 (O. Beaumont)
PASCO 2010, International Workshop on Parallel Symbolic Computation, July 21-23, 2010, Grenoble, France (O. Beaumont)
IPDPS 2010 PhD Forum, IEEE International Parallel and Distributed Processing Symposium, Atlanta, USA, 2010 (O. Beaumont)
SSS 2009, The 11th International Symposium on Stabilization, Safety, and Security of Distributed Systems, Lyon, 2009 (O. Beaumont)
ISCIS 2009, Northern Cyprus, September 2009 (O. Beaumont)
ISPDC 2009 8th International Symposium on Parallel and Distributed Computing, Lisbon, Portugal (O. Beaumont)
HeteroPar 09, International Workshop on Algorithms, Models, and Tools for Parallel Computing on Heterogeneous Networks, August 2009, Delft, The Netherlands (O. Beaumont)
RenPar 09, Rencontres francophones du Parallélisme, Toulouse, France, 2009 (O. Beaumont)
IPDPS 2009 PhD Forum, IEEE International Parallel and Distributed Processing Symposium, Rome, Italy, 2009 (O. Beaumont)
IPDPS 09, IEEE International Parallel and Distributed Processing Symposium, Rome, Italy, 2009 (O. Beaumont)
SIROCCO 2009 (16th International Colloquium on Structural Information and Communication Complexity), Piran, Slovenia, May 2009 (R. Klasing)
MFCS 2009, 34th International Symposium on Mathematical Foundations of Computer Science, August 24-28, 2009, Novy Smokovec, High Tatras, Slovakia (R. Klasing)
ADHOC-NOW 2009, The 8th International Conference on Ad hoc Networks and Wireless, September 16-19, 2009, Murcia, Spain (R. Klasing)
OPODIS 2009 (Dec., Corsica, France) International Conference on Principles of Distributed Systems (R. Klasing)
ALGOTEL 2009 (June, Carry-le-Rouet, France), communication algorithms in networks (N. Hanusse)
STACS 2010, 27th International Symposium on Theoretical Aspects of Computer Science, March 4-6, 2010, Nancy, France (R. Klasing)
ALGOSENSORS 2010, 6th International Workshop on Algorithmic Aspects of Wireless Sensor Networks, July 2010, Bordeaux, France (R. Klasing)
IWOCA 2010, 21st International Workshop on Combinatorial Algorithms, July 26-28, 2010, London, United Kingdom (R. Klasing)
Nicolas Bonichon, Philippe Duchon, Cyril Gavoille, Nicolas Hanusse and Ralf Klasing have been involved in the organizing committee of EuroComb 2009, held in September 2009 in Bordeaux.
Nicolas Bonichon, Lionel EyraudDubois, Cyril Gavoille, and Ralf Klasing are involved in the organization of ICALP 2010 to be held in July 2010 in Bordeaux. (Cyril Gavoille is the Conference CoChair, Ralf Klasing is the Workshops Chair.)
Nicolas Hanusse is responsible for the working group on "Distributed Algorithms" at the LaBRI.
Ralf Klasing is in charge of the "Distributed Algorithms" seminar at the LaBRI.
Olivier Beaumont was external Ph.D. reviewer (rapporteur) and member of the Ph.D. committee of Matthieu Gallet (INRIA Project Team GRAAL, ENS Lyon, France)
Olivier Beaumont was member of the Ph.D. committee of Gilles Tredan (INRIA Project Team ASAP, IRISA, Rennes, France)
Olivier Beaumont was external Ph.D. reviewer (rapporteur) and member of the Ph.D. committee of Daouda Traore (INRIA Project Team MESCAL, Grenoble, France).
Olivier Beaumont was member of the Habilitation thesis committee of François Pellegrini (INRIA Project Team BACCHUS, Bordeaux, France)
Ralf Klasing was external Ph.D. coreviewer (rapporteur) and member of the Ph.D. committee of Patricio Reyes (INRIA Project Team MASCOTTE, Université de Nice Sophia Antipolis, France, August 2009)
Ralf Klasing was member of the Ph.D. committee of Cristiana Gomes (INRIA Project Team MASCOTTE, Université de Nice Sophia Antipolis, France, December 2009)
Cyril Gavoille was Ph.D. reviewer (rapporteur) of Julien Robert, 12/2009, Ecole Normale Supérieure de Lyon, LIP
Cyril Gavoille was Ph.D. reviewer (rapporteur) of Ittai Abraham, 11/2009, Jerusalem University, Israel
Cyril Gavoille was Ph.D. co-reviewer (rapporteur) of Patricio Reyes, 08/2009, Université Nice Sophia-Antipolis (committee chair)
04.06.2009: Derandomizing Random Walks in Undirected Graphs Using Locally Fair Exploration Strategies. Dagstuhl Workshop on Dynamic Networks. (R. Klasing)
Nicolas Nisse, INRIA Sophia-Antipolis, March 9-13, 2009 and November 26 to December 5, 2009 (D. Ilcinkas)
Adrian Kosowski, Gdansk University of Technology, Poland, 16-17/12/2009
Arnold L. Rosenberg, Colorado State University, USA, 23-27/10/2009
Ljubomir Perkovic, School of Computing, DePaul University, USA, 29/06-06/07/2009
Tomasz Radzik, King's College London, UK, 15/06-22/06/2009
Leszek Gasieniec, University of Liverpool, UK, 11/06-25/06/2009
Miroslaw Korzeniowski, Wroclaw University of Technology, Poland, 13/04-19/04/2009
Marek Klonowski, Wroclaw University of Technology, Poland, 13/04-19/04/2009
Leszek Gasieniec, University of Liverpool, UK, 03/04-17/04/2009
Colin Cooper, King's College London, UK, 20/02-27/02/2009
Université Paris 6 (LIP 6), 3-5 March 2009 (Nicolas Hanusse)
University of Ottawa and Carleton University (Canada), January 514, 2009 (David Ilcinkas)
Université du Québec en Outaouais (Canada), January 1423, 2009 (David Ilcinkas)
Weizmann Institute, Israel, August 2009 (1 week)
The members of CEPAGE are heavily involved in teaching activities at undergraduate and graduate levels (Licence 1, 2 and 3, Master 1 and 2, the ENSEIRB engineering school). The teaching is carried out by University members as part of their teaching duties, and by CNRS members (at Master 2 level) as extra work. It represents more than 500 hours per year.
At Master 2 level, here is a list of courses taught over the last two years:
Olivier Beaumont
Routing and P2P Networks (last year of engineering school ENSEIRB, 2009)
Cyril Gavoille
Algorithm Analysis (2nd year Master "Models and Algorithms", 2009)
Communication and Routing (last year of engineering school ENSEIRB, 2009)
Real-World Algorithms (2nd year Master "Models and Algorithms", 2009)
Graph Algorithms (3rd year Bachelor, 2009)
Ralf Klasing
Communication Algorithms in Networks (2nd year Master "Algorithms and Formal Methods", 2010)