Team GRAAL


Section: New Results

Algorithms and Software Architectures for Service Oriented Platforms

Participants : Nicolas Bard, Julien Bigot, Laurent Bobelin, Yves Caniou, Eddy Caron, Ghislain Charrier, Florent Chuffart, Benjamin Depardon, Frédéric Desprez, Gilles Fedak, Haiwu He, Benjamin Isnard, Cristian Klein, Gaël Le Mahec, Mohamed Labidi, Georges Markomanolis, Adrian Muresan, Christian Pérez, Vincent Pichon, Daouda Traore.

Cluster Resource Allocation for Multiple Parallel Task Graphs

Many scientific applications can be structured as Parallel Task Graphs (PTGs), that is, graphs of data-parallel tasks. Adding data-parallelism to a task-parallel application provides opportunities for higher performance and scalability, but poses additional scheduling challenges. We studied the off-line scheduling of multiple PTGs on a single, homogeneous cluster. The objective was to optimize performance without compromising fairness among the PTGs. Many scheduling algorithms, from both the applied and the theoretical literature, are applicable to this problem, and we propose minor improvements where possible. Our main contribution is an extensive evaluation of these algorithms in simulation, using both synthetic and real-world application configurations, with two different metrics for performance and one metric for fairness. We identify a handful of algorithms that provide good trade-offs when considering all these metrics. The best algorithm overall is one that structures the schedule as a sequence of phases of increasing duration, based on a makespan guarantee produced by an approximation algorithm.
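
To illustrate the phase-based idea, the sketch below (in Python) groups PTGs into phases of geometrically increasing duration according to a per-PTG makespan guarantee; the doubling factor and the greedy placement rule are illustrative assumptions, not the algorithm that was evaluated.

    # Sketch: group PTGs into phases of geometrically increasing duration.
    # Each PTG carries a makespan guarantee obtained from an approximation
    # algorithm (here simply given as a number); the doubling factor and the
    # greedy placement rule are illustrative assumptions.

    def build_phases(guaranteed_makespans, first_phase=1.0, factor=2.0):
        """Return a list of phases, each a (duration, [ptg indices]) pair."""
        remaining = sorted(range(len(guaranteed_makespans)),
                           key=lambda i: guaranteed_makespans[i])
        phases, duration = [], first_phase
        while remaining:
            # A PTG is placed in the first phase long enough to hold
            # its guaranteed makespan.
            fits = [i for i in remaining if guaranteed_makespans[i] <= duration]
            phases.append((duration, fits))
            remaining = [i for i in remaining if i not in fits]
            duration *= factor
        return phases

    print(build_phases([0.8, 3.0, 1.5, 7.2]))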

Re-scheduling over the Grid

Each job submitted to an LRMS (Local Resource Management System) must provide mandatory information such as the number of requested computing resources and the requested duration of the resource usage, called the walltime. Because the application is killed if it has not finished by the end of the reservation, the walltime is an over-estimation of the actual duration of the application launched by the job.

In the context of a Grid composed of several clusters managed by a Grid middleware able to tune, submit, and cancel LRMS jobs, such over-estimations have an impact on the local scheduling and performance. Consequently, a previous grid scheduling decision, optimal at the time it was taken, may no longer be relevant. We have therefore designed and studied non-intrusive mechanisms allowing the middleware to migrate jobs still in the waiting queues of the different LRMS of the Grid platform. We also proposed different scheduling heuristics, integrated into these mechanisms, which decide on the migration of jobs. We performed an exhaustive set of simulation experiments in which parameters such as the load of each simulated parallel resource, the type of applications (rigid and moldable), and the dedication of the platform resources were varied. We analyzed the performance of our proposals with respect to several metrics, which revealed some counter-intuitive results.
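
As an illustration of the kind of decision these mechanisms have to take, the hypothetical sketch below migrates a waiting job only when another cluster offers a clearly earlier estimated start time; the estimator and the gain threshold are assumptions, not the heuristics studied in the simulations.

    # Sketch of a migration decision for a job still waiting in an LRMS queue.
    # estimated_start(cluster, job) is assumed to be provided by the middleware
    # (e.g., from the LRMS scheduling plan); the gain threshold is arbitrary.

    def should_migrate(job, current_cluster, clusters, estimated_start,
                       min_gain=300.0):
        """Return the best target cluster, or None if staying is better."""
        current = estimated_start(current_cluster, job)
        best_cluster, best_start = None, current
        for c in clusters:
            if c == current_cluster:
                continue
            start = estimated_start(c, job)
            if start < best_start - min_gain:   # migrate only for a clear gain
                best_cluster, best_start = c, start
        return best_cluster

    # Toy usage: queue length as a proxy for the estimated start time.
    queues = {"clusterA": 12, "clusterB": 3}
    est = lambda c, job: queues[c] * 600.0
    print(should_migrate("job42", "clusterA", queues, est))  # -> clusterB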

Parallel constraint-based local search

Constraint Programming emerged in the late 1980s as a successful paradigm for tackling complex combinatorial problems in a declarative manner. It lies at the crossroads of combinatorial optimization, constraint satisfaction problems (CSP), declarative programming languages, and SAT problems (boolean constraint solvers and verification tools). Up to now, the only parallel method for solving optimization problems that has been deployed at large scale is the classical branch and bound, because it requires little information to be communicated between parallel processes (basically, the current bound).

Adaptive Search was proposed in [89], [90] as a generic, domain-independent constraint-based local search method. This meta-heuristic takes advantage of the structure of the problem in terms of constraints and variables, and can guide the search more precisely than a single global cost function to optimize, such as the number of violated constraints. A thread-based parallelization of this algorithm on an IBM BladeCenter with 16 Cell/BE cores shows nearly ideal linear speed-ups for a variety of classical CSP benchmarks (magic squares, all-interval series, perfect square packing, etc.).

We parallelized the algorithm using the multi-start approach and ran experiments on the HA8000 machine, a Hitachi supercomputer with nearly 16,000 cores installed at the University of Tokyo, and on the Grid'5000 infrastructure, the French national Grid for research, which contains 5,934 cores deployed on 9 sites distributed over France. Results show that speedups can, surprisingly, be architecture dependent, and that although they continue to grow with the number of processors, the increase tends to level off for some problems beyond 128 processes. Work in progress considers communication between the computing resources.
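
The multi-start scheme itself is simple: independent local searches run in parallel and the best result wins. The sketch below illustrates it with Python processes and a toy cost function; it does not reproduce the constraint-guided moves of Adaptive Search nor the HA8000 or Grid'5000 deployments.

    # Sketch of the multi-start scheme: independent local searches are run in
    # parallel and the best solution wins.  The toy cost (number of adjacent
    # consecutive values in a permutation) stands in for a real CSP; Adaptive
    # Search itself guides moves with per-constraint errors, not shown here.
    import random
    from multiprocessing import Pool

    def local_search(seed, n=50, steps=10_000):
        rng = random.Random(seed)
        sol = list(range(n))
        rng.shuffle(sol)
        cost = lambda s: sum(1 for i in range(n - 1) if abs(s[i] - s[i + 1]) == 1)
        best = cost(sol)
        for _ in range(steps):
            i, j = rng.randrange(n), rng.randrange(n)
            sol[i], sol[j] = sol[j], sol[i]
            c = cost(sol)
            if c <= best:
                best = c
            else:                      # undo a worsening move
                sol[i], sol[j] = sol[j], sol[i]
        return best, sol

    if __name__ == "__main__":
        with Pool() as pool:           # one independent search per process
            results = pool.map(local_search, range(8))
        print(min(results, key=lambda r: r[0])[0])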

Service Discovery in Peer-to-Peer environments

Service discovery becomes a challenge in a large-scale, distributed context. Heterogeneity and dynamicity are the two main constraints that have to be taken into account in order to ensure reliability and system efficiency. In a heterogeneous context, the load of the service discovery system must therefore be balanced to achieve good performance. Moreover, QoS in such an uncertain and dynamic environment has to be ensured by fail-safe mechanisms (self-stabilization and replication). First, self-stabilization ensures that the system converges to a consistent configuration within a bounded time. Second, replication injects redundancy once the system is consistent. All these mechanisms will be validated and implemented. Furthermore, the service discovery system will interact with schedulers, batch submission systems, and the Storage Resource Broker, so the exchange protocols of these components have to be formally defined.

We decided to develop a new implementation, called Spades Based Middleware (Sbam), that includes all the concepts described above. This implementation, written in Java, relies on an efficient communication bus and has been developed according to advanced software engineering methods. The communication layer is based on the Ibis Portability Layer (IPL). Sbam has been evaluated with regard to the response time of service discovery requests. Our experiments show the efficiency and scalability of the proposed middleware, which was demonstrated at SuperComputing 2010.

On-Line Optimization of Publish/Subscribe Overlays

We continued the collaboration with the University of Nevada, Las Vegas, and studied the benefit of publish/subscribe overlays for the SPADES project. Loosely coupled applications can take advantage of the publish/subscribe communication paradigm. In this paradigm, subscribers declare which events, or which range of events, they wish to monitor, and are asynchronously informed whenever a publisher throws an event. In such a system, when a publication occurs, all peers whose subscriptions contain the publication must be informed. In our approach, the subscriptions are represented by a DR-tree, an R-tree in which each minimum bounding rectangle is supervised by a peer. Instead of attempting to statically optimize the DR-tree, we give an on-line algorithm, the work function algorithm, which continually changes the DR-tree in response to the sequence of publications, in an attempt to dynamically optimize the structure. The competitiveness of this algorithm is computed to be at most 5 for any instance with at most three subscriptions and an R-tree of height 2. The benefit of the on-line approach is that no prior knowledge of the distribution of publications in the attribute space is needed.
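
The sketch below only illustrates the matching semantics: a subscription is a rectangle in the attribute space, a publication is a point, and every peer whose rectangle contains the point must be notified. A real DR-tree organizes the rectangles hierarchically and is reorganized on-line, which is not shown here.

    # Sketch of the matching semantics: a subscription is a rectangle in the
    # attribute space and a publication is a point; every peer whose rectangle
    # contains the point must be notified.  A real DR-tree would organize the
    # rectangles hierarchically instead of scanning a flat list.

    def matching_subscribers(subscriptions, event):
        """subscriptions: {peer: ((xmin, ymin), (xmax, ymax))}, event: (x, y)."""
        x, y = event
        return [peer for peer, ((xmin, ymin), (xmax, ymax)) in subscriptions.items()
                if xmin <= x <= xmax and ymin <= y <= ymax]

    subs = {"p1": ((0, 0), (10, 10)), "p2": ((5, 5), (20, 20))}
    print(matching_subscribers(subs, (7, 8)))   # -> ['p1', 'p2']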

Décrypthon

In 2010, we added new features to, and fixed bugs in, the DIET WebBoard (a web interface for managing the Décrypthon Grid through DIET): support for multiple users of the same application, an improved database dumping method, statistics and charts, and storage space management. We deployed the newest version of DIET and the DIET WebBoard on the Décrypthon grid.

The MaxDO application “Help Cure Muscular Dystrophy, Phase 2” was ported to the World Community Grid. To determine the size of the work-units sent to World Community Grid users, we ran benchmarks on Grid'5000. The project was launched on May 14th, 2009 and has been running since then. By December 10th, 2010, a total of 30,000,549 work-unit results had been sent back by the World Community Grid volunteers, corresponding to 64,972,205,369 positions out of 137,652,178,995 (47.2% of the project; each work-unit contains hundreds of “positions” for two proteins, and the result is an energy value for each configuration). We are also checking and sorting the result files, reducing their size, and computing statistics for the volunteers (cf. http://graal.ens-lyon.fr/~nbard/WCGStats/ ). The project is expected to finish by the end of 2011. The most recent update of the MaxDO program on our university Décrypthon grid added a new interface enabling researchers to easily submit batches of work-units built from results missing or skipped by the World Community Grid's desktop grid.

Scheduling Applications with Complex Structure

As resources become more powerful but also more heterogeneous, application structures are becoming more complex, not only to harness the available power but also to model physical phenomena more accurately. Efficiently mapping and scheduling such applications onto resources is thus becoming more challenging. However, this is not possible with current resource management systems (RMS), which assume simple application models.

Therefore, we have carried out an initial, theoretical study of the gains one can obtain if the RMS could support rigid, fully-predictable evolving applications. We have proposed an offline scheduling algorithm with optional stretching capabilities. Experiments show that taking the evolution of resource requirements into account leads to significant improvements in all measured metrics, such as resource utilization and completion time. However, the stretching strategies we considered do not appear very valuable.
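
As a minimal illustration of the application model only, the sketch below describes an evolving application as a sequence of (duration, cores) phases and checks that a set of such applications with fixed start times never exceeds the cluster capacity; the actual offline algorithm and the stretching strategies are not reproduced.

    # Sketch: a rigid, fully-predictable evolving application described as a
    # sequence of (duration, cores) phases, and a check that a set of such
    # applications, each with a fixed start time, never exceeds the cluster
    # capacity.  The model is illustrative only.

    def usage_profile(start, phases):
        """Yield (time, cores) change points of one application."""
        t = start
        for duration, cores in phases:
            yield t, cores
            t += duration
        yield t, 0

    def fits(cluster_cores, scheduled):
        """scheduled: list of (start_time, phases).  True if capacity holds."""
        events = sorted(t for s, p in scheduled for t, _ in usage_profile(s, p))
        for t in events:
            used = 0
            for start, phases in scheduled:
                elapsed = t - start
                for duration, cores in phases:
                    if 0 <= elapsed < duration:
                        used += cores
                        break
                    elapsed -= duration
            if used > cluster_cores:
                return False
        return True

    app = [(10, 4), (20, 16), (5, 8)]          # phases: (duration, cores)
    print(fits(24, [(0, app), (10, app)]))     # two staggered instances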

Next, we have started revisiting RMS design to enable efficient resource selection for complex applications. In 2010, we have focused on moldable applications. We have proposed CooRM, an RMS architecture which delegates the mapping and scheduling responsibility to the applications themselves. Simulations as well as a proof-of-concept implementation of CooRM show that the approach is feasible and performs well in terms of scalability and fairness.

As future work, we plan to extend CooRM to support evolving and malleable applications. With respect to its applicability to existing systems, we will study its integration into XtreemOS and Salome.

High Level Component Model

Most software component models focus on the reuse of existing pieces of code called primitive components. There are however many other elements that can be reused in component-based applications. Partial assemblies of components, well defined interactions between components and existing composition patterns (a.k.a. software skeletons) are examples of such reusable elements. It turns out that such elements of reuse are important for parallel and distributed applications.

Therefore, we have designed the High Level Component Model (HLCM), a software component model that supports the reuse of these elements thanks to the concepts of hierarchy, genericity and connectors, and in particular the novel concept of open connections. Moreover, HLCM supports multiple implementations of its elements so as to allow applications to be optimized for various hardware resources. HLCMi, an implementation of HLCM, has enabled us to validate the approach: algorithmic skeletons as well as parallel interactions such as data sharing, collective communications, and parallel method invocations have been successfully implemented.

Ongoing work includes further evaluations of HLCM with the OpenAtom application—in collaboration with Prof. Kale's team at the University of Illinois at Urbana-Champaign. Furthermore, the model will be used for the development of applications based on the MapReduce paradigm and for their efficient execution on Clouds and desktop grids in the context of the MapReduce ANR project.

Adaptive Mesh Refinement and Component Models

In 2010, we have studied whether component models can help deal with complex application structures such as those found in adaptive mesh refinement (AMR) applications. This kind of application relies on dynamic and recursive data structures to adapt the computation grain to the simulation requirements. Though very effective at decreasing the computation load, AMR is seldom used because it is complex to implement.
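
The grain-adaptation step at the heart of AMR can be illustrated with a toy one-dimensional sketch: cells where the solution varies rapidly are recursively split. The criterion and data layout below are illustrative only and are unrelated to the component-based designs discussed next.

    # Toy 1-D illustration of the AMR idea: cells whose temperature jump to the
    # next cell exceeds a threshold are split, recursively refining the mesh
    # where the solution varies quickly.  This is only the grain-adaptation
    # step, not a heat solver nor the component-based designs studied here.

    def refine(xs, values, threshold, max_level=3, level=0):
        """xs: cell centers, values: temperature per cell (same length)."""
        if level == max_level:
            return xs, values
        new_xs, new_vals = [], []
        for i, (x, v) in enumerate(zip(xs, values)):
            jump = abs(values[i + 1] - v) if i + 1 < len(values) else 0.0
            if jump > threshold:
                mid = (x + xs[i + 1]) / 2          # insert a midpoint cell
                new_xs += [x, mid]
                new_vals += [v, (v + values[i + 1]) / 2]
            else:
                new_xs.append(x)
                new_vals.append(v)
        if len(new_xs) == len(xs):                 # nothing left to refine
            return new_xs, new_vals
        return refine(new_xs, new_vals, threshold, max_level, level + 1)

    print(refine([0.0, 1.0, 2.0, 3.0], [0.0, 0.1, 5.0, 5.1], 1.0)[0])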

We have therefore evaluated the feasibility of designing and implementing an AMR application, based on the heat equation, on two component models: ULCM and SALOME. These models provide many of the required features, but more are needed. Composites and dynamic management, as found in ULCM, greatly ease application design, but user-defined skeletons and a mechanism to deal with domain decomposition would also be welcome. HLCM makes it possible to define user-defined skeletons, but the issue of handling domain decomposition remains open.

We are investigating this problem targeting an application made of the coupling of several instances of Code_Aster, a thermomechanical calculation code from EDF R&D.

Cloud Resource Provisioning

Cloud client applications are able to scale dynamically based on their usage. This leads to more efficient resource usage and, as a consequence, to cost savings. The problem is non-trivial, as virtual resources have a setup time that cannot be neglected. Several approaches can be used to decide accurately when a Cloud client application needs to scale. We have focused our attention on an approach that allows a Cloud client to scale its platform while compensating for the virtual resource setup time. Our approach uses self-similarities in the Cloud client's platform usage to predict resource usage in advance: it identifies patterns in the client's past platform usage, which allows us to make usage predictions with considerable accuracy. We also showed that the prediction accuracy of our approach can be increased by increasing the size of the history database used for matching.
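
A minimal sketch of the pattern-matching idea is given below: the past window most similar to the recent usage is located, and the values that followed it are taken as the prediction. The window length, distance measure, and horizon are illustrative assumptions, not the parameters of the evaluated approach.

    # Sketch of history-based prediction: find the past window most similar to
    # the recent usage (Euclidean distance) and predict that the values which
    # followed it will repeat.  Window, distance, and horizon are illustrative.

    def predict(history, window=6, horizon=3):
        recent = history[-window:]
        best_dist, best_pos = float("inf"), None
        for i in range(len(history) - window - horizon):
            candidate = history[i:i + window]
            dist = sum((a - b) ** 2 for a, b in zip(candidate, recent))
            if dist < best_dist:
                best_dist, best_pos = dist, i
        if best_pos is None:
            return [history[-1]] * horizon       # not enough history
        return history[best_pos + window:best_pos + window + horizon]

    usage = [2, 2, 3, 5, 8, 8, 5, 3, 2, 2, 3, 5, 8, 8, 5, 3, 2, 2, 3, 5, 8, 8]
    print(predict(usage))                        # echoes the repeating pattern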

Infrastructure-as-a-Service Clouds are a flexible and fast way to obtain (virtual) resources as demand varies. Grids, on the other hand, are middleware platforms able to combine resources from different administrative domains for task execution. Clouds can be used by grids as providers of resources such as virtual machines, so that grids only use the resources they need at any given moment, but this requires grids to be able to decide when to allocate and release those resources. We analyzed, by simulation, an economic approach to set resource prices and to decide when to scale resources depending on the users' demand. The results show how the proposed system can successfully adapt to the demand, while at the same time ensuring that resources are fairly shared among users.

Towards Data Desktop Grid

Desktop Grids use the computing, network and storage resources of idle desktop PCs distributed over multiple LANs or the Internet to run a large variety of resource-demanding distributed applications. While these applications need to access, compute, store and circulate large volumes of data, little attention has been paid to data management in such large-scale, dynamic, heterogeneous, volatile and highly distributed Grids. In most cases, data management relies on ad-hoc solutions, and providing a general approach is still a challenging issue.

We have proposed the BitDew framework which addresses the issue of how to design a programmable environment for automatic and transparent data management on computational Desktop Grids. BitDew relies on a specific set of meta-data to drive key data management operations, namely life cycle, distribution, placement, replication and fault-tolerance with a high level of abstraction.
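
The sketch below illustrates the metadata-driven idea only: each data item carries attributes that drive placement and replication decisions. The attribute names and the placement rule are hypothetical and do not correspond to the actual BitDew API.

    # Sketch of metadata-driven data management: each data item carries
    # attributes that drive replication and placement decisions.  Attribute
    # names and the placement rule below are hypothetical, not the BitDew API.

    data_items = {
        "inputs.tar":  {"replicas": 3, "lifetime": "7d", "fault_tolerant": True},
        "scratch.tmp": {"replicas": 1, "lifetime": "1h", "fault_tolerant": False},
    }

    def placement_plan(item, attrs, hosts):
        """Pick as many hosts as the requested replica count (round-robin)."""
        return [(item, hosts[i % len(hosts)]) for i in range(attrs["replicas"])]

    hosts = ["node-a", "node-b", "node-c"]
    for item, attrs in data_items.items():
        print(placement_plan(item, attrs, hosts))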

Since July 2010, in collaboration with the University of Sfax, we have been developing a data-aware, parallel version of Magik, an application for Arabic writing recognition, using the BitDew middleware. We are targeting digital libraries, which require a distributed computing infrastructure to store the large number of digitized books as raw images and, at the same time, to perform automatic processing of these documents such as OCR, translation, indexing, and searching.

In collaboration with the G.V. Kurdyumov Institute for Metal Physics and the LAL/IN2P3, we have developed a Desktop Grid version of the SLinCA (Scaling Laws in Cluster Aggregation) application. SLinCA simulates several general scenarios of monomer aggregation in clusters, with many initial configurations of monomers (random, regular, etc.), different kinetics laws (arbitrary, diffusive, ballistic, etc.), and various interaction laws (arbitrary, elastic, non-elastic, etc.). The typical simulation of one cluster aggregation process with 10 monomers takes approximately 1-7 days on a single modern processor, depending on the number of Monte Carlo steps (MCS). However, thousands of scenarios have to be simulated with different initial configurations to obtain statistically reliable results. To calculate the parameters of evolving aggregates (moments of probability density distributions, cumulative density distributions, scaling exponents, etc.) with appropriate accuracy (up to 2-4 significant digits), better statistics are needed (10^4 to 10^8 runs of many different statistical realizations of aggregating ensembles), comparable with the accuracy of the available experimental data. These separate simulation runs, for different physical parameters, initial configurations, and statistical realizations, are completely independent and can easily be split among the available CPUs in a “parameter sweeping” manner of parallelism. The large number of runs needed to reduce the standard deviation of the Monte Carlo simulations is distributed equally among the available workers, and the partial results are combined at the end to compute the final result.
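
The parameter-sweep pattern itself can be sketched as follows, with a dummy per-run observable standing in for a SLinCA simulation: batches of independent runs are distributed among workers and their partial sums are combined at the end.

    # Sketch of the parameter-sweep pattern: independent Monte Carlo runs are
    # split evenly among workers and combined at the end.  The toy "simulation"
    # below stands in for SLinCA.
    import random
    from multiprocessing import Pool

    def run_batch(args):
        seed, n_runs, kinetics = args
        rng = random.Random(seed)
        # Dummy observable per run; a real run would simulate cluster aggregation.
        samples = [rng.gauss(1.0 if kinetics == "diffusive" else 2.0, 0.1)
                   for _ in range(n_runs)]
        return sum(samples), len(samples)

    if __name__ == "__main__":
        total_runs, workers = 10_000, 8
        batches = [(seed, total_runs // workers, "diffusive")
                   for seed in range(workers)]
        with Pool(workers) as pool:
            partial = pool.map(run_batch, batches)
        s = sum(p[0] for p in partial)
        n = sum(p[1] for p in partial)
        print("mean observable:", s / n)         # combined final result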

MapReduce Programming Model for Desktop Grids

MapReduce is an emerging programming model for data-intensive applications proposed by Google, which has recently attracted a lot of attention. MapReduce borrows from functional programming: the programmer defines Map and Reduce tasks executed on large sets of distributed data. In 2010, we have developed an implementation of the MapReduce programming model based on the BitDew middleware. Our prototype features several optimizations which make our approach suitable for large-scale and loosely connected Internet Desktop Grids: massive fault tolerance, replica management, barrier-free execution, latency hiding, as well as distributed result checking. We have presented performance evaluations of the prototype against both micro-benchmarks and real MapReduce applications. The scalability test shows that we achieve linear speedup on the classical WordCount benchmark. Several scenarios involving lagging hosts and host crashes demonstrate that the prototype is able to cope with an experimental context similar to real-world Internet conditions.
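
For readers unfamiliar with the model, the sketch below expresses the WordCount benchmark in MapReduce style, with the grouping step done in memory; the actual prototype distributes map tasks, reduce tasks, and data over BitDew.

    # WordCount expressed in the MapReduce style: the programmer only defines
    # Map and Reduce; the grouping step is done here in memory, whereas the
    # prototype distributes tasks and data over BitDew.
    from collections import defaultdict

    def map_task(document):
        return [(word, 1) for word in document.split()]

    def reduce_task(word, counts):
        return word, sum(counts)

    def mapreduce(documents):
        groups = defaultdict(list)
        for doc in documents:
            for key, value in map_task(doc):
                groups[key].append(value)
        return dict(reduce_task(k, v) for k, v in groups.items())

    print(mapreduce(["to be or not to be", "to err is human"]))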

SpeQuloS: Providing Quality-of-Service to Desktop Grids using Cloud resources

EDGI is an FP7 European project, following the successful FP7 EDGeS project, whose goal is to build a Grid infrastructure composed of "Desktop Grids", such as BOINC or XtremWeb, where computing resources are provided by Internet volunteers; "Service Grids", where computing resources are provided by institutional Grids such as EGEE, gLite, and Unicore; and "Cloud systems", such as OpenNebula and Eucalyptus, where resources are provided on demand. The goal of the EDGI project is to provide an infrastructure where Service Grids are extended with public and institutional Desktop Grids and Clouds.

The main problem with the current infrastructure is that it cannot provide any QoS support for applications running in the Desktop Grid (DG) part of the infrastructure. For example, a public DG system may let clients return work-unit results within weeks. Although some EGEE applications (e.g. those of the fusion community) can tolerate such a long latency, most user communities want much smaller latencies.

In 2010, we have started the development and deployment of the SpeQuloS middleware to solve this critical problem.

We define QoS concretely as a probabilistic guarantee on job makespan or throughput. Providing QoS features even in Service Grids is hard and not yet solved satisfactorily. It is even more difficult in an environment with no guaranteed resources. In DG systems, resources can leave the system at any time, for a long time or forever, even after taking several work-units with the promise of computing them. Our approach is based on extending DG systems with Cloud resources. For such critical work-units, the SpeQuloS system is able to dynamically deploy fast and trustworthy clients from Clouds that are available to support the EDGI DG systems. It decides how many trusted clients and Cloud clients to assign to the QoS applications. At this stage, the prototype is functional and the first version is planned to be delivered to the EDGI production infrastructure during spring 2011.
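
One possible decision rule is sketched below: if the batch is not on track to meet its deadline at the observed desktop-grid throughput, the number of Cloud workers needed for the remainder is estimated. The rate estimates and the linear model are illustrative assumptions, not the SpeQuloS policy.

    # Sketch of a QoS decision: if the batch is not on track to finish before
    # its deadline at the observed desktop-grid throughput, estimate how many
    # Cloud workers are needed for the remainder.  The rates and the linear
    # model are illustrative assumptions.

    def cloud_workers_needed(total_wu, done_wu, elapsed, deadline,
                             cloud_wu_rate_per_worker):
        remaining_wu = total_wu - done_wu
        remaining_time = deadline - elapsed
        dg_rate = done_wu / elapsed if elapsed > 0 else 0.0
        if dg_rate * remaining_time >= remaining_wu:
            return 0                              # desktop grid alone suffices
        deficit = remaining_wu - dg_rate * remaining_time
        per_worker = cloud_wu_rate_per_worker * remaining_time
        return int(-(-deficit // per_worker))     # ceiling division

    print(cloud_workers_needed(total_wu=1000, done_wu=200, elapsed=3600,
                               deadline=7200, cloud_wu_rate_per_worker=0.02))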

Performance evaluation and modeling

Simulation is a popular approach to obtain objective performance indicators for platforms that are not at one's disposal. It may, for example, help dimension compute clusters in large computing centers. In many cases, the execution of a distributed application does not behave as expected, and it is thus necessary to understand what causes this behavior. Simulation provides the possibility to reproduce experiments under similar conditions, which makes it a suitable method for the experimental validation of a parallel or distributed application.

The tracing instrumentation of a profiling tool saves all the information about the execution of an application at run time. Every scientific application computes floating-point operations (flops). The originality of our approach is that we measure the flops of the application rather than its execution time. This means that if a distributed application executed on N cores is executed again with two processes mapped to each core, only N/2 cores are needed, at the cost of a longer execution time. An execution trace of an instrumented application can be transformed into a corresponding list of actions, and these actions can then be simulated by SimGrid. Moreover, the execution traces will contain almost the same data, because the only change is the use of half the cores with the same number of processes; the number of flops is not affected, so the simulation time does not increase because of this overhead. The Grid'5000 platform is used for this work, and the NAS Parallel Benchmarks are used to measure the performance of the clusters.
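
The key property, that the flop count is independent of the mapping while the execution time is not, can be sketched as follows; the toy replay below ignores communications and is of course not the SimGrid simulation engine.

    # Sketch of flop-based trace replay: the trace records the work of each
    # process in flops, which does not depend on the mapping.  Replaying it
    # with one or two processes per core changes the simulated time only.

    def simulated_time(trace_flops, core_speed, processes_per_core):
        """trace_flops: flops computed by each process (one entry per process)."""
        effective_speed = core_speed / processes_per_core   # core is time-shared
        return max(flops / effective_speed for flops in trace_flops)

    trace = [4e9] * 8                    # 8 processes, 4 Gflop each
    print(simulated_time(trace, core_speed=2e9, processes_per_core=1))  # 8 cores
    print(simulated_time(trace, core_speed=2e9, processes_per_core=2))  # 4 cores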

