Project : paris
Section: New Results
Large-scale data management for grids
Participant : Yvon Jégou.
Providing data to applications is a major problem in grid computing. An application can execute on a site only when its data are present in the "data space" of that site, so the data must be moved from the production sites to the execution sites. Moreover, in the high-performance simulation domain, the applications are themselves parallel programs and the grid sites are clusters of computation nodes. Each process of the parallel application needs only part of the input data and produces only part of the results. Duplicating the input data from a central server and then gathering the results after the execution can be expensive.
The participation of the Paris Project-Team in the e-Toile project (http://www.urec.cnrs.fr/etoile/) aims at experimenting with Distributed Shared Memory (DSM) technology for the implementation of uniform data-naming and data-sharing services for grid computing. The current implementation is based on the Mome DSM. A Mome daemon process is launched in the background on each node of the grid. When the execution of an application starts on Mome-aware computation nodes, each of its parallel processes connects to the local Mome daemon. The data-repository interface provides entry points for creating and locating segments in the DSM (through a kind of directory), and for mapping these segments into the local address space of the process. The data repository is persistent: the segments retain their data after all application processes have disconnected. The application processes can fail (or be killed) without impacting the DSM. The system thus provides a kind of uniform data space to the grid applications.
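The create/locate/map cycle and the persistence property described above can be sketched as follows. This is a toy, single-process model and not the Mome API: the class `SegmentRepository`, its methods `create`, `locate` and `map`, and the segment name `mesh/input` are all hypothetical names chosen for illustration.

```python
class SegmentRepository:
    """Toy stand-in for the segment directory (hypothetical API, not Mome's)."""

    def __init__(self):
        # Segment name -> backing storage. The storage belongs to the
        # repository, not to any client, so it outlives client "processes".
        self._segments = {}

    def create(self, name, size):
        if name in self._segments:
            raise KeyError(f"segment {name!r} already exists")
        self._segments[name] = bytearray(size)

    def locate(self, name):
        """Directory lookup: report whether a segment with this name exists."""
        return name in self._segments

    def map(self, name):
        # A memoryview is a writable window onto the segment, without copying.
        return memoryview(self._segments[name])


def producer(repo):
    """First 'process': creates a segment, writes into it, then disconnects."""
    repo.create("mesh/input", 16)
    view = repo.map("mesh/input")
    view[0:4] = b"\x01\x02\x03\x04"
    view.release()


def consumer(repo):
    """Second 'process': locates the segment by name and reads it back."""
    if not repo.locate("mesh/input"):
        raise LookupError("segment not found")
    view = repo.map("mesh/input")
    data = bytes(view[0:4])
    view.release()
    return data


repo = SegmentRepository()
producer(repo)
result = consumer(repo)
```

The consumer sees the bytes written by the producer even though the producer has already released its mapping, which is the persistence behaviour the paragraph describes.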
The Mome DSM behaves as a COMA (Cache-Only Memory Architecture) for page management: a copy of a page is present on the nodes that recently used the page, and a page fault is served directly by one of the nodes holding a valid copy of the page. This strategy is well suited to the case where the computation nodes are clusters. The data never transit through a centralized server.
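A minimal sketch of this page-management behaviour is given below. It is a simplification under stated assumptions: the names `Node`, `ToyDsm`, `publish` and `read` are hypothetical, the `holders` table is a shortcut for whatever mechanism the real DSM uses to find valid copies (the source does not describe it), and invalidation on write is left out. Only the page data path is modelled: data always moves from one node's memory to another's.

```python
class Node:
    def __init__(self, name):
        self.name = name
        self.pages = {}  # page number -> bytes, valid local copies only


class ToyDsm:
    """Toy cache-only page management (hypothetical model, not Mome itself)."""

    def __init__(self, nodes):
        self.nodes = {n.name: n for n in nodes}
        # Assumption: some lookup structure tells us who holds a valid copy.
        self.holders = {}  # page number -> node names, most recent user last

    def publish(self, node_name, page, data):
        """A node produces the first copy of a page."""
        self.nodes[node_name].pages[page] = data
        self.holders[page] = [node_name]

    def read(self, node_name, page):
        """Return (data, serving node name), or (data, None) on a local hit."""
        node = self.nodes[node_name]
        if page in node.pages:
            return node.pages[page], None
        # Page fault: served directly by a node holding a valid copy.
        server = self.holders[page][-1]
        data = self.nodes[server].pages[page]
        node.pages[page] = data  # the faulting node now caches the page
        self.holders[page].append(node_name)
        return data, server


dsm = ToyDsm([Node("a"), Node("b"), Node("c")])
dsm.publish("a", 0, b"page-0")
first = dsm.read("b", 0)   # fault on b, served by a
second = dsm.read("c", 0)  # fault on c, served by b (most recent user)
third = dsm.read("b", 0)   # local hit on b, no transfer
```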
The current version of Mome considers a flat organization of the DSM nodes. On a grid infrastructure, the performance of the communication system inside a grid node (a cluster) is higher than between grid nodes. The DSM should be aware of this structure. A new hierarchical organization of the Mome DSM has been defined and will be implemented in the near future. This new organization will favor local communications: inter-cluster communications will be avoided as long as page-faults can be served locally. This hierarchical structure will also allow the exploitation of the DSM on a large number of nodes (hundreds of DSM nodes).
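The locality preference of the planned hierarchical organization can be illustrated with a small selection function. This is only a sketch of the stated policy (serve a page fault inside the cluster whenever a local copy exists), not the design that was actually defined; `choose_server`, `cluster_of` and the node names are hypothetical.

```python
def choose_server(requester, holders, cluster_of):
    """Pick the node that serves a page fault, preferring the local cluster.

    requester  -- name of the faulting node
    holders    -- names of the nodes holding a valid copy of the page
    cluster_of -- mapping from node name to cluster name
    Returns (server name, "intra-cluster" or "inter-cluster").
    """
    candidates = [h for h in holders if h != requester]
    if not candidates:
        raise LookupError("no other node holds a valid copy")
    local = [h for h in candidates if cluster_of[h] == cluster_of[requester]]
    if local:
        return local[0], "intra-cluster"
    # Only when no local copy exists does the request leave the cluster.
    return candidates[0], "inter-cluster"


cluster_of = {"r1": "rennes", "r2": "rennes", "l1": "lyon", "l2": "lyon"}

# A copy exists in the requester's own cluster: no inter-cluster traffic.
local_choice = choose_server("r1", ["l1", "r2"], cluster_of)

# Copies exist only in the other cluster: the fault must cross clusters.
remote_choice = choose_server("r1", ["l1", "l2"], cluster_of)
```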
Mome currently runs in user space and requires no kernel modification. However, applications must use specific library calls in order to exploit the Mome data space. In the future, we plan to interface the Mome daemons with the Linux kernel through a kernel module. The DSM space will then become accessible through a classical file system interface, without modifying the applications.
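To illustrate what access through a classical file system interface means for an application, the sketch below uses only standard file and `mmap` calls. It assumes, hypothetically, that a DSM segment would appear as an ordinary file under some mount point; here a temporary directory and a file named `segment0` stand in for that mount point, since the planned kernel module is not described in more detail.

```python
import mmap
import os
import tempfile


def write_block(path, offset, payload):
    """Map the file read-write and store payload at the given offset."""
    with open(path, "r+b") as f, mmap.mmap(f.fileno(), 0) as m:
        m[offset:offset + len(payload)] = payload


def read_block(path, offset, length):
    """Map the file read-only and return length bytes from offset."""
    with open(path, "rb") as f, \
            mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as m:
        return m[offset:offset + length]


# Stand-in for a mounted DSM space (assumption: one file per segment).
with tempfile.TemporaryDirectory() as mount_point:
    segment = os.path.join(mount_point, "segment0")
    with open(segment, "wb") as f:
        f.truncate(4096)  # fixed-size, zero-filled segment
    write_block(segment, 128, b"results")
    recovered = read_block(segment, 128, 7)
```

Nothing in this code is specific to a DSM, which is the point: an unmodified application that already reads and maps files would work on such a mount.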
With JuxMem, we propose the concept of a data sharing service for grid computing, as a compromise between two rather different kinds of data sharing systems: (1) DSM systems, which propose consistency models and protocols for efficient, transparent management of mutable data on static, small-scale configurations (tens of nodes); (2) P2P systems, which have proven adequate for the management of immutable data on highly dynamic, large-scale configurations (millions of nodes).
These two classes of systems have been designed and studied in very different contexts. In DSM systems, the nodes are generally under the control of a single administration, and the resources are trusted. In contrast, P2P systems aggregate resources located at the edge of the Internet, with no trust guarantee and loose control. Moreover, these numerous resources are essentially heterogeneous in terms of processors, operating systems and network links, whereas the nodes of DSM systems are generally homogeneous. Finally, DSM systems are typically used to support complex numerical simulation applications, where data are accessed in parallel by multiple nodes, while P2P systems generally serve as a support for storing and sharing immutable files.
Our data sharing service targets physical architectures with features intermediate between those of DSM and P2P systems. We address scales of the order of thousands of nodes, organized as a federation of clusters, say tens of hundred-node clusters. At a global level, the resources are thus rather heterogeneous, while they can probably be considered homogeneous within each individual cluster. The degrees of control and trust are also intermediate, since the clusters may belong to different administrations, which set up agreements on the sharing protocol. Finally, we target numerical applications such as heavy simulations built by coupling individual codes. These simulations process large amounts of data, with significant requirements in terms of data storage and sharing.
The main contribution of such a service is to decouple data management from grid computation, by providing location transparency as well as data persistence in a dynamic environment.
In order to tackle the issues described above, we have defined an architecture proposal for a data sharing service. This architecture mirrors a federation of distributed clusters and is therefore hierarchical. It is illustrated through a software platform called JuxMem (for Juxtaposed Memory). A detailed description of this architecture is given in . The architecture consists of a network of peer groups (cluster groups), each of which generally corresponds to a cluster at the physical level. All the groups are inside a wider group (the juxmem group), which includes all the peers that run the service. Each cluster group consists of a set of nodes that provide memory for data storage (the providers). In each cluster group, one node (the cluster manager) manages the memory made available by the providers of the group. Any node, including providers and cluster managers, can use the service as a client to allocate, read or write data. All providers that host copies of the same data block make up a data group, which is identified by an ID. To read or write a data block, clients only need to specify this ID: the platform transparently locates the corresponding data block. Consistency of replicated blocks is also handled transparently (according to the sequential consistency model, in the current version). In order to tolerate the volatility of peers, the number of copies of each data block is monitored dynamically, and new copies are created when necessary to maintain a given redundancy degree. Cluster manager roles are also replicated, to enhance cluster availability.
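The roles described above (providers, data groups addressed by ID, and monitoring of the redundancy degree) can be sketched as a toy model of a single cluster group. This is not the JuxMem API: `Provider`, `ToyService`, `alloc`, `read`, `write` and `monitor` are hypothetical names, the service object plays the part of the cluster manager, and writing to every live copy in turn is a deliberately naive stand-in for the actual consistency protocol.

```python
import itertools


class Provider:
    """A node offering memory for data storage."""

    def __init__(self, name):
        self.name = name
        self.blocks = {}   # data ID -> bytes
        self.alive = True  # set to False to simulate a volatile peer


class ToyService:
    """Toy model of one cluster group (hypothetical, not the JuxMem API)."""

    def __init__(self, providers, redundancy):
        self.providers = providers
        self.redundancy = redundancy
        self.data_groups = {}  # data ID -> providers hosting a copy
        self._ids = itertools.count(1)

    def _live_spares(self, group):
        return [p for p in self.providers if p.alive and p not in group]

    def alloc(self, size):
        """Allocate a zero-filled block and return the ID of its data group."""
        group = self._live_spares([])[: self.redundancy]
        if len(group) < self.redundancy:
            raise RuntimeError("not enough live providers")
        data_id = next(self._ids)
        for p in group:
            p.blocks[data_id] = bytes(size)
        self.data_groups[data_id] = group
        return data_id

    def write(self, data_id, data):
        for p in self.data_groups[data_id]:
            if p.alive:
                p.blocks[data_id] = data

    def read(self, data_id):
        # The client gives only the ID; the copy is located for it.
        for p in self.data_groups[data_id]:
            if p.alive:
                return p.blocks[data_id]
        raise LookupError("no live copy of this block")

    def monitor(self):
        """Re-create copies so each data group regains the redundancy degree."""
        for data_id, group in self.data_groups.items():
            survivors = [p for p in group if p.alive]
            if not survivors:
                continue  # every copy was lost; nothing to replicate from
            for spare in self._live_spares(survivors):
                if len(survivors) >= self.redundancy:
                    break
                spare.blocks[data_id] = survivors[0].blocks[data_id]
                survivors.append(spare)
            self.data_groups[data_id] = survivors


providers = [Provider(f"p{i}") for i in range(1, 5)]
service = ToyService(providers, redundancy=2)
block_id = service.alloc(5)
service.write(block_id, b"hello")
providers[0].alive = False            # one provider of the data group leaves
after_failure = service.read(block_id)
service.monitor()                     # a new copy is created on a spare
group_names = [p.name for p in service.data_groups[block_id]]
```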
As a proof of concept, we have built a software prototype using the JXTA generic peer-to-peer framework, which provides basic building blocks for user-defined peer-to-peer services. As a first evaluation, we have measured the influence of peer volatility on the overall behavior of the service. Details are given in .