Section: Scientific Foundations
Providing Access to HPC Servers on the Grid
Participants: Nicolas Bard, Julien Bigot, Raphaël Bolze, Hinde Bouziane, Yves Caniou, Eddy Caron, Aurélien Ceyden, Ghislain Charrier, Florent Chuffart, Benjamin Depardon, Frédéric Desprez, Gilles Fedak, Jean-Sébastien Gay, Haiwu He, Cristian Klein, David Loureiro, Christian Pérez, Vincent Pichon, Bing Tang.
Resource management is one of the key issues for the development of efficient Grid environments. Several approaches coexist in today's middleware platforms. The computational (or communication) granularity and the dependences between computations also strongly influence software choices. Two possible approaches are identified below.
One approach provides the user with a uniform view of resources. This is the case of GLOBUS (http://www.globus.org/), which provides transparent MPI communications (with MPICH-G2) between distant nodes but does not handle load balancing between these nodes. It is up to the user to write code that takes the heterogeneity of the target architecture into account. The classical batch processing paradigm can also be used on the Grid with projects like Condor-G (http://www.cs.wisc.edu/condor/condorg/) or Sun GridEngine (http://wwws.sun.com/software/gridware/). Finally, peer-to-peer or global computing can be used for fine-grain, loosely coupled applications.
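Since this uniform view leaves load balancing to the application, one simple option for the programmer is a static partition of the work in proportion to each node's speed. The Python sketch below assumes the relative node speeds are known in advance (for example from a benchmark run); `partition_by_speed` is a hypothetical helper, not part of Globus or MPICH-G2.

```python
from typing import List


def partition_by_speed(n_items: int, speeds: List[float]) -> List[int]:
    """Split n_items work items across nodes in proportion to their relative
    speeds, using the largest-remainder method so the counts sum to n_items."""
    if n_items < 0:
        raise ValueError("n_items must be non-negative")
    if not speeds or any(s <= 0 for s in speeds):
        raise ValueError("speeds must be a non-empty list of positive numbers")
    total = sum(speeds)
    ideal = [n_items * s / total for s in speeds]  # fractional ideal shares
    counts = [int(x) for x in ideal]               # round each share down
    leftover = n_items - sum(counts)
    # Hand the remaining items to the nodes with the largest fractional parts.
    by_remainder = sorted(range(len(speeds)),
                          key=lambda i: ideal[i] - counts[i], reverse=True)
    for i in by_remainder[:leftover]:
        counts[i] += 1
    return counts
```

For instance, with relative speeds 1, 2, and 2, 100 items are split as [20, 40, 40], and each process would then work on its own share. A static split like this is only as good as the speed estimates; it does not react to load changes during the run.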
Another approach provides semi-transparent access to computing servers by submitting jobs to dedicated servers. This model is known as the Application Service Provider (ASP) model, where providers offer computing resources (hardware and software), not necessarily for free, to clients in the same way as Internet providers offer network resources to clients. The programming granularity of this model is rather coarse. One of its advantages is that end users do not need to be experts in parallel programming to benefit from high-performance parallel programs and computers. This model is closely related to the classical Remote Procedure Call (RPC) paradigm. On a Grid platform, RPC (or GridRPC) gives a Web browser, a Problem Solving Environment, or a simple client program written in C, Fortran, or Java easy access to available resources. It also provides more transparency by hiding the search for and allocation of computing resources. We favor this second approach.
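To make the RPC model concrete, here is a minimal in-process Python sketch, assuming a toy middleware rather than any real GridRPC implementation. The client holds a handle that names a service, not a machine, and the middleware binds that handle to a server on first use. This loosely mirrors the function-handle idea of the GridRPC API, but `ToyMiddleware`, `FunctionHandle`, and their methods are hypothetical names, and no network transfer or serialization takes place.

```python
from typing import Any, Callable, Dict, Optional


class ToyMiddleware:
    """Hypothetical stand-in for an RPC middleware: it records which servers
    offer which services and selects one on the client's behalf."""

    def __init__(self) -> None:
        self._servers: Dict[str, Dict[str, Callable[..., Any]]] = {}

    def register(self, server: str, service: str, impl: Callable[..., Any]) -> None:
        self._servers.setdefault(server, {})[service] = impl

    def locate(self, service: str) -> str:
        candidates = sorted(s for s, svcs in self._servers.items() if service in svcs)
        if not candidates:
            raise LookupError(f"no server offers {service!r}")
        return candidates[0]  # placeholder policy; a real middleware would schedule here

    def execute(self, server: str, service: str, *args: Any) -> Any:
        # In a real system the arguments would be shipped to the remote server.
        return self._servers[server][service](*args)


class FunctionHandle:
    """Client-side handle that names a service, not a machine."""

    def __init__(self, middleware: ToyMiddleware, service: str) -> None:
        self._mw = middleware
        self.service = service
        self.server: Optional[str] = None  # bound by the middleware on first call

    def call(self, *args: Any) -> Any:
        if self.server is None:
            self.server = self._mw.locate(self.service)
        return self._mw.execute(self.server, self.service, *args)
```

For example, after registering a `dot` service on a server called `node-b`, calling `FunctionHandle(mw, "dot").call([1, 2, 3], [4, 5, 6])` returns 32 without the client ever naming `node-b`.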
In a Grid context, the second approach requires middleware environments that facilitate client access to remote resources. In the ASP approach, a common way for clients to request resources to solve their problem is to submit a request to the middleware. The middleware finds the most appropriate server, which then solves the problem on behalf of the client using specific software. Several environments, usually called Network Enabled Servers (NES), implement this paradigm: NetSolve, Ninf, NEOS, OmniRPC, and more recently Diet, developed in the Graal project (see Section 5.1). A common feature of these environments is that they are built on top of five components: clients, servers, databases, monitors, and schedulers. Clients submit computational requests, which are solved on servers found by the NES. The NES schedules the requests on the different servers using performance information obtained by monitors and stored in a database.
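The interplay of these components can be illustrated with a small Python sketch. It assumes a minimum-completion-time policy, in which each request goes to the server whose estimated finish time (queued work plus new work, divided by measured speed) is smallest; this is one common heuristic, not necessarily the one used by NetSolve, Ninf, NEOS, OmniRPC, or Diet. All class names are hypothetical, and servers appear only as database entries, so no actual computation is executed.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class ServerInfo:
    speed: float              # work units per second, as measured by a monitor
    services: List[str]       # problems this server can solve
    queued_work: float = 0.0  # work assigned but not yet completed


class Database:
    """Holds the latest performance information about each server."""

    def __init__(self) -> None:
        self.servers: Dict[str, ServerInfo] = {}


class Monitor:
    """Measures a server and stores the result in the database."""

    def __init__(self, db: Database) -> None:
        self.db = db

    def report(self, name: str, speed: float, services: List[str]) -> None:
        old = self.db.servers.get(name)
        queued = old.queued_work if old else 0.0  # keep already-booked work
        self.db.servers[name] = ServerInfo(speed, list(services), queued)


class Scheduler:
    """Maps each request to the server with the earliest estimated completion time."""

    def __init__(self, db: Database) -> None:
        self.db = db

    def schedule(self, service: str, work: float) -> str:
        best_name, best_time = None, float("inf")
        for name, info in self.db.servers.items():
            if service not in info.services:
                continue
            t = (info.queued_work + work) / info.speed
            if t < best_time:
                best_name, best_time = name, t
        if best_name is None:
            raise LookupError(f"no server offers {service!r}")
        # Book the work so the next decision sees the updated load.
        self.db.servers[best_name].queued_work += work
        return best_name


class Client:
    """Submits requests to the NES; it never names a server itself."""

    def __init__(self, scheduler: Scheduler) -> None:
        self.scheduler = scheduler

    def submit(self, service: str, work: float) -> str:
        return self.scheduler.schedule(service, work)
```

With a monitor reporting a server `fast` at speed 4 and a server `slow` at speed 1, three requests of work 8, 8, and 2 are mapped to `fast`, `fast`, and `slow`: by the third request, the fast server's queue pushes its estimated finish time to (16 + 2) / 4 = 4.5, above the slow server's 2 / 1 = 2.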
Two axes of generalization of this issue can be pursued. The first concerns the targeted infrastructure: more volatile and less secure contexts, such as desktop computing, also need to be considered. The second consists in taking into account forms of interaction other than RPC. Software component models provide a general conceptual model for dealing with them.
To achieve our goals, we need to address issues related to several well-known research domains. In particular, we focus on:
middleware and application platforms as a base to implement the necessary “glue” to broker client requests, to find the best available server, and then to submit the problem and its data,
online and offline scheduling of requests,
links with data management,
distributed algorithms to manage the requests and the dynamic behavior of the platform,
programming models that offer an adequate level of functionality while hiding as many resource-related details as possible.
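The difference between online and offline scheduling of requests can be shown with a short Python sketch under simplifying assumptions: arrival times are ignored, each server runs its requests one after another, and execution time is work divided by server speed. "Online" here only means that each request is placed without knowledge of later ones; the offline variant, which knows the whole batch, first sorts it by decreasing work (an LPT-style ordering) before the same greedy placement. The function names are hypothetical.

```python
from typing import List, Tuple


def greedy_assign(works: List[float], speeds: List[float]) -> Tuple[List[int], float]:
    """Place requests in the given order, each on the server that would finish it
    earliest. Returns the server index chosen for each request and the makespan."""
    if not speeds or any(s <= 0 for s in speeds):
        raise ValueError("speeds must be a non-empty list of positive numbers")
    finish = [0.0] * len(speeds)  # time at which each server becomes idle
    choice: List[int] = []
    for w in works:
        j = min(range(len(speeds)), key=lambda k: finish[k] + w / speeds[k])
        finish[j] += w / speeds[j]
        choice.append(j)
    return choice, max(finish)


def online_makespan(works: List[float], speeds: List[float]) -> float:
    # Online: each request is placed as it arrives, with no look-ahead.
    return greedy_assign(works, speeds)[1]


def offline_makespan(works: List[float], speeds: List[float]) -> float:
    # Offline: the batch is known in advance, so it can be reordered first;
    # placing the largest requests first tends to balance the load better.
    return greedy_assign(sorted(works, reverse=True), speeds)[1]
```

On two identical servers with requests of work 1, 1, and 2 arriving in that order, the online placement yields a makespan of 3, while the offline placement yields 2, because the large request is no longer stacked on a server that is already busy.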