Team Bacchus


Section: Scientific Foundations

Algorithms and high-performance solvers

Participants: Cécile Dobrzynski, Pascal Hénon, François Pellegrini, Pierre Ramet.

High-performance direct solvers for distributed clusters

Solving large sparse systems Ax = b of linear equations is a crucial and time-consuming step, arising in many scientific and engineering applications. Consequently, many parallel techniques for sparse matrix factorization have been studied and implemented.

Sparse direct solvers are mandatory when the linear system is very ill-conditioned; such a situation is often encountered in structural mechanics codes, for example. A robust and versatile industrial software tool therefore requires high-performance sparse direct solvers, and parallelism is then necessary both to provide enough memory capacity and to keep solving time acceptable. Moreover, in order to efficiently solve 3D problems with more than 50 million unknowns, which is now a reachable challenge with new SMP supercomputers (see Section 2.2), we must achieve good scalability in time and control memory overhead. Solving a sparse linear system by a direct method is generally a highly irregular problem that raises challenging algorithmic issues and requires a sophisticated implementation scheme in order to fully exploit the capabilities of modern supercomputers.

In the BACCHUS project, we focused first on the block partitioning and scheduling problem for high-performance sparse LDL^T or LL^T parallel factorization without dynamic pivoting for large sparse symmetric positive definite systems. Our strategy is also suitable for non-symmetric sparse matrices with symmetric pattern, and for general distributed heterogeneous architectures whose computation and communication performance is predictable in advance. This has led to software developments (see Sections 5.3 and 5.5).
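
As a purely illustrative sketch of factorization without dynamic pivoting, the following dense, sequential Python code computes A = L D L^T for a small symmetric positive definite matrix and then solves Ax = b by forward, diagonal and backward substitution. The function names `ldlt` and `ldlt_solve` and the dense list-of-rows storage are assumptions made for this example; they do not reflect the block-partitioned, parallel sparse implementation discussed above.

```python
def ldlt(a):
    """Factor a symmetric matrix as L * D * L^T without pivoting.

    `a` is a dense list of rows (illustration only, no sparsity exploited).
    Returns (l, d): unit lower-triangular L and the diagonal entries of D.
    """
    n = len(a)
    l = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    d = [0.0] * n
    for j in range(n):
        # Diagonal entry of D for column j.
        d[j] = a[j][j] - sum(l[j][k] ** 2 * d[k] for k in range(j))
        # Entries of column j of L below the diagonal.
        for i in range(j + 1, n):
            s = sum(l[i][k] * l[j][k] * d[k] for k in range(j))
            l[i][j] = (a[i][j] - s) / d[j]
    return l, d


def ldlt_solve(l, d, b):
    """Solve (L D L^T) x = b given the factors returned by ldlt()."""
    n = len(b)
    y = list(b)
    for i in range(n):                      # forward substitution: L y = b
        y[i] -= sum(l[i][k] * y[k] for k in range(i))
    x = [y[i] / d[i] for i in range(n)]     # diagonal scaling: D z = y
    for i in reversed(range(n)):            # backward substitution: L^T x = z
        x[i] -= sum(l[k][i] * x[k] for k in range(i + 1, n))
    return x


# Small symmetric positive definite example.
A = [[4.0, 1.0, 0.0],
     [1.0, 3.0, 1.0],
     [0.0, 1.0, 2.0]]
b = [6.0, 10.0, 8.0]
L, D = ldlt(A)
x = ldlt_solve(L, D, b)
```

Because no pivoting takes place, the order of operations is fixed by the matrix structure alone; this is the property that makes it possible to plan partitioning and scheduling before the numerical factorization starts.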

High-performance iterative and hybrid direct/iterative solvers

In addition to the project activities on direct solvers, we also study robust preconditioning algorithms for iterative methods. The goal of these studies is to overcome the huge memory consumption inherent to direct solvers in order to solve very large 3D problems (several million unknowns). Our studies focus on building generic parallel preconditioners based on ILU factorizations. Classical ILU preconditioners use scalar algorithms that make poor use of CPU power and are difficult to parallelize. Our work aims at finding orderings and partitionings of the unknowns that lead to a dense block structure of the incomplete factors. Then, based on the block pattern, efficient parallel blockwise algorithms can be devised to build robust preconditioners that are also able to fully exploit the capabilities of modern high-performance computers.
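
For reference, the classical scalar ILU(0) algorithm that the paragraph above contrasts with blockwise approaches can be sketched as follows. This is a minimal dense-storage illustration, assuming a hypothetical function name `ilu0` and nonzero pivots; it is not the blockwise algorithm studied in the project. The elimination is restricted to the nonzero pattern of A, so no fill-in is ever created.

```python
def ilu0(a):
    """Scalar ILU(0): Gaussian elimination restricted to the pattern of `a`.

    `a` is a dense list of rows (illustration only). Returns one array that
    holds L in its strictly lower part (unit diagonal implied) and U in its
    upper part, diagonal included. Entries outside the pattern stay zero.
    """
    n = len(a)
    lu = [row[:] for row in a]
    pattern = [[a[i][j] != 0 for j in range(n)] for i in range(n)]
    for i in range(1, n):
        for k in range(i):
            if not pattern[i][k]:
                continue                      # no entry to eliminate here
            lu[i][k] /= lu[k][k]              # multiplier, stored in L
            for j in range(k + 1, n):
                if pattern[i][j]:             # update only existing entries
                    lu[i][j] -= lu[i][k] * lu[k][j]
    return lu


# Arrow-shaped matrix: an exact LU factorization would create fill-in at
# positions (1, 2) and (2, 1); ILU(0) discards it.
A = [[4.0, 1.0, 1.0],
     [1.0, 4.0, 0.0],
     [1.0, 0.0, 4.0]]
LU = ilu0(A)
```

The innermost loop works on one scalar at a time and its trip count depends on the sparsity pattern, which illustrates why such scalar kernels exploit processors poorly compared with dense block operations.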

In this context, we study two approaches.

This work is also supported by the ANR-CIS project “SOLSTICE”.

Meshes and graph partitioning

Parallel graph partitioning and static mapping

Finding vertex separators for sparse matrix ordering is only one of the many uses of generic graph partitioning tools. For instance, finding balanced and compact domains in problem graphs is essential to the efficiency of parallel iterative solvers. Here again, because of the size of the problems at stake, parallel graph partitioning tools are mandatory to provide good load balance and minimal communication cost.
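
To make the notion of a vertex separator concrete, here is a minimal sketch of the classical level-structure heuristic: run a breadth-first search from some root and take the middle level as the separator. Since edges of a graph only join vertices in the same or in adjacent BFS levels, removing one full level disconnects the levels before it from the levels after it. The function names and the grid test graph are assumptions chosen for this example, and the graph is assumed connected; production tools such as Scotch rely on more elaborate methods.

```python
from collections import deque


def bfs_levels(adj, root):
    """Return a dict mapping each reachable vertex to its BFS distance from root."""
    level = {root: 0}
    queue = deque([root])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in level:
                level[v] = level[u] + 1
                queue.append(v)
    return level


def level_separator(adj, root):
    """Split a connected graph into (part0, separator, part1) using the middle BFS level."""
    level = bfs_levels(adj, root)
    mid = max(level.values()) // 2
    part0 = {v for v, l in level.items() if l < mid}
    sep = {v for v, l in level.items() if l == mid}
    part1 = {v for v, l in level.items() if l > mid}
    return part0, sep, part1


def grid_graph(rows, cols):
    """Adjacency lists of a rows x cols grid (4-neighbour connectivity)."""
    adj = {(r, c): [] for r in range(rows) for c in range(cols)}
    for r, c in adj:
        for dr, dc in ((1, 0), (0, 1)):
            nb = (r + dr, c + dc)
            if nb in adj:
                adj[(r, c)].append(nb)
                adj[nb].append((r, c))
    return adj


adj = grid_graph(3, 5)
part0, sep, part1 = level_separator(adj, (0, 0))
```

In a nested dissection ordering, the two parts would be ordered recursively and the separator vertices numbered last; the quality of the result depends on keeping the separator small and the two parts balanced.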

The execution of parallel applications implies communication between processes executed on different cores. On NUMA architectures, which are strongly heterogeneous in terms of latency and capacity, communication cost strongly depends on the distribution of tasks among cores. Architecture-aware load balancing must take into account both the characteristics of the parallel application (including, for instance, task processing costs and the amount of communication between tasks) and the topology of the target architecture (providing the processing power of each core and the cost of communication between any two of them). When processes are assumed to coexist for the entire duration of the program, this optimization problem is called mapping. A mapping is called static if it is computed prior to the execution of the program and is never modified at run-time.
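
The communication part of the mapping objective can be sketched as follows: each pair of communicating tasks contributes its communication volume multiplied by the distance between the cores that host the two tasks. Everything in this example is hypothetical (the function names, the chain of four tasks, the two-node distance matrix), and it simplifies the problem by placing exactly one task per core, which leaves out task processing costs and core powers.

```python
from itertools import permutations


def mapping_cost(task_edges, core_dist, mapping):
    """Sum over communicating task pairs of volume x distance between their cores.

    task_edges: dict {(task_u, task_v): communication volume}
    core_dist:  matrix of communication costs between cores
    mapping:    sequence where mapping[task] is the core hosting that task
    """
    return sum(vol * core_dist[mapping[u]][mapping[v]]
               for (u, v), vol in task_edges.items())


def best_one_to_one_mapping(n_tasks, task_edges, core_dist):
    """Exhaustive search with one task per core; only viable for tiny instances."""
    cores = range(len(core_dist))
    return min(permutations(cores, n_tasks),
               key=lambda m: mapping_cost(task_edges, core_dist, m))


# Hypothetical instance: a chain of 4 tasks on 2 NUMA nodes of 2 cores each.
# Cores 0-1 share one node, cores 2-3 the other; crossing nodes costs 10.
task_edges = {(0, 1): 5, (1, 2): 1, (2, 3): 5}
core_dist = [
    [0, 1, 10, 10],
    [1, 0, 10, 10],
    [10, 10, 0, 1],
    [10, 10, 1, 0],
]

topology_aware = (0, 1, 2, 3)    # heavy edges stay inside a node: cost 20
topology_blind = (0, 2, 1, 3)    # every edge crosses nodes: cost 110
best = best_one_to_one_mapping(4, task_edges, core_dist)
```

The exhaustive search is only there to show what is being optimized; its factorial cost is precisely why heuristic mapping tools are needed for large problem graphs and large machines.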

The sequential Scotch tool has been able to perform static mapping since its first version, but this feature was neither widely known nor widely used by the community. With the increasing need to map very large problem graphs onto very large and strongly heterogeneous parallel machines (whether hierarchical NUMA clusters or GPU-based systems), there is an increasing demand for parallel static mapping tools.

Adaptive dynamic mesh partitioning

Many simulations which model the evolution of a given phenomenon over time (turbulence and unsteady flows, for instance) need to re-mesh some portions of the problem graph in order to capture more accurately the properties of the phenomenon in areas of interest. This re-meshing is performed according to criteria which are closely linked to the ongoing computation and can involve large mesh modifications: while elements are created in critical areas, some may be merged in areas where the phenomenon is no longer critical.

Performing such re-meshing in parallel creates additional problems. In particular, splitting an element which is located on the frontier between several processors is not an easy task, because deciding when to split an element, and defining the direction along which to split it so as to best preserve numerical stability, require shared knowledge which is not available in distributed memory architectures. Ad-hoc data structures and algorithms have to be devised so as to achieve these goals without resorting to extra communication and synchronization which would impact the running speed of the simulation.

Most work on parallel mesh adaptation attempts to parallelize, in some way, all of the mesh operations: edge swap, edge split, point insertion, etc. This implies deep modifications in the (re)mesher and often leads to poor performance in terms of CPU time. Another work [30] proposes to base parallel re-meshing on an existing mesher and on load balancing, so as to be able to modify the elements located on the frontier between several processors.

In addition, the preservation of load balance in the re-meshed simulation requires dynamic redistribution of mesh data across processing elements. Several dynamic repartitioning methods have been proposed in the literature [40], [39], which rely on diffusion-like algorithms and the solving of flow problems to minimize the amount of data to be exchanged between processors. However, integrating such algorithms into a global framework for handling adaptive meshes in parallel has yet to be done.
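
To give an idea of what a diffusion-like scheme looks like, the sketch below implements a generic first-order diffusion iteration on a processor graph: at each step, every processor exchanges with each neighbour a fraction `alpha` of their load difference. This is a textbook scheme shown under stated assumptions, not necessarily the methods of [40] or [39]; the function names, the ring topology and the choice of `alpha` are hypothetical. The graph is assumed connected with symmetric neighbour lists, and `alpha` is assumed smaller than the inverse of the maximum degree.

```python
def diffusion_step(load, neighbors, alpha):
    """One first-order diffusion step; the total load is conserved.

    load:      list of per-processor loads
    neighbors: neighbors[i] lists the processors adjacent to processor i
    alpha:     fraction of each pairwise load difference that is exchanged
    """
    new = list(load)
    for i, nbrs in enumerate(neighbors):
        for j in nbrs:
            new[i] += alpha * (load[j] - load[i])
    return new


def diffuse(load, neighbors, alpha, tol=1e-6, max_steps=10_000):
    """Iterate until the load spread is below tol; return (loads, step count)."""
    steps = 0
    while max(load) - min(load) > tol and steps < max_steps:
        load = diffusion_step(load, neighbors, alpha)
        steps += 1
    return load, steps


# Hypothetical case: a ring of 4 processors where re-meshing has
# concentrated all the work on processor 0.
ring = [[1, 3], [0, 2], [1, 3], [0, 2]]
balanced, steps = diffuse([100.0, 0.0, 0.0, 0.0], ring, alpha=0.25)
```

Load only moves between neighbouring processors, so data migration stays local; turning the resulting flows into actual choices of mesh elements to migrate is a separate step that this sketch does not address.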

