Team Sardes

Section: Scientific Foundations

Autonomous Distributed System Management

Management (or administration) is the function that aims at maintaining a system's ability to provide its specified services with a prescribed quality of service. In general terms, administration may be viewed as a control activity, involving an event-reaction loop: the administration system detects events that may alter the ability of the administered system to perform its function, and reacts to these events by trying to restore this ability. The operations performed under system and application administration include configuration and deployment, reconfiguration, resource management, observation and monitoring.

Up to now, administration tasks have mainly been performed by human operators. A great deal of the knowledge needed for these tasks is not formalized and is part of the administrators' know-how and experience. As the size and complexity of systems and applications increase, the costs related to administration are taking up a major part of total information processing budgets, and the difficulty of the administration tasks tends to approach the limits of the administrators' skills. The traditional approach based on the manager-agent model, the basis of widely used management protocols and frameworks such as SNMP and CMIS/CMIP, is also proving unable to cope with today's highly dynamic managed systems.

The above remarks have motivated a new approach, in which a significant part of management-related functions is performed automatically, with minimal human intervention. This is the goal of the so-called autonomic computing movement [65].

Autonomic computing aims at providing systems and applications with self-management capabilities, including self-configuration (automatic configuration according to a specified policy), self-optimization (continuous performance monitoring and tuning), self-healing (detecting defects and failures, and taking corrective actions), and self-protection (taking preventive measures and defending against malicious attacks).

Several research projects [51] are active in this area. A recent survey of the main research problems related to autonomic computing is given in [66].

Section 3.4.1 examines a few issues in autonomous system management. Section 3.4.2 presents our approach to these problems.

Selected Issues in Autonomous System Management

We examine three specific aspects related to the current activity of the project: event distribution, resource management, and fault tolerance.

Event Channels

Many management functions (monitoring, resource management, fault tolerance) rely on events, propagated from the administered components of the system to the management components. In a large Internet-based system, these events may be generated at a very high rate. Thus, efficient channels are needed to collect and to filter these events, and to propagate them to their recipients. This problem has emerged as a topic of active research. It is also related to the fast-developing area of sensor networks.

Event propagation usually follows the publish-subscribe pattern [63]. Scalable solutions, based on multicast groups, are known for publish-subscribe based on a fixed (or slowly changing) list of topics. However, designing an efficient publish-subscribe system based on the contents of the propagated messages is still an open problem. For Internet-scale systems, even filtering algorithms running in linear time with respect to the number of subscriptions are considered inefficient, and the goal is to design sub-linear algorithms. The Gryphon [55] and Siena [60] projects have made significant progress towards this goal. The application of event channels to large scale observation and management is the theme of the Astrolabe [73] project.
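To make the content-based filtering problem concrete, the following Python sketch indexes subscriptions by their equality predicates, so that matching an event only visits subscriptions sharing at least one attribute-value pair with it, instead of scanning all of them. This is a simple counting scheme chosen for illustration; it is not the algorithm used by Gryphon or Siena, and the class name, the restriction to equality predicates, and the event format (a dictionary of attributes) are all assumptions.

```python
from collections import defaultdict


class ContentBasedBroker:
    """Toy content-based matcher (hypothetical; equality predicates only)."""

    def __init__(self):
        # (attribute, value) -> ids of subscriptions requiring that pair
        self._index = defaultdict(set)
        # subscription id -> number of predicates it contains
        self._sizes = {}

    def subscribe(self, sub_id, predicates):
        """Register a subscription given as a dict: attribute -> required value."""
        if not predicates:
            raise ValueError("a subscription needs at least one predicate")
        self._sizes[sub_id] = len(predicates)
        for attr, value in predicates.items():
            self._index[(attr, value)].add(sub_id)

    def match(self, event):
        """Return the ids of subscriptions whose predicates are all satisfied."""
        hits = defaultdict(int)
        for attr, value in event.items():
            # Only subscriptions mentioning this exact pair are visited.
            for sub_id in self._index.get((attr, value), ()):
                hits[sub_id] += 1
        # A subscription matches when every one of its predicates was hit.
        return {s for s, n in hits.items() if n == self._sizes[s]}


broker = ContentBasedBroker()
broker.subscribe("ops", {"kind": "failure", "tier": "db"})
broker.subscribe("audit", {"kind": "failure"})
broker.subscribe("perf", {"kind": "load"})
matched = broker.match({"kind": "failure", "tier": "db", "node": "n7"})
```

The cost of `match` depends on the number of candidate subscriptions rather than on the total number of subscriptions, which is the property sought for Internet-scale systems; range predicates and distributed routing are left out of this sketch.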

Architectural Aspects of Resource Management

The advent of resource-constrained applications, such as multimedia processing and real-time control, raised the need for delegating part of the resource management to the application, while leaving the operating system with the responsibility of fair overall resource sharing between its users. The rationale is that the application is in a better position to know its precise requirements and may dynamically adapt its demands to its needs, thus allowing global resource usage to be optimized, while guaranteeing a better service to each individual application.

Two main issues need to be considered in resource management: algorithmic (defining quality criteria, and optimizing the system's performance accordingly) and architectural (organizing the system to isolate resource-oriented concerns and to exhibit the relevant interfaces). Here we concentrate on the architectural issue.

The first challenge is to define relevant abstractions for resource management, both for the resource principals, i.e. the entities to which resources are allocated, and for the resources themselves. Thus, in the case of cluster computing, new abstractions such as "cluster reserves" and "virtual clusters", which group a dynamic set of cluster-wide resources, are being investigated. Another challenge is to define what part of resource management is delegated from the operating system to the application, and to specify the relevant interfaces.

Scheduler activations [56] are an early example of such work in the area of multiprocessor scheduling. A more radical approach is illustrated by exo-kernels [62], in which the lower layer exports primitives allowing a kernel or application designer to define their own resource allocation policy. This is done at the expense of additional work, which may be alleviated by the use of appropriate frameworks, as done in Think [7]. However, since exo-kernels give access to low-level system entities, they are prone to security problems.

Tracking the Causes of Internet Services Failures

As the size of current systems and applications is increasing (in terms of number of users, hardware elements, software components, geographical scale), it is almost certain that some part of a given system or application will be faulty at any time. Maintaining the system's availability in spite of this condition is a major challenge, and one of the goals of autonomic computing ("self-repair").

An analysis of the causes of failures of Internet services [70] shows that most of a service's downtime may be attributed to management errors (e.g. wrong configuration), and that software failures come second. However, little effort has yet been devoted to remedying this situation.

The role of configuration and deployment has long been neglected. Current research [61], [64] aims at giving a rigorous foundation to these phases of the lifecycle, thus reducing the likelihood of misconfiguration, and opening the way to autonomous configuration and deployment.

Software errors are notoriously hard to locate and to repair. Recent work [58] has shown that a semi-empirical approach (delimit the smallest possible part of a system affected by the fault, and reboot only that part) can be quite effective, as demonstrated on a full-size platform [59].
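The idea of rebooting only the smallest affected part can be sketched as follows in Python. The component tree, the `restart` and `healthy` callbacks, and the policy of escalating to the enclosing component when a restart does not cure the fault are assumptions made for illustration; they are not taken from [58] or [59].

```python
class Component:
    """Node of a hypothetical containment tree of restartable components."""

    def __init__(self, name, children=()):
        self.name = name
        self.children = list(children)
        self.parent = None
        for child in self.children:
            child.parent = self

    def subtree(self):
        """Yield this component, then all of its descendants."""
        yield self
        for child in self.children:
            yield from child.subtree()


def recover(faulty, restart, healthy):
    """Restart the smallest subtree containing `faulty`; widen it on failure.

    Returns the name of the component whose subtree restart cured the
    fault, or None if even restarting the whole tree did not help.
    """
    node = faulty
    while node is not None:
        for comp in node.subtree():
            restart(comp)
        if healthy():
            return node.name
        node = node.parent  # escalate to the enclosing component
    return None


# Example: a fault observed in "cache" that is only cured by restarting "app".
cache = Component("cache")
session = Component("session")
app = Component("app", [session, cache])
web = Component("web")
server = Component("server", [web, app])

restarted = []
cured_by = recover(cache,
                   restart=lambda c: restarted.append(c.name),
                   healthy=lambda: "app" in restarted)
```

In this run the `web` component is never restarted, which is the point of the approach: the cost of recovery stays proportional to the part of the system actually affected.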

Our Approach to Autonomous Systems Management

The main goal in the Autonomous Distributed System Management theme in Sardes is to develop component-based software infrastructure frameworks and tools for the control and administration of large scale, long-lived distributed systems, such as clustered application servers or Grid systems. Target application domains for our prototype autonomous management systems include large Web application servers such as J2EE servers, large information mediation servers such as enterprise service buses (ESBs), and Grids.

This theme includes the following research activities.

  1. Infrastructure for system management.

    In this work, the aim is to develop tools and middleware technology to support the principled implementation of system management loops. We consider the following subjects.

    • Instrumentation for component structures: the aim is to develop a set of tools complementing our component-based software engineering tools with specific low-level logging, monitoring and resource accounting functions. This should provide us with the technology required to efficiently and dynamically implement sensors for our systems control and management loops.

    • Asynchronous middleware: the aim is to develop a dynamically configurable and scalable technology for the construction of large scale (number of events, number of components, geographical dispersion) asynchronous information dissemination channels. Such channels are crucial in system management for the efficient transport of event notifications.

      Challenges in this area include: devising scalable routing schemes for event dissemination, defining composable abstractions for the construction of multiple asynchronous middleware personalities, fault-tolerance and overload management.

    • Dynamic system cartography: the aim is to develop multi-level, multi-grain monitoring services able to construct and maintain, throughout the lifetime of a system or application, a causally connected view of the system or application being managed.

  2. Distributed configuration management.

    In this work, the aim is to develop a comprehensive set of system and software deployment and configuration services, from low-level bootstrapping services for component loading and system initialization, to higher-level routing, scheduling and orchestration services for the on-line deployment and configuration of complex distributed systems with multiple component bindings and dependencies.

    Challenges in this area include: automatic resource and service discovery, dealing with partial failures, dealing with multiple dependencies and multiple coexisting versions, controlling and optimizing configuration and deployment workflows, automated support for high-level configuration policies.

  3. Autonomic capabilities.

    In this work, the aim is to study the automation of various systems management functions, and to allow in particular automated performability (i.e. the combination of performance and availability) management under quality of service and service level agreement constraints. The approach we follow as a first step is both architecture-based, i.e. leveraging an explicit component-based structure at run-time, and empirical, i.e. relying mostly on the empirical derivation of performance and availability models through the instrumentation and run-time monitoring of specific experimental systems. In a second step, we plan to investigate the use of more sophisticated modelling tools and techniques, including e.g. the use of control theory techniques for deriving and synthesizing control laws for distributed resource management and overload management. Capabilities we consider initially include: automated repair management and automated sizing, starting with cluster-size systems such as clustered Web application servers.

    Repair management is a direct complement to standard fault tolerance and fault recovery techniques in distributed systems. It allows a managed system to be brought back into a target regime of behavior, through reconfiguration and resource allocation, after the occurrence of a (non-malicious) fault. Challenges in automating repair management include: dealing with a mixture of hardware and software faults, as well as different fault models; automatic fault detection and diagnosis; support for self-healing; support for repair management in large scale distributed systems.

    Automated sizing, or self-sizing, aims to automatically adapt the set of resources used by the provider of a service to the level of demand for this service, e.g. automatically acquiring or releasing resources for a Web application server, according to the load of client requests. Challenges in self-sizing include: load and overload characterization in a distributed system; identifying relevant control parameters for performance tuning; devising effective algorithms for distributed performance control and load balancing.
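As an illustration of the self-sizing capability described in item 3, a very simple sizing policy can be written as a threshold controller with hysteresis. The Python sketch below is only one possible policy, not the one used in our prototypes: the function names, the thresholds, the one-replica step, and the assumption that load is spread evenly across replicas are all hypothetical.

```python
def resize(replicas, load_per_replica, low=0.3, high=0.8,
           min_replicas=1, max_replicas=16):
    """Return the new replica count for a given average load per replica.

    The gap between `low` and `high` acts as a hysteresis band that
    avoids oscillating between two sizes when the load is stable.
    """
    if load_per_replica > high and replicas < max_replicas:
        return replicas + 1   # acquire one more resource
    if load_per_replica < low and replicas > min_replicas:
        return replicas - 1   # release one resource
    return replicas


def simulate(demands, replicas=1, **policy):
    """Apply `resize` to a sequence of total demand samples.

    Demand is expressed in units of one replica's capacity and is
    assumed to be balanced evenly over the current replicas.
    """
    history = []
    for demand in demands:
        replicas = resize(replicas, demand / replicas, **policy)
        history.append(replicas)
    return history


history = simulate([0.5, 1.0, 2.0, 2.0, 2.0, 0.4, 0.4])
```

In this trace the number of replicas grows from 1 to 3 while demand rises to 2.0, stays at 3 while the per-replica load remains inside the band, and shrinks back to 1 once demand falls. A control-theoretic design, as envisaged for the second step, would replace the fixed thresholds and step size with a synthesized control law.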

