Section: Scientific Foundations
Computer assisted monitoring and diagnosis of physical systems
Our work on monitoring and diagnosis relies on model-based approaches developed by the Artificial Intelligence community since the founding studies by R. Reiter and J. de Kleer [83], [49]. Our project investigates the on-line monitoring and diagnosis of systems modeled as discrete-event systems, focusing more precisely on monitoring by alarm management [64]. Computational efficiency is a crucial issue for real-size problems. We are developing two approaches. The first relies on diagnoser techniques [88], for which we have proposed a decentralized and generic approach. The second uses chronicle recognition techniques, with a focus on learning chronicles.
Early work on model-based diagnosis dates back to the 1970s and 1980s, notably by R. Reiter, the reference papers on the logical theory of diagnosis being [83], [49]. The community known as DX, named after the workshop on the principles of diagnosis, was formed in the same period. Research in these areas is still very active, and the workshop gathers about fifty people in the field every year. As opposed to the expert system approach, which was the leading approach to diagnosis (medical diagnosis, for instance) before 1990, the model-based approach relies on a deep model representing the expected correct behavior of the system to be supervised, or on a fault model. Instead of acquiring and representing expertise from experts, the model-based approach uses the design models of industrial systems. The approach was initially developed for the repair of electronic circuits [50], focusing on the off-line diagnosis of so-called static systems. Two main approaches were then proposed: (i) the consistency-based approach, which relies on a model of the expected correct behavior and aims at detecting the components responsible for a discrepancy between the expected observations and the actual ones; (ii) the abductive approach, which relies on a model of the failures that might affect the system and identifies the failures or the faulty behavior explaining the anomalous observations. See the references [24], [26] for a detailed exposition of these investigations.
Since 1990, researchers in the field have studied the monitoring and diagnosis of dynamic systems, much as researchers in control theory do. What characterizes the AI approach is the use of qualitative models instead of quantitative ones and the importance given to the search for the actual sources or causes of the faulty behavior. Model-based diagnosis approaches rely on qualitative simulation or on causal graphs to look for the causes of the observed deviations. The links between the two communities have been strengthened, in particular regarding work on discrete-event systems and hybrid systems. The formalisms used are often similar (automata, Petri nets, ...) [33], [64].
Our team focuses on the monitoring and on-line diagnosis of discrete-event systems, and in particular on monitoring by alarm management. In this context, a human operator is generally in charge of monitoring the system and receives time-stamped events (the alarms), which are emitted by the components themselves in reaction to external events. These observations on the system are discrete pieces of information, corresponding to an instantaneous event or to a property associated with a time interval. The main difficulties in analyzing this flow of alarms are the following:
-
the huge number of received alarms: the supervisor may receive up to several hundred messages per second, many of which are insignificant,
-
the alarm overlapping: the order in which alarms are received may differ from the order in which they were emitted. Moreover, various sequences of alarms resulting from concurrent failures may overlap. The propagation delays, and sometimes the ways the alarms are transmitted, must be taken into account, not only to reorder events but also to decide when all the useful messages can be considered received.
-
the redundancy of received alarms: some alarms are only routine consequences of other alarms. This can cause a phenomenon known as cascading alarms.
-
the alarm loss or alarm masking: some alarms can be lost or masked from the supervisor when an intermediate component in charge of the transmission is faulty. The absence of an alarm must be taken into account, since it can give useful information about the state of the system.
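The reordering difficulty listed above can be illustrated with a small buffer. This is only a minimal sketch, assuming that every alarm carries its emission timestamp and that propagation delays are bounded by a known value; the names `Alarm`, `ReorderBuffer` and `max_delay` are hypothetical and do not come from the cited work.

```python
import heapq
from dataclasses import dataclass, field
from typing import List


@dataclass(order=True)
class Alarm:
    emitted_at: float                 # emission timestamp set by the component
    name: str = field(compare=False)  # alarm label, ignored when ordering


class ReorderBuffer:
    """Release alarms in emission order once no earlier alarm can still arrive.

    Assumption: an alarm never takes longer than `max_delay` to reach the
    supervisor. An alarm emitted at time t is therefore safe to release at
    reception time `now` when t + max_delay <= now, because any alarm
    emitted before t must already have been received.
    """

    def __init__(self, max_delay: float) -> None:
        self.max_delay = max_delay
        self._heap: List[Alarm] = []

    def receive(self, alarm: Alarm, now: float) -> List[Alarm]:
        """Store a newly received alarm and return the alarms now safe to process."""
        heapq.heappush(self._heap, alarm)
        return self.release(now)

    def release(self, now: float) -> List[Alarm]:
        """Pop, in emission order, every alarm whose waiting time has elapsed."""
        ready = []
        while self._heap and self._heap[0].emitted_at + self.max_delay <= now:
            ready.append(heapq.heappop(self._heap))
        return ready
```

The same bound also answers the question of when all useful messages can be considered received: once `now` exceeds the emission time of interest by `max_delay`, nothing older is still in transit. If the delay bound does not hold in practice, late alarms would be released out of order, so the bound has to be chosen conservatively.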
Two cases, raising very different issues, must be distinguished. In the first one, the alarms must be dealt with on-line by the operator. In this case, alarm analysis must be done in real time. The operator must react within a very short period of time to keep the system working as well as possible in spite of the variability of the inputs and the natural evolution of the processes. Consequently, the natural degradation of the system (component wear, slow drift of component properties, etc.) is not directly taken into account but is corrected by tuning some parameters.
This reactive treatment contrasts with the treatment of maintenance alarms. In this second case, a deeper off-line analysis of the system is performed, anticipating the possible difficulties and planning the maintenance operations so as to significantly reduce the failures and interruptions of the system.
The major part of our work focuses on aiding on-line monitoring, and it is assumed that the correct behavior model or the fault models of the supervised systems are available. However, on-line use of the models is rarely possible because of their complexity with respect to real-time constraints. This is especially true when temporal models are concerned. One way to tackle this problem is to transform (or compile) the models off-line and to extract, in a suitable form, the elements useful for diagnosis.
We study two different methods:
-
In the first method, the automaton used as a model is transformed off-line into an automaton adapted to diagnosis. This automaton is called a diagnoser. Its transitions are triggered only by observable events, and its states contain only information on the failures that have occurred in the system. Diagnosing the system consists in following the transitions of the diagnoser, moving from state to state as observable events become available. This method was proposed by M. Sampath and colleagues [88]. We have extended this method to the communicating automata formalism [86] (see also [84]). We have also developed a more generic method which takes advantage of the symmetries in the architecture of the system [85].
The main drawback of centralized approaches is that they require the global model of the system to be built explicitly, which is unrealistic for large and complex systems such as telecommunication networks. This is why our more recent work deals with a decentralized approach [79]. This approach can be compared with those of R. Debouk and colleagues [48] and of P. Baroni and colleagues [32]. Our method, unlike that of R. Debouk et al., relies on local models. We do not need to construct a global model, whose size would have been too large in our applications. Although the methods are very close, P. Baroni et al. are concerned with an a posteriori (off-line) diagnosis, whereas we propose an on-line diagnosis. Each time an alarm arrives, it is analyzed and the diagnosis hypotheses are incrementally computed and given to the operator. Our main theme of study is close to that of E. Fabre and colleagues [58], [28]. The main difference is that they propose a multi-agent approach in which the diagnoses are computed locally at the component level using message exchanges, whereas we construct a global diagnosis which is given to the operator at the supervisor level.
-
In the second method, the idea is to associate each failure that we want to detect with a chronicle (or a scenario), i.e. a set of observable events interlinked by time constraints. The chronicle recognition approach consists in monitoring and diagnosing dynamic systems by recognizing those chronicles on-line [55], [82], [53].
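The off-line compilation used by the first method can be sketched as follows. This is a simplified, centralized illustration in the spirit of the diagnoser construction, not the decentralized method of the team. It assumes a small nondeterministic automaton given as a dictionary of transitions, and it labels each diagnoser state with sets of fault events rather than with the labels used in [88]. All function names and the toy model are hypothetical.

```python
from collections import deque


def unobservable_reach(pairs, transitions, observable, faults):
    """Close a set of (state, fault-set) pairs under unobservable events."""
    seen = set(pairs)
    stack = list(pairs)
    while stack:
        state, label = stack.pop()
        for event, nxt in transitions.get(state, []):
            if event in observable:
                continue
            # Remember the fault if the silent event is a fault event.
            new_label = label | {event} if event in faults else label
            pair = (nxt, new_label)
            if pair not in seen:
                seen.add(pair)
                stack.append(pair)
    return frozenset(seen)


def build_diagnoser(initial, transitions, observable, faults):
    """Compile the model off-line into a table driven by observable events only."""
    start = unobservable_reach({(initial, frozenset())},
                               transitions, observable, faults)
    table = {}
    queue = deque([start])
    while queue:
        current = queue.popleft()
        if current in table:
            continue
        table[current] = {}
        for event in observable:
            moved = {(nxt, label)
                     for state, label in current
                     for ev, nxt in transitions.get(state, [])
                     if ev == event}
            if moved:
                target = unobservable_reach(moved, transitions,
                                            observable, faults)
                table[current][event] = target
                queue.append(target)
    return start, table


def diagnose(start, table, observations):
    """Follow the diagnoser on-line; return the candidate fault sets.

    Raises KeyError if an observation is impossible according to the model.
    """
    current = start
    for event in observations:
        current = table[current][event]
    return {label for _, label in current}


# Toy model (hypothetical): "fail" is an unobservable fault event.
transitions = {
    "ok": [("fail", "broken"), ("ping", "ok")],
    "broken": [("ping", "broken"), ("alarm", "broken")],
}
start, table = build_diagnoser("ok", transitions,
                               observable={"ping", "alarm"},
                               faults={"fail"})
```

In this toy model, observing only `ping` leaves two hypotheses (no fault, or `fail`), while observing `ping` then `alarm` leaves the single hypothesis `fail`. The subset construction is what makes the global diagnoser expensive for large systems, which is the motivation for decentralized approaches.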
One of our research focuses is to extend chronicle recognition methods to a distributed context. Local chronicle bases and local recognizers are used to detect and diagnose faults in each component. However, it is important to take into account the interaction model (the messages exchanged by the components). Computing a global diagnosis then requires checking the synchronization constraints between local diagnoses.
Another issue is the acquisition of the chronicle base. A chronicle base must contain all the chronicles characterizing the behaviors to monitor. Moreover, the base must be updated each time the supervised system evolves physically or structurally. An expert is often needed to create the chronicle base, which makes creating and maintaining it very expensive. That is why we are working on an automatic method to acquire the base.
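To make the notion of a chronicle concrete, the following sketch represents a chronicle as a list of event types plus time constraints between pairs of them, and searches a time-stamped log for matching instances. It is a brute-force, off-line illustration under stated assumptions, not the incremental on-line recognizers of [55], [82], [53]; the function name, the constraint encoding and the example events are hypothetical.

```python
from itertools import product


def recognize(chronicle_events, constraints, log):
    """Return every instance of a chronicle found in a time-stamped log.

    chronicle_events: list of distinct event types, e.g. ["a", "b"].
    constraints: tuples (i, j, lo, hi) meaning lo <= t[j] - t[i] <= hi,
                 where i and j index into chronicle_events.
    log: list of (event_type, timestamp) pairs, in any order.

    Limitation: event types are assumed distinct within one chronicle,
    otherwise the same occurrence could be matched twice.
    """
    # Candidate timestamps for each event of the chronicle.
    candidates = [[t for name, t in log if name == ev]
                  for ev in chronicle_events]
    instances = []
    # Try every combination of one occurrence per chronicle event.
    for times in product(*candidates):
        if all(lo <= times[j] - times[i] <= hi
               for i, j, lo, hi in constraints):
            instances.append(times)
    return instances


# Hypothetical chronicle: a timeout follows a link_down within 5 time
# units, and a reroute follows the timeout after 1 to 10 time units.
events = ["link_down", "timeout", "reroute"]
constraints = [(0, 1, 0, 5), (1, 2, 1, 10)]
log = [("link_down", 0.0), ("timeout", 3.0), ("link_down", 20.0),
       ("reroute", 6.0), ("timeout", 40.0)]
matches = recognize(events, constraints, log)
```

On this log, the only instance is the triple of timestamps `(0.0, 3.0, 6.0)`. The exhaustive search is exponential in the number of chronicle events, so a practical recognizer would instead maintain partial instances incrementally as alarms arrive.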
Applications generally deal with system monitoring (telecommunication networks) and video surveillance (underground, banks, etc.).
Developing diagnosis methodologies is not enough, especially when on-line monitoring is required. Two related concerns must be tackled, and are the topics of current research in the team:
-
The ultimate goal is usually not merely to diagnose, but to return the system to an acceptable state after the occurrence of a fault. This calls for considering the repair capabilities of a system and designing the diagnoser in such a way that the diagnoses are sufficiently discriminating to trigger a valid repair procedure.
-
When designing a system and equipping it with diagnosis capabilities, it may be crucial to be able to check off-line that the system will behave correctly, i.e. that the system is actually 'diagnosable'. Diagnosability holds when two distinct faults (or one fault and the correct behavior) can never produce the same set of observations. Many techniques have been developed in the past (see Lafortune and colleagues [87]), mainly for automata models. Extending them to deal with temporal patterns, permanent faults, multiple faults, fault sequences and some problems of intermittent faults, or trying to relate such techniques to the diagnosability of continuous systems, has been the main focus of our studies up to now. We now intend to study diagnosability and repairability capabilities together, in order to build self-healing systems.
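The informal definition of diagnosability given above can be turned into a direct check. The sketch below is a deliberately simplified illustration: it assumes that the sets of observations each behavior mode can produce have already been enumerated and are finite, which is not how the automata-based techniques of [87] proceed. The function names, mode names and observations are hypothetical.

```python
from itertools import combinations


def undiagnosable_pairs(signatures):
    """List the pairs of modes that can produce the same set of observations.

    signatures: dict mapping a mode name (a fault, or the correct behavior)
                to the set of observation sets it may produce, each
                observation set being a frozenset of observable events.
    """
    conflicts = []
    for m1, m2 in combinations(sorted(signatures), 2):
        shared = signatures[m1] & signatures[m2]
        if shared:
            conflicts.append((m1, m2, shared))
    return conflicts


def is_diagnosable(signatures):
    """True when no two modes share any possible set of observations."""
    return not undiagnosable_pairs(signatures)


# Hypothetical signatures for a small cooling system.
signatures = {
    "normal": {frozenset({"ping"})},
    "fan_fault": {frozenset({"ping", "temp_high"})},
    "sensor_fault": {frozenset({"ping", "temp_high"}),
                     frozenset({"temp_high"})},
}
conflicts = undiagnosable_pairs(signatures)
```

Here the fan fault and the sensor fault can both yield the observations `ping` and `temp_high`, so the system is not diagnosable as modeled. Such a check points to where an additional observable event, or a repair procedure valid for both faults, would be needed.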