Section: New Results
Machine learning for model acquisition
Participants: Marie-Odile Cordier, Thomas Guyet, Christine Largouët, Alice Marascu, Véronique Masson, René Quiniou.
Model acquisition is an important issue for model-based diagnosis, especially when modeling dynamic systems. We investigate machine learning methods for temporal data recorded by sensors and for spatial data resulting from simulation processes.
Learning and mining from sequential and temporal data
Our main interest is extracting knowledge, especially sequential and temporal patterns or prediction rules, from static or dynamic data (data streams). We are particularly interested in detecting and dealing with concept change in data stream mining, in order to be able to adapt the models used for diagnosis on-line.
- Mining temporal patterns with numerical information
This work aims at exploring a new approach to learning the temporal extents of events in chronicle models. Mining sequential patterns as well as extracting frequent temporal episodes has been widely studied. However, in a monitoring context, the numerical information associated with event durations or with the delays between event occurrences is very meaningful for discriminating between faults or diseases. Due to the added complexity of such patterns, numerical temporal pattern mining has been much less explored than mining sequential patterns based on the simpler notion of precedence.
We focus on mining temporal interval patterns from time-series databases or from sequence databases, where a temporal sequence is a set of events with time-stamped beginnings and ends. Our first proposal relied on an algorithm that processes temporal sequences represented by hyper-cubes [65]. Candidate generation uses the classical Apriori scheme [29] enriched with constraints on temporal intervals. Frequent temporal pattern selection estimates the interval distribution in order to extract the statistically significant temporal intervals. The algorithm is implemented in Matlab (see Section 5.5).
We are currently working on improving the algorithm's complexity, more precisely on estimating the distribution of temporal intervals more efficiently. The method is evaluated on cardiac monitoring data, for extracting typical patterns that can be associated with cardiac arrhythmias. We also plan to mine electrical consumption data in the context of a collaboration with EDF (Électricité de France).
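To make the idea concrete, here is a minimal, self-contained sketch of Apriori-like counting of ordered event pairs together with an estimate of their inter-event delay interval. The event names, the once-per-sequence support counting and the mean ± standard-deviation interval are illustrative assumptions, not the actual algorithm of [65].

```python
# Hypothetical sketch of mining temporal patterns with numerical interval
# information; event names, thresholds and the delay statistic are invented.
from itertools import combinations
from statistics import mean, stdev

# A temporal sequence: list of (event_type, begin, end), sorted by begin time.
sequences = [
    [("P", 0.0, 0.2), ("QRS", 0.3, 0.4), ("T", 0.6, 0.8)],
    [("P", 0.0, 0.2), ("QRS", 0.35, 0.45), ("T", 0.7, 0.9)],
    [("QRS", 0.1, 0.2), ("T", 0.5, 0.7)],
]

def frequent_pairs(sequences, min_support=2):
    """Count ordered event pairs and collect the delays between them."""
    support, delays = {}, {}
    for seq in sequences:
        seen = set()
        for (a, _, a_end), (b, b_begin, _) in combinations(seq, 2):
            pair = (a, b)
            if pair not in seen:              # count support once per sequence
                support[pair] = support.get(pair, 0) + 1
                seen.add(pair)
            delays.setdefault(pair, []).append(b_begin - a_end)
    return {p: delays[p] for p, s in support.items() if s >= min_support}

def interval_of(delays):
    """Summarise the observed delays as a [mean - sd, mean + sd] interval."""
    m = mean(delays)
    sd = stdev(delays) if len(delays) > 1 else 0.0
    return (m - sd, m + sd)

for pair, ds in frequent_pairs(sequences).items():
    print(pair, "delay interval:", interval_of(ds))
```

A statistically grounded variant would replace the simple mean ± standard-deviation summary by a proper estimate of the delay distribution, which is precisely the step whose efficiency we are trying to improve.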
- Mining sequential patterns from data streams
During her thesis [76], Alice Marascu proposed methods for data stream processing and analysis. More precisely, she devised CLUSO, a method that computes clusters from a sequence stream, extracts frequent sequential patterns from the clusters and maintains their history in order to summarize the stream. During her post-doctoral stay, extensions to the CLUSO method will be investigated in order to deal with more complex patterns (in particular temporal patterns with numerical information), to detect and characterize changes in the data, such as trends and deviations, and to take into account the quality of the data (missing values, noisy data, etc.).
- Dealing with concept change in data stream mining
We are investigating a multi-diagnoser approach for detecting changes in a data stream and for adapting the diagnosers of surveillance systems.
In this framework, several diagnosers process the input data stream to construct their own diagnosis. The global diagnosis is obtained by fusing these individual diagnoses. We use Dempster-Shafer evidence theory [90] to model the diagnoses and to perform their fusion. The diagnosers are themselves continuously (meta-)diagnosed to assess whether a change has occurred and to decide whether one or several diagnosers should be adapted. The meta-diagnosis is computed by checking predefined relations between the diagnosers in order to detect faulty diagnoses. These relations take the form of integrity constraints that express information redundancy between diagnosers. If a set of diagnoses does not satisfy an integrity constraint, it means that two or more diagnosers disagree on some input observations. In such a situation, the faulty diagnosers are detected by comparing their own diagnosis to the global diagnosis decision. The involved diagnosers are able to adapt themselves to concept change with respect to the recommended diagnosis decision.
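As an illustration, the following minimal sketch fuses the outputs of two diagnosers with Dempster's rule of combination. The hypothesis names, mass values and the use of the conflict mass as a meta-diagnosis signal are assumptions made for the example, not the actual system described above.

```python
# Illustrative sketch of Dempster-Shafer fusion of two diagnosers' outputs.
from itertools import product

def combine(m1, m2):
    """Dempster's rule of combination for mass functions given as
    {frozenset(hypotheses): mass} dictionaries over the same frame."""
    fused, conflict = {}, 0.0
    for (a, ma), (b, mb) in product(m1.items(), m2.items()):
        inter = a & b
        if inter:
            fused[inter] = fused.get(inter, 0.0) + ma * mb
        else:
            conflict += ma * mb              # mass assigned to the empty set
    if conflict >= 1.0:
        raise ValueError("total conflict: the diagnosers fully disagree")
    k = 1.0 - conflict
    return {h: m / k for h, m in fused.items()}, conflict

# Two diagnosers observing the same query stream (made-up mass values).
normal, attack = frozenset({"normal"}), frozenset({"attack"})
either = normal | attack
m_chars  = {normal: 0.7, attack: 0.2, either: 0.1}   # character-distribution view
m_tokens = {normal: 0.3, attack: 0.6, either: 0.1}   # token-distribution view

fused, conflict = combine(m_chars, m_tokens)
print("fused:", fused, "conflict:", round(conflict, 2))
# A high conflict value can act as a meta-diagnosis signal that at least one
# diagnoser no longer fits the observed data and should be adapted.
```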
The method has been evaluated on an adaptive intrusion detection system monitoring the queries addressed to a web server [13], [23], [19]. Several diagnosers have been implemented by taking different views of the input data: the distribution of characters in single queries or in sessions (sets of queries associated with the same user within a logical temporal window), the distribution of tokens in queries or sessions, the distribution of bi-grams (sequences of two characters), etc. The system is able to detect changes due to the arrival of new kinds of attacks not previously encountered and to adapt the diagnoser models (the distributions of objects they rely upon) accordingly. The adaptive system shows better performance (a higher detection rate and a lower false positive rate) than a non-adaptive version of the same system.
We have also proposed a novel intrusion detection method that detects anomalies in an on-line and adaptive fashion through dynamic clustering of unlabelled audit data streams [16], [12], [11]. The framework shows self-managing capabilities: self-labeling, self-updating and self-adapting. The method is based on a recently developed clustering algorithm, Affinity Propagation (AP) [59]. Given an audit data stream, our method identifies outliers (suspicious accesses or requests) with AP. If an outlier is identified, it is marked as suspicious and put into a reservoir. Otherwise, the detection model is updated until a change is detected, which triggers a rebuilding of the model through clustering. The suspicious examples are considered really anomalous only if they are marked as suspicious again after the model has been rebuilt. Thus, our detection model does not need labeled data and can be used on-line. The method has been evaluated on a very large set of real HTTP log data collected at INRIA as well as on a subset of the KDD 1999 benchmark data. The experimental results show that the method outperforms three static methods in terms of both effectiveness and efficiency.
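The general idea can be sketched as follows, assuming scikit-learn's AffinityPropagation. The synthetic features, the small-cluster heuristic and the reservoir handling are simplified placeholders rather than the detection model of the papers cited above.

```python
# Minimal sketch of flagging suspicious requests in a window of audit data.
import numpy as np
from sklearn.cluster import AffinityPropagation

rng = np.random.default_rng(0)
window = np.vstack([rng.normal(0, 1, (50, 2)),     # features of normal requests
                    rng.normal(8, 0.5, (2, 2))])   # a few unusual requests

ap = AffinityPropagation(random_state=0).fit(window)
sizes = np.bincount(ap.labels_)

# Points belonging to very small clusters are treated as outliers and
# placed in a reservoir of suspicious examples (illustrative heuristic).
min_cluster_size = 3
reservoir = [i for i, label in enumerate(ap.labels_)
             if sizes[label] < min_cluster_size]
print("indices placed in the reservoir:", reservoir)

# In the full method, reservoir points are re-examined after the clustering
# model is rebuilt; only those flagged twice are reported as anomalous.
```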
Learning decision-oriented rules from simulation data
In the framework of the Sacadeau project, our aim is to build decision support systems that help catchment managers preserve stream-water quality [3]. In collaboration with Inra researchers, three actions have been conducted in parallel [20].
-
The first one consisted in building a qualitative model to simulate pesticide transfer through the catchment, from the time of its application by the farmers to its arrival at the stream. The originality of the model is the representation of water and pesticide runoffs with tree structures whose leaves and roots correspond respectively to the up-stream and down-stream parts of the catchment (a toy sketch of this representation is given after this list). Though Inra is the main contributor, we have participated actively in its realization. This model has been implemented and used for simulation. An in-depth analysis of many simulation results led us to refine the model. A paper on the Sacadeau model appeared in [5].
-
The second action consisted in identifying some of the input variables as main pollution factors and in learning rules relating these factors to the pesticide concentration in the stream. During the learning process, we focus on actionable factors, in order to get rules that are helpful for decision-makers. Moreover, we take a particular interest in the spatial relations between cultivated plots and in the characteristics of crop management practices [92]. To deal with the complex spatial relations existing between the catchment plots, we experimented with two learning approaches. The first one consisted in extending Inductive Logic Programming (ILP) to tree-structured patterns. The choice of ILP was motivated by the aim of obtaining easy-to-read, explanatory rules. This was done using the Aleph software (after a first experiment with ICL). The second approach consisted in propositionalizing the learning examples and using a propositional learner, namely CN2. A comparison of these two approaches and an analysis of the results can be found in R. Trepos's PhD thesis [93]. A paper has been submitted on this issue and a revised version will be submitted soon to the Environmental Modelling and Software journal.
-
The final aim is to go beyond the simple use of classification rules for prediction, by assisting the user in the post-analysis and exploitation of a large set of rules. The goal is to provide advice on how to reduce pollution, whereas the learned rules are classification rules predicting whether a given farming strategy or climate leads to a polluted situation or not. The propositional rules learned in the second step are automatically analysed by the Dakar algorithm (Discovery of Actionable Knowledge And Recommendations) [91] to propose actions well suited to improving a given situation. Another way to help experts deal with a large set of rules is through the visualization techniques that have been developed. The two rule-learning approaches, the Dakar algorithm and the visualization techniques are fully described in [93].
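As announced in the first item above, here is a toy sketch of the tree representation of runoff, where leaves stand for up-stream plots and the root for the down-stream outlet. Plot names, transfer coefficients and the propagation rule are invented for illustration and do not come from the Sacadeau model itself.

```python
# Toy illustration of the tree representation of pesticide runoff.
class Plot:
    def __init__(self, name, applied=0.0, transfer=0.5, children=None):
        self.name = name
        self.applied = applied          # pesticide applied on this plot (g)
        self.transfer = transfer        # fraction passed down-stream
        self.children = children or []  # up-stream plots draining into this one

    def outflow(self):
        """Pesticide leaving this plot toward the down-stream node."""
        inflow = sum(child.outflow() for child in self.children)
        return self.transfer * (self.applied + inflow)

# Two up-stream (leaf) plots drain into an intermediate plot, then the outlet.
catchment = Plot("outlet", transfer=1.0, children=[
    Plot("plot_B", applied=0.0, transfer=0.6, children=[
        Plot("plot_A1", applied=120.0, transfer=0.4),
        Plot("plot_A2", applied=80.0,  transfer=0.3),
    ]),
])
print("pesticide reaching the stream:", catchment.outflow(), "g")
```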
In the Appeau context, the idea is to study how the Sacadeau-style model can be used in a more generic way and to compare, and possibly unify, our work with what is done by our partners from Sas/Inra concerning nitrate transfer. The main difference between the two contexts, pesticides on the one hand and nitrates on the other, is that spatial issues (water paths, connected plots) are the most important in the first case, whereas temporal issues are the most important in the second. Two actions are planned:
-
The first action consisted in using the TNT2 model, a hydrological model built by our colleagues from Sas/Inra and dedicated to topography-based simulation of nitrogen transfer and transformation, in order to run scenarios and obtain interesting simulation results to be exploited in a later phase.
-
The second action has just started and consists in analysing the simulation results obtained in the previous step in order to get information on the main influential variables. The idea is to use suitably adapted learning and data mining tools.
In the Accasya context, the challenge is to transform these simulation models into decision-aid tools able to answer queries about the future evolution of ecosystems. Two issues are studied:
-
When dealing with environmental problems, scenarios are widely used tools for evaluating the future evolution of ecosystems given policy options, potential climatic changes or the impacts of catastrophic events. While scenarios are generally expressed in natural language (especially in the field of environmental sciences), using ecological model simulators requires transforming them into formalized queries that can be given as input to the model. A first step in this direction is described in 6.1.4 (see also [10]). It concerns halieutic ecosystems, but we intend to use a similar approach in the context of catchment management.
-
Another issue consists in answering user queries by using user-oriented incremental learning. A first approach dedicated to incremental learning has been undertaken. The idea is to learn more interesting rules by selecting well-suited learning examples. The model is run iteratively: each cycle consists in (i) setting parameters after analysing the results of earlier cycles; (ii) running the model and getting new data; (iii) learning rules from these data; and (iv) selecting the fittest rules according to given criteria and adding them to the global theory. One of the key issues is the choice of the parameters from the results of previous cycles. Currently, this choice is made (either automatically or guided by an expert) so as to produce rules that improve the quality of the global theory. In our case, the weed control strategies appeared to be the most relevant parameters, in the sense that they heavily impact the final results. We proposed to group the strategies into six clusters according to their characteristics, and to assign a choice probability to each cluster according to distances. This preliminary work has been presented at the SIDE/Inforsid workshop “Systèmes d'Information et de Décision pour l'Environnement” [17]. The next step is to design and use more sophisticated criteria for choosing rules (step (iv) above). A schematic sketch of this iterative loop is given below.
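In this sketch of the four-step cycle, every function is a hypothetical placeholder standing for the corresponding step (i)-(iv); the strategy clusters, sampling bias and rule-quality criterion are invented for illustration and are not the project's actual code.

```python
# Schematic sketch of the iterative, user-oriented incremental learning loop.
import random

def choose_parameters(previous_rules):
    """(i) Pick simulation parameters, here by sampling a weed-control
    strategy cluster with probabilities biased by earlier results."""
    clusters = ["c1", "c2", "c3", "c4", "c5", "c6"]
    weights = [1 + len(previous_rules)] + [1] * 5      # illustrative bias
    return {"strategy_cluster": random.choices(clusters, weights)[0]}

def run_simulation(params):
    """(ii) Run the model and return labelled examples (placeholder)."""
    return [{"strategy": params["strategy_cluster"],
             "polluted": random.random() < 0.5} for _ in range(20)]

def learn_rules(examples):
    """(iii) Learn classification rules from the new data (placeholder)."""
    polluted = sum(e["polluted"] for e in examples)
    return [{"if": "strategy == %s" % examples[0]["strategy"],
             "then": "polluted" if polluted > len(examples) / 2 else "clean",
             "accuracy": max(polluted, len(examples) - polluted) / len(examples)}]

def select_rules(rules, min_accuracy=0.6):
    """(iv) Keep only the rules meeting a quality criterion."""
    return [r for r in rules if r["accuracy"] >= min_accuracy]

theory = []
for cycle in range(5):
    params = choose_parameters(theory)
    data = run_simulation(params)
    theory += select_rules(learn_rules(data))
print("rules in the global theory:", len(theory))
```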
A thesis funded by the ANR/ADD Accassya project will start at the beginning of 2010.