Section: New Results

Machine learning for model acquisition

Participants : Marie-Odile Cordier, Thomas Guyet, Christine Largouët, Alice Marascu, Véronique Masson, René Quiniou.

Model acquisition is an important issue for model-based diagnosis, especially when modeling dynamic systems. We investigate machine learning methods for temporal data recorded by sensors and for spatial data resulting from simulation processes.

Learning and mining from sequential and temporal data

Our main interest is extracting knowledge, especially sequential and temporal patterns or prediction rules, from static or dynamic data (data streams). We are particularly interested in detecting and dealing with concept change in data stream mining, in order to adapt on line the models used for diagnosis.

Mining temporal patterns with numerical information


This work aims at exploring a new solution for learning the temporal extents of events in chronicle models. Mining sequential patterns and extracting frequent temporal episodes have been widely explored. However, in a monitoring context, the numerical information associated with event durations or with the delays between event occurrences is very meaningful for discriminating between faults or diseases. Due to the added complexity of such patterns, numerical temporal pattern mining has been less explored than sequential pattern mining, which relies on the simpler notion of precedence.

We focus on mining temporal interval patterns from time-series databases or from sequence databases, where a temporal sequence is a set of events with time-stamped beginnings and ends. Our first proposal relies on an algorithm that processes temporal sequences represented by hyper-cubes [65] . Candidate generation uses the classical Apriori scheme [29] enriched with constraints on temporal intervals. Frequent temporal pattern selection estimates the interval distribution in order to extract the statistically significant temporal intervals. The algorithm is implemented in Matlab (see section 5.5 ).
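The level-wise scheme can be illustrated by a minimal sketch (not the actual hyper-cube algorithm): Apriori-style counting of frequent event pairs over timestamped sequences, enriched with the delay interval observed between the two events. The representation of sequences as lists of (event, timestamp) pairs and the use of first occurrences only are simplifying assumptions.

```python
from itertools import combinations
from collections import defaultdict

def apriori_temporal(sequences, min_support):
    """Level-wise (Apriori-style) mining of frequent event pairs,
    keeping the delays observed between their occurrences.
    Each sequence is a list of (event, timestamp) pairs."""
    # Level 1: events frequent in enough sequences
    counts = defaultdict(int)
    for seq in sequences:
        for ev in {e for e, _ in seq}:
            counts[ev] += 1
    frequent = {e for e, c in counts.items() if c >= min_support}

    # Level 2: candidate pairs built only from frequent events (pruning),
    # recording the delay between first occurrences (simplification)
    pair_support = defaultdict(int)
    pair_delays = defaultdict(list)
    for seq in sequences:
        times = {}
        for ev, t in seq:
            if ev in frequent:
                times.setdefault(ev, t)
        for a, b in combinations(sorted(times), 2):
            pair_support[(a, b)] += 1
            pair_delays[(a, b)].append(abs(times[a] - times[b]))

    # Each frequent pair is returned with its observed delay interval
    return {p: (s, min(pair_delays[p]), max(pair_delays[p]))
            for p, s in pair_support.items() if s >= min_support}
```

The interval (min, max delay) attached to each frequent pair stands in for the temporal-interval constraints mentioned above; the real algorithm selects statistically significant intervals rather than the raw extent.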

We are currently working on reducing the complexity of the algorithm, more precisely on estimating the distribution of temporal intervals more efficiently. The method is evaluated on cardiac monitoring data, extracting typical patterns that can be associated with cardiac arrhythmias. We also plan to mine electrical consumption data in the context of a collaboration with EDF (Électricité de France).
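As a toy stand-in for the interval-distribution estimation step, one can fit simple summary statistics to the observed delays and keep only the central values; the k-sigma cut used here is an assumption, not the statistical test of the actual method.

```python
import statistics

def significant_interval(delays, k=2.0):
    """Estimate a characteristic temporal interval from observed delays:
    keep the values within k standard deviations of the mean (a simple
    stand-in for a statistical-significance test on the distribution)."""
    mu = statistics.fmean(delays)
    sigma = statistics.pstdev(delays)
    kept = [d for d in delays if abs(d - mu) <= k * sigma]
    return (min(kept), max(kept))
```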

Mining sequential patterns from data streams


In her thesis [76] , Alice Marascu proposed methods for data stream processing and analysis. In particular, she devised CLUSO, a method that computes clusters from a sequence stream, extracts frequent sequential patterns from the clusters, and maintains their history in order to summarize the stream. During her post-doctoral stay, extensions to the CLUSO method will be investigated in order to deal with more complex patterns, in particular temporal patterns with numerical information, to detect and characterize changes in the data, such as trends and deviations, and to take into account the quality of the data (missing values, noisy data, etc.).
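The general idea of summarizing a stream through frequent patterns and their history can be sketched as follows; this is an illustrative toy, not the CLUSO algorithm itself, and the batch-wise processing and length-2 subsequences are simplifying assumptions.

```python
from collections import Counter, defaultdict

def summarize_stream(batches, min_freq):
    """Toy batch-wise stream summary: for each batch of sequences,
    count length-2 subsequences, keep the frequent ones, and append
    them to a per-pattern history (batch id, support count)."""
    history = defaultdict(list)          # pattern -> [(batch_id, count)]
    for batch_id, batch in enumerate(batches):
        counts = Counter()
        for seq in batch:
            seen = set()                 # count each pattern once per sequence
            for i in range(len(seq)):
                for j in range(i + 1, len(seq)):
                    seen.add((seq[i], seq[j]))
            counts.update(seen)
        for pat, c in counts.items():
            if c >= min_freq:
                history[pat].append((batch_id, c))
    return dict(history)
```

The per-pattern history is what allows trends and deviations to be read off the summary after the stream has passed.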

Dealing with concept change in data stream mining


We are investigating a multi-diagnoser approach for detecting changes in data streams and for adapting the diagnosers of surveillance systems.

In this framework, several diagnosers process the input data stream to construct their own diagnoses. The global diagnosis is obtained by fusing these individual diagnoses. We use Dempster-Shafer evidence theory [90] to model the diagnoses and to operate their fusion. The diagnosers are themselves continuously (meta-)diagnosed to assess whether a change has occurred and to decide whether one or several diagnosers should be adapted. The meta-diagnosis is computed by checking predefined relations between the diagnosers in order to detect faulty diagnoses. These relations take the form of integrity constraints that express information redundancy between diagnosers. If a set of diagnoses does not satisfy an integrity constraint, two or more diagnosers disagree on some input observations. In such a situation, the faulty diagnosers are detected by comparing their own diagnoses to the global diagnosis decision. The diagnosers involved have the ability to adapt themselves to concept change with respect to the recommended diagnosis decision.
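The fusion step uses Dempster's rule of combination, which can be sketched for two mass functions over a frame of discernment; representing hypothesis sets as frozensets is an implementation choice, not part of the method.

```python
from itertools import product

def dempster_combine(m1, m2):
    """Dempster's rule of combination for two mass functions.
    Masses are dicts mapping frozensets of hypotheses to belief mass;
    mass assigned to empty intersections (conflict) is renormalised away."""
    combined = {}
    conflict = 0.0
    for (a, wa), (b, wb) in product(m1.items(), m2.items()):
        inter = a & b
        if inter:
            combined[inter] = combined.get(inter, 0.0) + wa * wb
        else:
            conflict += wa * wb
    if conflict >= 1.0:
        raise ValueError("total conflict: the diagnoses are incompatible")
    return {k: v / (1.0 - conflict) for k, v in combined.items()}
```

For instance, two diagnosers both leaning towards "fault" but with different confidences reinforce each other: the combined mass on "fault" exceeds either input mass.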

The method has been evaluated on an adaptive intrusion detection system monitoring the queries addressed to a web server [13] , [23] , [19] . Several diagnosers have been implemented, each taking a different view of the input data: the distribution of characters in single queries or in sessions (sets of queries associated with the same user in a logical temporal window), the distribution of tokens in queries or sessions, the distribution of bigrams (sequences of two characters), etc. The system is able to detect changes due to the arrival of new kinds of attacks not previously encountered and to adapt the diagnoser models (the distributions of objects they rely upon) accordingly. The adaptive system shows better performance (a higher detection rate and a lower false positive rate) than a non-adaptive version of the same system.
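One such view can be sketched as follows: build the character-bigram distribution of a query and compare it to a learned normal profile with a distance measure. The total-variation distance used here is an assumption; the actual diagnosers may use a different dissimilarity.

```python
from collections import Counter

def bigram_profile(text):
    """Character-bigram frequency distribution of a query string."""
    pairs = Counter(text[i:i + 2] for i in range(len(text) - 1))
    total = sum(pairs.values())
    return {bg: c / total for bg, c in pairs.items()}

def tv_distance(p, q):
    """Total-variation distance between two bigram distributions;
    a large distance from the normal profile flags a query as suspicious."""
    keys = set(p) | set(q)
    return 0.5 * sum(abs(p.get(k, 0.0) - q.get(k, 0.0)) for k in keys)
```

A query whose bigram distribution is far from the profile of legitimate traffic (e.g. heavy in characters typical of injection payloads) would then be passed to the diagnoser as suspicious.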

We have also proposed a novel intrusion detection method that detects anomalies in an on-line and adaptive fashion through dynamic clustering of unlabelled audit data streams [16] , [12] , [11] . The framework shows self-managing capabilities: self-labeling, self-updating and self-adapting. The method is based on a recently developed clustering algorithm, Affinity Propagation (AP) [59] . Given an audit data stream, our method identifies outliers (suspicious accesses or requests) with AP. An identified outlier is marked as suspicious and put into a reservoir; otherwise, the detection model is updated until a change is detected, which triggers a model rebuilding through clustering. Suspicious examples are considered really anomalous if they are marked as suspicious once again after the model rebuilding. Thus, our detection model does not need labeled data and can be used on line. The method has been evaluated on a very large set of real HTTP log data collected at INRIA, as well as on a subset of the KDD 1999 benchmark data. The experimental results show that the method outperforms three static methods in terms of both effectiveness and efficiency.
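The detect / reservoir / rebuild loop can be sketched independently of the clustering algorithm; here `cluster` and `is_outlier` are hypothetical stand-ins for Affinity Propagation and its exemplar-distance test, and rebuilding every N updates replaces the change-detection trigger of the actual method.

```python
class AdaptiveDetector:
    """Sketch of the detect / reservoir / rebuild loop described above.
    `cluster(examples)` builds a model and `is_outlier(model, x)` flags
    points far from any exemplar; both are pluggable assumptions."""

    def __init__(self, cluster, is_outlier, rebuild_every):
        self.cluster, self.is_outlier = cluster, is_outlier
        self.rebuild_every = rebuild_every
        self.normal, self.reservoir, self.anomalies = [], [], []

    def process(self, x):
        if self.is_outlier(self.cluster(self.normal), x):
            self.reservoir.append(x)      # suspicious: wait for the rebuild
        else:
            self.normal.append(x)         # self-updating on normal data
            if len(self.normal) % self.rebuild_every == 0:
                self._rebuild()           # stand-in for change detection

    def _rebuild(self):
        model = self.cluster(self.normal)
        still = [x for x in self.reservoir if self.is_outlier(model, x)]
        self.anomalies.extend(still)      # suspicious twice: really anomalous
        self.reservoir = [x for x in self.reservoir if x not in still]
```

No labels are needed anywhere in the loop: an example is confirmed anomalous only by being flagged again after the model has been rebuilt.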

Learning decision-oriented rules from simulation data

In the framework of the Sacadeau project, our aim is to build decision support systems that help catchment managers preserve stream-water quality [3] . In collaboration with Inra researchers, three actions have been conducted in parallel [20] .

In the Appeau context, the idea is to study how the Sacadeau -style model can be used in a more generic way, and to compare, and possibly unify, our work with what is done by our partners from Sas/Inra on nitrate transfer. The main difference between the two contexts, pesticides on the one hand and nitrates on the other, is that spatial issues (water paths, connected plots) are the most important in the first case, whereas temporal issues are the most important in the second. Two actions are planned:

In the Accasya context, the challenge is to transform these simulation models into decision-aid tools able to answer queries about the future evolution of ecosystems. Two issues are studied:

