## Section: Scientific Foundations

### Machine learning and data mining

The machine learning and data mining techniques investigated in the group aim at acquiring and improving models automatically. They belong to the field of machine learning, also called artificial learning [47]. In this domain, the goal is to induce or discover hidden characterizations of objects from their descriptions by a set of features or attributes. Our work has been grounded in Inductive Logic Programming (ILP) for several years, but we are now also investigating the use of data-mining techniques.

We are especially interested in structural learning, which aims at making explicit the dependencies among data where such links are not known. The relational (temporal or spatial) dimension is of particular importance in the applications we are dealing with, such as process monitoring in health care, the environment or telecommunications. Because they are strongly related to the dynamics of the observed processes, attributes carrying temporal or spatial information must be treated in a special manner. Additionally, we consider the legibility of the learned results to be of crucial importance, as domain experts must be able to evaluate and assess these results.

The discovery of spatial patterns or temporal relations in sequences of events involves two main steps: the choice of a data representation and the choice of a learning technique.

#### Temporal and spatial data representation

Temporal data comes mainly in two forms: time series and temporal sequences. Time series are sequences of real values, often obtained by sampling a signal at a regular rate. Temporal sequences are series of symbolic events: either ordered lists of simple events, which express only the precedence relation, or lists of timestamped events, which provide more precise temporal information.

Temporal sequences often come from the abstraction of time series data. This is especially true for monitoring data recorded by sensors located on the observed system. For example, an ECG can be viewed as a time series of real values, i.e. the raw sampled signal, or as a series of timestamped events describing the waves that appear on the raw signal, e.g. P-waves, QRS-complexes, T-waves, etc. The main issue in data abstraction is to select the relevant features that will represent the raw data and the granularity of abstraction that is the most convenient for the particular task. On the one hand, data should be sufficiently abstracted to enable efficient (learning) computation; on the other hand, they should be precise enough to avoid omitting essential information that could affect the accuracy of the learned models.

Since health-care data are often noisy and some of their features are difficult to detect, we have mainly investigated signal processing techniques for abstracting them. In collaboration with LTSI (University of Rennes 1), we have studied, among other techniques, wavelet transforms and neural networks for wave detection and classification [2]. For telecommunication data, where the semantics were less obvious, we have enhanced discretization methods that transform a time series into a sequence of linear segments, which are then associated with specific symbols [80].
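
As an illustration of the discretization idea, the following minimal sketch cuts a time series into linear segments and maps each segment to a symbol. It is not the method of [80]: the greedy segmentation rule, the tolerance `tol`, the slope threshold `flat` and the symbols `U`/`S`/`D` (up, steady, down) are all assumptions made for the example.

```python
from typing import List, Tuple


def _max_dev(series: List[float], i: int, j: int) -> float:
    """Largest vertical distance between the samples and the chord from i to j."""
    slope = (series[j] - series[i]) / (j - i)
    return max(abs(series[k] - (series[i] + slope * (k - i)))
               for k in range(i, j + 1))


def segment(series: List[float], tol: float) -> List[Tuple[int, int]]:
    """Greedy piecewise-linear segmentation; returns (start, end) index pairs."""
    segments = []
    start, n = 0, len(series)
    while start < n - 1:
        end = start + 1
        # Extend the segment while every sample stays within `tol` of the chord.
        while end + 1 < n and _max_dev(series, start, end + 1) <= tol:
            end += 1
        segments.append((start, end))
        start = end
    return segments


def symbolize(series: List[float], segments: List[Tuple[int, int]],
              flat: float = 0.1) -> List[str]:
    """Map each segment to a symbol according to its slope (hypothetical alphabet)."""
    symbols = []
    for i, j in segments:
        slope = (series[j] - series[i]) / (j - i)
        symbols.append("U" if slope > flat else "D" if slope < -flat else "S")
    return symbols


series = [0, 1, 2, 3, 3, 3, 3, 2, 1, 0]
segs = segment(series, tol=0.01)
print(segs, symbolize(series, segs))
```

The resulting symbol sequence is the kind of temporal sequence that symbolic learning methods can then consume; in practice the tolerance controls the granularity of abstraction discussed above.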

Generally, simple sequences of symbolic locations are not sufficient to model spatial information. This is why graphs are often used to represent spatial data. In this model, nodes represent geographic areas and edges represent spatial relations between these areas. Sometimes the data are such that trees, which are specific graphs, can be used. This is the case for hydrological modeling, for instance. The considered area is split into sub-areas whose granularity can be adapted to particular tasks. In our case, the relationships between sub-areas model the runoff transfer from an area to a lower one. This kind of representation is used to compute simulations of water and pesticide transfer, in close collaboration with the SAS-INRA research group.
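
The tree representation can be pictured with a small sketch. It is only an illustration, not the simulation model used with SAS-INRA: the class `SubArea`, the field `local_runoff` and the assumption that runoff is transferred downstream without any loss are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class SubArea:
    """A node of the tree: one sub-area and the sub-areas draining into it."""
    name: str
    local_runoff: float  # hypothetical amount produced on this sub-area
    upstream: List["SubArea"] = field(default_factory=list)


def outflow(area: SubArea) -> float:
    """Runoff leaving `area`: its own production plus everything received
    from upstream sub-areas (assumes lossless transfer)."""
    return area.local_runoff + sum(outflow(u) for u in area.upstream)


plot_a = SubArea("plot A", 2.0)
plot_b = SubArea("plot B", 3.0)
hillside = SubArea("hillside", 1.0, [plot_a, plot_b])
outlet = SubArea("outlet", 0.5, [hillside])
print(outflow(outlet))
```

Each edge of the tree goes from a sub-area to the lower one it drains into, so a single recursive traversal is enough to propagate quantities toward the outlet.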

#### Learning from structured data

We distinguish supervised and unsupervised learning methods. A learning method is supervised if samples of the objects to be classified are available and labeled with the class they belong to. Such samples are often called *learning examples*. If the examples cannot be classified a priori, the learning method is unsupervised. Kohonen maps, association rule extraction in data mining or reinforcement learning are typical unsupervised learning methods. From another point of view, learning methods can be symbolic, such as inductive rule or decision tree learning, or numerical, such as artificial neural networks.

We are mainly interested in symbolic methods, both supervised and unsupervised. Furthermore, we are investigating methods that can cope with temporal relationships in data. In the sequel, we give some details about relational learning, relational data mining and data stream mining.

- Relational learning
Relational learning, formerly called inductive logic programming (ILP), is a research topic at the intersection of machine learning, logic programming and automated deduction. The main goal of relational learning is the induction of classification or prediction rules from examples and from domain knowledge. As relational learning relies on first-order logic, it provides a very expressive and powerful language for representing learning hypotheses, especially those learnt from temporal data. Furthermore, domain knowledge represented in the same language can also be used. This is a very interesting feature, since it makes it possible to take already available knowledge into account and avoids starting learning from scratch.

Concerning temporal data, our work is concerned with applying relational learning rather than with developing or improving the techniques themselves. Nevertheless, as noticed by Page and Srinivasan [78], the target application domains (such as signal processing in health care) can benefit from adapting relational learning schemes to the particular features of the application data. Therefore, constraint programming has been associated with ILP for relational learning in order to infer numerical values efficiently [89]. Extensions, such as QSIM [66], have also been used for learning a model of the behavior of a dynamic system [57]. More precisely, we investigate how to associate temporal abstraction methods with learning and with chronicle recognition. We are also interested in constraint clause induction, particularly for managing temporal aspects. In this setting, some variables are devoted to the representation of temporal phenomena and are managed by a constraint system [81] in order to deal efficiently with the associated computations (such as covering tests).

Concerning environmental data, we have investigated tree structures where nodes are described by a set of attributes. Our goal is to find patterns expressed as sub-trees [43] with attribute selectors associated with nodes.

- Data mining
Data mining is an unsupervised learning method which aims at discovering interesting knowledge from data. Association rule extraction is one of the most popular approaches and has attracted a lot of interest in the last 10 years. For instance, many enhancements have been proposed to the well-known Apriori algorithm [29]. Apriori is based on a levelwise generation of candidate patterns and on efficient candidate pruning based on a notion of minimal relevance, usually related to the frequency of the candidate pattern in the dataset (i.e. the support): the most frequent patterns should be the most interesting. Later, Agrawal and Srikant proposed a framework for "mining sequential patterns" [30], an extension of Apriori where the order of elements in patterns is considered.
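
The levelwise principle can be sketched in a few lines. This is a didactic version for frequent itemsets only, not an optimized or faithful reproduction of [29]; the function name `apriori`, the absolute support threshold `min_support` and the toy transactions are assumptions made for the example.

```python
from itertools import combinations
from typing import Dict, FrozenSet, List, Set


def apriori(transactions: List[Set[str]],
            min_support: int) -> Dict[FrozenSet[str], int]:
    """Return every itemset contained in at least `min_support` transactions."""

    def support(itemset: FrozenSet[str]) -> int:
        return sum(1 for t in transactions if itemset <= t)

    # Level 1: every single item is a candidate.
    level = {frozenset([item]) for t in transactions for item in t}
    frequent: Dict[FrozenSet[str], int] = {}
    k = 1
    while level:
        counted = {c: support(c) for c in level}
        kept = {c: s for c, s in counted.items() if s >= min_support}
        frequent.update(kept)
        # Join step: unions of two frequent k-itemsets that have size k + 1.
        candidates = {a | b for a in kept for b in kept if len(a | b) == k + 1}
        # Prune step: a candidate survives only if all its k-subsets are frequent.
        level = {c for c in candidates
                 if all(frozenset(s) in kept for s in combinations(c, k))}
        k += 1
    return frequent


transactions = [{"a", "b", "c"}, {"a", "b"}, {"a", "c"}, {"b", "c"}, {"a", "b", "c"}]
for itemset, count in sorted(apriori(transactions, 3).items(),
                             key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), count)
```

The prune step relies on the fact that support can only decrease when an itemset grows, which is what makes the levelwise search tractable.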

In [75], Mannila and Toivonen extended the work of Agrawal et al. by introducing an algorithm for mining patterns involving temporal episodes, with a distinction between parallel and sequential event patterns. Later, in [54], Dousson and Vu Duong introduced an algorithm for mining chronicles. Chronicles are sets of events associated with temporal constraints on their occurrences. They generalize the temporal patterns of Mannila and Toivonen. Candidate generation is achieved by an Apriori-like algorithm. The chronicle recognizer CRS [52] is used to compute the support of patterns. Then, each temporal constraint is computed as an interval whose bounds are the minimal and the maximal delay separating the occurrences of two given events in the dataset. Chronicles are very interesting for three reasons: they can model a system behavior with sufficient precision to compute fine diagnoses, they can be extracted reasonably efficiently from a dataset, and they can be efficiently recognized on an input data stream.
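
The computation of a temporal constraint as a minimal and maximal delay can be illustrated as follows. This is a simplified sketch, not the algorithm of [54] nor the CRS recognizer: the function `delay_interval` is hypothetical, timestamps are assumed to be integers (e.g. milliseconds), and only the first occurrence of each event type in a sequence is considered.

```python
from typing import Dict, List, Optional, Tuple

Event = Tuple[str, int]  # (event type, timestamp), e.g. in milliseconds


def delay_interval(sequences: List[List[Event]],
                   first: str, second: str) -> Optional[Tuple[int, int]]:
    """Interval [min, max] of the delay from `first` to `second`, taken over
    all sequences that contain both event types."""
    delays = []
    for seq in sequences:
        times: Dict[str, int] = {}
        for etype, t in seq:
            times.setdefault(etype, t)  # keep the first occurrence only
        if first in times and second in times:
            delays.append(times[second] - times[first])
    if not delays:
        return None  # the two events never occur together
    return (min(delays), max(delays))


# Toy abstraction of three heartbeats into timestamped wave events.
beats = [
    [("P", 0), ("QRS", 160), ("T", 400)],
    [("P", 1000), ("QRS", 1180), ("T", 1420)],
    [("P", 2000), ("QRS", 2150)],
]
print(delay_interval(beats, "P", "QRS"))
print(delay_interval(beats, "QRS", "T"))
```

A chronicle would combine several such intervals, one per constrained pair of events, together with the set of events itself.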

Relational data-mining [25] can be seen as a generalization of these works to first-order patterns. Interesting proposals have been made in this field, for instance the work of Dehaspe on extracting first-order association rules, which have strong links with chronicles. Another interesting line of research concerns inductive databases, which aim at giving a theoretical and logical framework to data-mining [67], [51]. In this view, the mining process is considered as querying a database containing raw data as well as patterns that are implicitly coded in the data. The answer to a query is either retrieved directly, if the solution patterns are already present in the database, or computed by a mining algorithm, e.g. Apriori. The original work is concerned with sequential patterns only [70]. We have investigated an extension of inductive databases where patterns are very close to chronicles [94].

- Mining data streams
In recent years, a new challenge has appeared in the data mining community: mining from data streams [27]. Data coming, for example, from monitoring systems observing patients or from telecommunication systems arrive in such huge volumes that they cannot be stored in their entirety for further processing: the key feature is that "you get only one look at the data" [61]. Many investigations have been made to adapt existing mining algorithms to this particular context or to propose new solutions: for example, methods for building synopses of past data in the form of summaries have been proposed, as well as representation models taking advantage of the most recent data. Sequential pattern stream mining is still an open issue [77]. At present, research topics such as sampling, summarizing, clustering and mining data streams are actively investigated.

A major issue in data streams is to take into account the fact that the process generating data is dynamic, i.e. that the underlying model is evolving, so the extracted patterns have to be adapted constantly. This feature, known as *concept drift* [95], [69], occurs within an evolving system when the state of some hidden system variables changes. This is the source of important challenges for data stream mining [60] because it is impossible to store all the data for off-line processing or learning. Thus, changes must be detected on-line and the current mined models must be updated on-line as well.
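
To make the on-line constraint concrete, here is a deliberately naive drift detector. It is not taken from [95], [69] or [60]: the class `WindowDriftDetector`, the comparison of window means and the parameters `window` and `threshold` are assumptions chosen for illustration. The point is only that detection and model update both happen in one pass, with bounded memory.

```python
from collections import deque
from typing import Deque


class WindowDriftDetector:
    """Flags a drift when the mean of the most recent window moves away
    from the mean of a reference window by more than `threshold`."""

    def __init__(self, window: int, threshold: float):
        self.window = window
        self.threshold = threshold
        self.reference: Deque[float] = deque(maxlen=window)
        self.recent: Deque[float] = deque(maxlen=window)

    def update(self, value: float) -> bool:
        """Consume one stream value; return True when a drift is flagged."""
        if len(self.reference) < self.window:
            self.reference.append(value)  # still filling the reference window
            return False
        self.recent.append(value)  # the oldest recent value is dropped when full
        if len(self.recent) < self.window:
            return False
        gap = abs(sum(self.recent) / self.window
                  - sum(self.reference) / self.window)
        if gap > self.threshold:
            # Update step: the recent window becomes the new reference model.
            self.reference = deque(self.recent, maxlen=self.window)
            self.recent.clear()
            return True
        return False


detector = WindowDriftDetector(window=3, threshold=2.0)
stream = [0, 0, 0, 0, 0, 0, 5, 5, 5, 5, 5, 5]
flagged = [i for i, x in enumerate(stream) if detector.update(x)]
print(flagged)
```

Only two windows of fixed size are ever kept, which respects the impossibility of storing the whole stream; a realistic detector would replace the mean comparison with a statistically grounded test.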