## Section: New Results

Keywords: AUC-based Learning, Feature Selection, Human-Computer Interaction, Visual Data Mining, Methodological Aspects, Meta-learning, Competence Maps, Inductive Logic Programming, Constraint Satisfaction, Phase Transitions, Bounded Relational Reasoning.

### Fundamentals of Machine Learning, Knowledge Extraction and Data Mining

Participants : Nicolas Baskiotis, Nicolas Bredèche, Antoine Cornuéjols, Sylvain Gelly, Michèle Sebag, Olivier Teytaud.

**Abstract**:
This theme focuses on machine learning, knowledge discovery and data
mining (ML/KDD/DM), investigating: i) the learning criteria; ii) the
selection of features and hypotheses; iii) the randomized and quasi-
randomized selection of examples; iv) the specificities of relational
learning, in relation with phase transitions; v) the Multi-Armed Bandit
framework.

Two book chapters (in French) have been published on these topics, one on the fundamental statistical elements of machine learning [9] and one on the challenges of data mining [13].

Many activities below will refer to the PASCAL (Pattern Analysis, Statistical Modelling and Computational Learning) Network of Excellence (http://www.pascal-network.org), 2003-2007, which involves most major ML research groups in Europe, including TAO and SEQUEL from INRIA Futurs. M. Sebag, in charge of the Université Paris-Sud site in Pascal, is manager of the Pascal Challenge programme and member of the Pascal Steering Committee.

#### New learning criteria

**Non-convex criteria**.

The flexible and effective evolutionary computation (EC) framework enables the exploration of new, non-convex learning criteria.
The combinatorial Area Under the ROC Curve (AUC) criterion is optimized by the
*ROC-based GEnetic learneR* (ROGER) algorithm, which has been successfully applied
to bio-informatics and text mining, e.g. to rank candidate terms during the
terminology extraction step (J. Azé, M. Roche, Y. Kodratoff, M. Sebag:
Preference Learning in Terminology Extraction: A ROC-based Approach.
In Applied Stochastic Models and Data Analysis (ASMDA), 2005).
ROGER has been extended to other criteria,
motivated by low-quality datasets in bioinformatics
and inspired by the energy-based learning framework proposed by Y. Le Cun (2006);
these criteria are investigated by A. Rimmel (PhD student under A. Cornuéjols'
and M. Sebag's supervision).
Along the same lines, the search for stable patterns in
spatio-temporal data mining was formalized as a multi-objective
multi-modal optimization problem and applied to functional brain imaging
(M. Sebag, N. Tarrisson, O. Teytaud, S. Baillet, J. Lefèvre:
A Multi-Objective Multi-Modal Optimization Approach for Mining Stable Spatio-Temporal Patterns.
In Proc. Int. Joint Conf. on Artificial Intelligence, IJCAI 2005, L. Kaelbling Ed., IOS Press, pp. 859-864)
(ACI NIM NeuroDyne contract, in collaboration with Hôpital de la Pitié-Salpêtrière, LENA).
In this perspective, the discriminant power of a hypothesis/pattern is handled as
one objective among several [33] (invited keynote speech at COGIS 2006).
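The AUC-as-fitness idea can be conveyed by a minimal sketch: a (1+λ)-style evolution strategy that directly maximizes the exact, combinatorial AUC of a linear scoring function. This is an illustration of the principle only, not the ROGER algorithm itself; the toy data and all parameters are hypothetical.

```python
import random

def auc(scores, labels):
    """Exact (combinatorial) AUC: fraction of (positive, negative) pairs
    ranked correctly, counting ties as 1/2."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def evolve_ranker(X, y, offspring=20, gens=100, sigma=0.3, seed=0):
    """(1+lambda) evolution strategy maximizing the AUC of the linear score
    w.x -- an illustration of AUC-driven evolutionary learning, not ROGER."""
    rng = random.Random(seed)
    d = len(X[0])
    best = [rng.gauss(0, 1) for _ in range(d)]
    fit = lambda w: auc([sum(wi * xi for wi, xi in zip(w, x)) for x in X], y)
    best_fit = fit(best)
    for _ in range(gens):
        for _ in range(offspring):
            cand = [wi + rng.gauss(0, sigma) for wi in best]  # Gaussian mutation
            if (f := fit(cand)) >= best_fit:
                best, best_fit = cand, f
    return best, best_fit

# Toy data: only the first coordinate carries the class signal.
rng = random.Random(1)
X = [[rng.gauss(2.0 * (i % 2), 1), rng.gauss(0, 1)] for i in range(60)]
y = [i % 2 for i in range(60)]
w, train_auc = evolve_ranker(X, y)
```

Since the AUC is a sum of step functions, it is non-convex and non-differentiable in w, which is exactly why a direct evolutionary search is a natural fit here.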

Interestingly, EC can naturally be used to generate many hypotheses (each run provides a new hypothesis, conditionally independent of the others given the dataset). These hypotheses can be plugged into an ensemble learning setting at no extra cost, with significant gains in accuracy.

This line of research differs from mainstream ML, which mostly considers convex criteria for the sake of solution uniqueness and optimization feasibility. Interestingly, while recent advances in ML have integrated the AUC criterion into the convex optimization setting through a quadratic number of constraints (Joachims 2005), that approach must resort to greedy heuristics in order to keep the computational cost within acceptable limits.

**Criteria and bounds**.

Learning Bayesian Networks (BN) mixes non-parametric and parametric learning, as one must identify the structure of the network together with its weights (conditional dependency tables). S. Gelly and O. Teytaud have proposed a new complexity measure, accounting for the non-parametric complexity besides the standard number of weights [7] .

Furthermore, a loss-based criterion has been proposed for the parametric learning task. While this criterion is more computationally demanding, it is more stable than the standard one, and should be preferred in particular when dealing with small datasets.

Empirical results demonstrate substantial improvements compared to the state of the art, even in the limit of large samples. Other results, combining the above with classical learning theory, include proofs of convergence to a minimal sufficient structure.
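As a point of reference for such penalized structure-scoring criteria, the sketch below scores a two-variable binary network with the standard BIC penalty (log-likelihood minus half the free-parameter count times log n). This is the classical baseline only, not the complexity measure of [7]; the data and structure choice are hypothetical.

```python
import math
from collections import Counter

def bic_two_node(data, dependent):
    """BIC score of a binary two-variable network over samples (a, b):
    either A and B independent, or with the edge A -> B.
    Standard penalized likelihood, shown only as the classical baseline."""
    n = len(data)
    ca = Counter(a for a, _ in data)
    loglik = sum(c * math.log(c / n) for c in ca.values())     # A's marginal
    if dependent:
        cab = Counter(data)
        loglik += sum(c * math.log(c / ca[a])                  # B given A
                      for (a, b), c in cab.items())
        k = 1 + 2      # P(A=1), plus P(B=1 | A=a) for each value of A
    else:
        cb = Counter(b for _, b in data)
        loglik += sum(c * math.log(c / n) for c in cb.values())  # B's marginal
        k = 1 + 1      # P(A=1) and P(B=1)
    return loglik - 0.5 * k * math.log(n)   # BIC penalizes each free parameter

# Strongly correlated data: B copies A 80% of the time.
data = [(1, 1)] * 40 + [(0, 0)] * 40 + [(1, 0)] * 10 + [(0, 1)] * 10
score_dep = bic_two_node(data, dependent=True)
score_ind = bic_two_node(data, dependent=False)
```

On these data the dependent structure scores higher despite its extra parameter, illustrating the likelihood/complexity tradeoff that any such criterion must arbitrate.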

#### Selection of Features and Patterns

Feature Selection arises as a pre-processing selection task for ML, while Pattern/Hypothesis Selection is viewed as a post-processing selection task in ML or DM.

Irrelevant features severely hinder the learning task, in terms of computational cost as well as predictive accuracy, particularly when the number of examples is small and/or the signal-to-noise ratio is low, as is the case in micro-array analysis. Two methods inspired by ensemble methods were proposed for feature selection in 2005. The theoretical study of these approaches is tackled by Romaric Gaudel (PhD student under M. Sebag's and A. Cornuéjols's supervision).
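The flavor of ensemble-based selection can be conveyed by a small sketch: score features on many bootstrap resamples and retain those that win consistently across resamples. This is a generic illustration of the principle, not the two 2005 methods themselves; the correlation criterion and all parameters are arbitrary choices.

```python
import math
import random

def abs_correlation(xs, ys):
    """|Pearson correlation| between a feature column and the labels."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return abs(cov) / math.sqrt(vx * vy) if vx > 0 and vy > 0 else 0.0

def bootstrap_feature_votes(X, y, rounds=50, seed=0):
    """On each bootstrap resample, vote for the highest-scoring feature;
    features with a stable signal accumulate most of the votes
    (a generic ensemble-selection sketch, not the 2005 methods)."""
    rng = random.Random(seed)
    n, d = len(X), len(X[0])
    votes = [0.0] * d
    for _ in range(rounds):
        idx = [rng.randrange(n) for _ in range(n)]            # bootstrap indices
        scores = [abs_correlation([X[i][j] for i in idx],
                                  [y[i] for i in idx]) for j in range(d)]
        votes[max(range(d), key=scores.__getitem__)] += 1 / rounds
    return votes

# Feature 0 carries the signal; features 1 and 2 are pure noise.
rng = random.Random(2)
y = [i % 2 for i in range(40)]
X = [[yi + rng.gauss(0, 0.5), rng.gauss(0, 1), rng.gauss(0, 1)] for yi in y]
votes = bootstrap_feature_votes(X, y)
```

Aggregating over resamples is what makes the selection robust when the signal-to-noise ratio is low: a noise feature may win one resample by chance, but rarely wins many.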

Theoretical studies of feature and pattern selection have been pursued, in relation with the PASCAL Network of Excellence. The general problem of type I and type II errors in simultaneous hypothesis testing (respectively corresponding to the selection of irrelevant hypotheses and the pruning of relevant ones) is thoroughly investigated. One conference paper [35] and one book chapter [34] have been accepted for publication on quality measures and the statistical validation of pattern extraction. They involve bootstrap estimates for the selection of non-independent rules, as arises in rule extraction, together with surveys of standard quality measures.

A Pascal Theoretical Challenge was launched by O. Teytaud et al. (http://www.lri.fr/~teytaud/risq/). The Challenge workshop is scheduled for May 14-15th in Paris.

#### Resampling algorithms

Resampling (e.g., the bootstrap) is a well-known stochastic technique for building more robust estimates. Basically, the idea is to use several subsamples of the whole sample set in order to i) estimate confidence intervals; ii) reduce the learning bias; or iii) reduce the computational cost. In practice however, resampling must achieve a tradeoff between the stability improvement and the overall computational cost.

An original approach was proposed, based on quasi-random resampling and
inspired by low-discrepancy sequences [47]. While quasi-random
sequences are commonly used e.g. in [0, 1]^{d} and can be defined for various continuous distributions (through the use of copula functions), they are
not straightforward to define in discrete spaces. The goal is to build M subsamples
of a sample of size N such that they are more uniformly distributed than
if drawn independently and uniformly.
The proposed approach is based on rewriting bootstrap laws using
multinomial laws and cumulative distribution functions. The generality
of the approach is demonstrated by its application to cross-validation,
BSFD (a data-mining algorithm for simultaneous hypothesis testing), and
bagging (an ensemble learning method), with stability improvements in each case.
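A simpler classical device gives the flavor of "more uniform than i.i.d." resampling: the balanced bootstrap, in which every example appears exactly M times across the M subsamples. This is not the quasi-random, multinomial-CDF construction of [47], only a well-known relative that illustrates the goal.

```python
import random

def balanced_subsamples(n, m, seed=0):
    """Balanced bootstrap: m subsamples of size n in which each of the n
    example indices occurs exactly m times overall, so the resamples cover
    the data more uniformly than independent uniform draws would."""
    rng = random.Random(seed)
    pool = list(range(n)) * m        # each index replicated m times
    rng.shuffle(pool)                # then randomly dealt out...
    return [pool[k * n:(k + 1) * n] for k in range(m)]  # ...into m blocks

subs = balanced_subsamples(n=10, m=5)
```

With i.i.d. bootstrap draws, about 37% of the examples would be missing from any given subsample and global coverage would fluctuate; the balanced construction removes that source of variance at no extra cost.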

#### Relational learning and phase transitions

Relational Learning, a.k.a. Inductive Logic Programming (ILP), is concerned with learning from relational examples, such as chemical molecules (graphs) or XML data (trees), and/or learning structured hypotheses, such as toxicological patterns (graphs, sequences) or dimensional differential equations (mechanical models).

One additional difficulty of learning in structured domains is that the covering test, which checks whether a given hypothesis covers an example (theta-subsumption), is equivalent to an NP-hard constraint satisfaction problem (CSP). A highly efficient theta-subsumption algorithm, Django (still the best to date), based on reformulating theta-subsumption as a binary CSP and using dedicated data structures, has been devised by Jérôme Maloberti (see Section 5.6).
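A toy version of the covering test makes the CSP connection concrete: matching each clause literal to an example literal under a consistent variable substitution is exactly a constraint-satisfaction search. The sketch below is a naive backtracking matcher, far simpler (and far slower) than Django; the literal encoding is an arbitrary choice.

```python
def subsumes(clause, example):
    """Theta-subsumption as backtracking constraint satisfaction: does some
    substitution of the clause's variables map every clause literal onto an
    example literal? Literals are tuples ('pred', arg, ...); arguments
    starting with an uppercase letter are variables."""

    def unify(lit, fact, theta):
        """Extend substitution theta so that lit matches fact, or return None."""
        if lit[0] != fact[0] or len(lit) != len(fact):
            return None
        theta = dict(theta)
        for a, b in zip(lit[1:], fact[1:]):
            if a[0].isupper():                    # variable argument
                if theta.get(a, b) != b:
                    return None                   # inconsistent binding
                theta[a] = b
            elif a != b:                          # constant mismatch
                return None
        return theta

    def search(lits, theta):
        if not lits:
            return True                           # all literals matched
        return any(search(lits[1:], t) for fact in example
                   if (t := unify(lits[0], fact, theta)) is not None)

    return search(list(clause), {})

# A length-2 path clause subsumes a two-edge chain graph,
# but a self-loop clause does not.
path = [("edge", "X", "Y"), ("edge", "Y", "Z")]
graph = [("edge", "a", "b"), ("edge", "b", "c")]
```

The worst case of this search is exponential in the number of clause variables, which is precisely why reformulation and CSP heuristics such as those in Django matter in practice.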

As for CSPs, the worst-case complexity framework proves exceedingly
pessimistic and of little use for ILP. For this reason, the statistical
complexity framework based on order parameters, first developed
for CSPs and referred to as the *phase transition* paradigm, has been
transported to ILP (coll. L. Saitta and A. Giordana, U. Piemonte, Italy),
with many important results about the scalability of ILP.
This work was extended to the grammatical inference framework in 2005
(N. Pernot, A. Cornuéjols, M. Sebag: Phase Transitions within Grammatical Inference.
In Proc. Int. Joint Conf. on Artificial Intelligence, IJCAI 2005, L. Kaelbling Ed., IOS Press, pp. 811-816).

On-going work by Raymond Ros (PhD student under A. Cornuéjols' and M. Sebag's supervision) resumes the above study, applied to the screening of molecules for the ACCAMBA IMPBIO ACI and exploring the representation of molecules as SMILES sequences.

The so-called phase transition paradigm relies on the definition of order
parameters
(e.g. tightness and density of the constraints, size of clauses and examples
in relational learning, alphabet size and number of states in finite-state automata),
and studies the empirical behavior of the algorithm at hand through extensive
experiments on random problems built uniformly from the order parameters.
The results of such studies can most conveniently be summarized through the
algorithm's *competence
map*; these competence maps in turn provide a principled way of selecting
the algorithm that is most appropriate on
average, conditionally on the position of the problem at hand in the order-parameter space.
On-going experiments aim at building the competence
maps of learning algorithms used as function-value estimators in the OpenDP framework
(see Section 5.7).

It must be emphasized that this approach significantly differs from an analytical algorithmic study; instead, it postulates that truly efficient algorithms pack together many heuristics, whose interaction is hardly amenable to analytical modeling. An empirical framework originating from the natural and physical sciences is therefore a useful tool for determining the regions of the problem space where an algorithm generally fails or succeeds.
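The methodology is easy to reproduce on the textbook case of random 3-SAT, whose order parameter is the clause/variable ratio: instances drawn well below the critical ratio (about 4.27) are almost surely satisfiable, those well above it almost surely not. A minimal experiment, with brute-force checking and arbitrary instance sizes:

```python
import random
from itertools import product

def random_3sat(n_vars, n_clauses, rng):
    """A random 3-SAT instance: each clause picks 3 distinct variables
    and random signs (True = positive literal)."""
    return [[(v, rng.choice((True, False)))
             for v in rng.sample(range(n_vars), 3)]
            for _ in range(n_clauses)]

def satisfiable(n_vars, clauses):
    """Brute-force satisfiability check -- fine for small n_vars."""
    return any(all(any(assign[v] == sign for v, sign in clause)
                   for clause in clauses)
               for assign in product((True, False), repeat=n_vars))

def sat_fraction(n_vars, ratio, trials, rng):
    """Empirical probability of satisfiability at a given clause/variable
    ratio -- the order parameter of the 3-SAT phase transition."""
    m = round(ratio * n_vars)
    return sum(satisfiable(n_vars, random_3sat(n_vars, m, rng))
               for _ in range(trials)) / trials

rng = random.Random(0)
low = sat_fraction(n_vars=10, ratio=2.0, trials=30, rng=rng)   # below threshold
high = sat_fraction(n_vars=10, ratio=7.0, trials=30, rng=rng)  # above threshold
```

Sweeping the ratio and plotting the satisfiable fraction (and the solver's running time) over many such random draws is exactly the kind of empirical study that a competence map summarizes.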

#### Exploration vs Exploitation and Multi-armed Bandits

Many problems can be cast as an Exploration *vs* Exploitation dilemma, where one wants both to
identify the best action (and must thus explore the set of actions) and to maximize the current reward (and
thus wants to play the best action identified so far).

The maximization of the cumulated reward, referred to as the Multi-Armed Bandit problem, has been intensively studied in Game Theory and Machine Learning; an optimal algorithm dubbed UCB (Upper Confidence Bound) has been proposed by (Auer et al. 2002). Its extension to tree-structured options, referred to as UCT, has been proposed by (Kocsis and Szepesvári 2006).
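For concreteness, UCB1 itself fits in a few lines: pull each arm once, then always play the arm maximizing its empirical mean plus the sqrt(2 ln t / n_i) confidence bonus. The Bernoulli arms below are a toy instance; UCT applies this same selection rule recursively at each node of a search tree.

```python
import math
import random

def ucb1(reward_fns, horizon, seed=0):
    """UCB1 (Auer et al. 2002): after pulling each arm once, always play the
    arm maximizing empirical mean + sqrt(2 ln t / n_i). Returns pull counts."""
    rng = random.Random(seed)
    k = len(reward_fns)
    counts, sums = [0] * k, [0.0] * k
    for t in range(1, horizon + 1):
        if t <= k:
            arm = t - 1                           # initialization phase
        else:
            arm = max(range(k), key=lambda i: sums[i] / counts[i]
                      + math.sqrt(2 * math.log(t) / counts[i]))
        counts[arm] += 1
        sums[arm] += reward_fns[arm](rng)         # observe a stochastic reward
    return counts

# Two hypothetical Bernoulli arms with success rates 0.3 and 0.7.
arms = [lambda rng: float(rng.random() < 0.3),
        lambda rng: float(rng.random() < 0.7)]
pulls = ucb1(arms, horizon=2000)
```

The confidence bonus shrinks as an arm is pulled, so suboptimal arms are sampled only logarithmically often, which is what yields UCB's optimal regret rate.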

During his internship, Yizao Wang, jointly supervised by S. Gelly and O. Teytaud together with R. Munos and P.-A. Coquelin from the Center for Applied Maths at École Polytechnique, built a computer Go program based on UCT, named MoGo. MoGo has been extremely successful: it won the last four KGS computer-Go tournaments (http://www.weddslist.com/kgs/past/index.html) and has been ranked first among 142 programs in the championship since August 2006 (http://cgos.boardspace.net/9x9.html).

MoGo was presented at the Demos session at NIPS 2006, with an oral presentation at the Online Trading of Exploration and Exploitation NIPS Workshop 2006 [45], [42]. It must be emphasized that the game of Go has replaced the game of Chess as the touchstone of modern AI; the extreme difficulty of Go is due to i) the lack of a reliable evaluation function; ii) a huge branching factor.

The Exploration *vs* Exploitation dilemma has also been studied in fast-changing dynamic environments, motivated by news
recommendation. This application was explored as a Challenge of the Pascal Network of Excellence, proposed by the Clarity
Touch Company (http://www.pascal-network.org/Challenges/EEC/). The Adapt-EVE algorithm proposed by C. Hartland and S. Gelly (both PhD students under N. Bredèche's and M. Sebag's supervision), N. Baskiotis (PhD student under M. Sebag's supervision), O. Teytaud
and M. Sebag won the prize from the Clarity Touch Company and was presented at the Online Trading of Exploration and
Exploitation NIPS Workshop 2006 [46]. Adapt-EVE combines UCB with i) a standard
change-point detection test based on Page-Hinkley statistics; ii) a transient strategy, referred to as Meta-Bandit, handling the
aftermath of a change-point detection; iii) a discount strategy allowing for more forgetful bandits.
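The Page-Hinkley ingredient can be sketched in a few lines: it tracks the cumulative deviation of observations from their running mean and raises an alarm when that statistic falls far below its own past maximum. The variant below detects a drop in the mean; the delta (tolerance) and lam (threshold) values are illustrative, not Adapt-EVE's actual settings.

```python
class PageHinkley:
    """Page-Hinkley change-point test for a drop in the mean of a stream.
    Standard textbook formulation; delta and lam are illustrative values."""

    def __init__(self, delta=0.05, lam=2.0):
        self.delta, self.lam = delta, lam
        self.n, self.mean = 0, 0.0
        self.m, self.m_max = 0.0, 0.0

    def update(self, x):
        """Feed one observation; return True once a change-point is flagged."""
        self.n += 1
        self.mean += (x - self.mean) / self.n    # incremental running mean
        self.m += x - self.mean + self.delta     # cumulative deviation statistic
        self.m_max = max(self.m_max, self.m)
        return self.m_max - self.m > self.lam    # far below its past maximum?

ph = PageHinkley(delta=0.05, lam=2.0)
stream = [0.8] * 50 + [0.2] * 50                 # the reward mean drops at step 50
alarm_at = next(t for t, x in enumerate(stream, 1) if ph.update(x))
```

The delta term keeps the statistic drifting slightly upward while the mean is stable, so false alarms are avoided; lam trades detection delay against false-alarm rate.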

Last, the Exploration *vs* Exploitation dilemma was considered in the framework of Statistical Software Testing. A previous
approach, pioneered by A. Denise, M.-C. Gaudel and S. Gouraud, built test sets by uniformly sampling the paths in the control
flow graph of the program. Its limitation is that, for large programs, a huge fraction of program paths are infeasible
(no input values would exercise the path). A generative approach, iteratively exploiting and updating a distribution
on the program paths, was proposed by N. Baskiotis and M. Sebag; the gain is about two orders of magnitude compared to the
previous approach (N. Baskiotis, M. Sebag, M.-C. Gaudel, S.-D. Gouraud: A Machine Learning Approach For Statistical Software Testing. In 20th International Joint Conference on Artificial Intelligence, IJCAI 2007, to appear) (invited keynote speech at *Learning Dialogue*, Barcelona 2006).
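The uniform-sampling baseline itself is easily sketched: count the paths from every node to the exit, then walk the graph choosing each successor with probability proportional to its path count. The control-flow DAG below is a hypothetical toy; real programs add the path-infeasibility problem that motivates the generative approach.

```python
import random
from functools import lru_cache

def uniform_path_sampler(graph, start, end, seed=0):
    """Uniform sampling of start->end paths in a control-flow DAG via path
    counting: at each node, pick a successor with probability proportional
    to the number of paths through it (a sketch of the counting approach,
    not the actual tool of Denise, Gaudel and Gouraud)."""
    rng = random.Random(seed)

    @lru_cache(maxsize=None)
    def count(node):
        """Number of distinct paths from node to the exit."""
        if node == end:
            return 1
        return sum(count(s) for s in graph.get(node, ()))

    def sample():
        node, path = start, [start]
        while node != end:
            succs = graph[node]
            weights = [count(s) for s in succs]   # paths through each successor
            node = rng.choices(succs, weights=weights)[0]
            path.append(node)
        return tuple(path)

    return sample

# Hypothetical control-flow DAG with exactly 3 distinct entry->exit paths.
g = {"entry": ["a", "b"], "a": ["exit"], "b": ["c", "exit"], "c": ["exit"]}
sample = uniform_path_sampler(g, "entry", "exit", seed=3)
draws = [sample() for _ in range(3000)]
```

Because the successor weights are exact path counts, every complete path is drawn with probability 1/(total number of paths), here 1/3 each.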