Section: New Results
Keywords: AUC-based Learning, Feature Selection, Human-Computer Interaction, Visual Data Mining, Methodological Aspects, Meta-learning, Competence Maps, Inductive Logic Programming, Constraint Satisfaction, Phase Transitions, Bounded Relational Reasoning.
Fundamentals of Machine Learning, Knowledge Extraction and Data Mining
Abstract: This theme focuses on machine learning, knowledge discovery and data mining (ML/KDD/DM), investigating: i) the learning criteria; ii) the selection of features and hypotheses; iii) the randomized and quasi-randomized selection of examples; iv) the specificities of relational learning, in relation with phase transitions; v) the Multi-Armed Bandit framework.
Many activities below will refer to the PASCAL (Pattern Analysis, Statistical Modelling and Computational Learning) Network of Excellence (http://www.pascal-network.org ), 2003-2007, which involves most major research groups in ML in Europe, including TAO and SEQUEL from INRIA Futurs. M. Sebag, in charge of the Université Paris-Sud site in Pascal, is manager of the Pascal Challenge programme and member of the Pascal Steering Committee.
New learning criteria
The flexible and effective evolutionary computation (EC) framework enables us to explore new, non-convex learning criteria. The combinatorial Area Under the ROC Curve (AUC) criterion is optimized by the ROC-based GEnetic learneR (ROGER) algorithm, which has been successfully applied to bioinformatics and text mining, e.g. to rank candidate terms during the terminology extraction step (Preference Learning in Terminology Extraction: A ROC-based approach. J. Aze, M. Roche, Y. Kodratoff, M. Sebag. In Applied Stochastic Models and Data Analysis (ASMDA), 2005). ROGER has been extended to other criteria, motivated by low-quality datasets in bioinformatics and inspired by the energy-based learning framework proposed by Y. Le Cun (2006); these criteria are investigated by A. Rimmel (PhD student under A. Cornuéjols' and M. Sebag's supervision). Along the same line, the search for stable patterns in spatio-temporal data mining was formalized as a multi-objective multi-modal optimization problem and applied to functional brain imaging (A Multi-Objective Multi-Modal Optimization Approach for Mining Stable Spatio-Temporal Patterns. M. Sebag, N. Tarrisson, O. Teytaud, S. Baillet and J. Lefevre. In Proc. Int. Conf. on Artificial Intelligence, IJCAI 2005, L. Kaelbling Ed, 2005. IOS Press, pp 859-864) (ACI NIM NeuroDyne contract, in coll. with Hôpital La Pitié Salpétrière, LENA). In this perspective, the discriminant power of a hypothesis/pattern is handled as one objective among others (invited keynote speech at COGIS 2006).
Interestingly, EC can most naturally be used to generate many hypotheses (each run provides a new hypothesis, conditionally independent of the others given the dataset). These hypotheses can be plugged at no extra cost into an ensemble learning setting, with significant gains in accuracy.
This line of research differs from mainstream ML, which mostly considers convex criteria for the sake of solution uniqueness and optimization feasibility. Interestingly, while recent advances in ML have integrated the AUC criterion into the convex optimization setting through a quadratic number of constraints (Joachims 2005), greedy heuristics must then be used to keep the computational cost within acceptable limits.
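Concretely, the AUC of a scoring function reduces to the Wilcoxon-Mann-Whitney statistic, i.e. the fraction of (positive, negative) example pairs ranked correctly; the quadratic number of pairs is what yields the quadratic number of constraints mentioned above. A minimal illustrative sketch (not the ROGER implementation; function and variable names are ours):

```python
def auc(scores_pos, scores_neg):
    """Area Under the ROC Curve as the Wilcoxon-Mann-Whitney statistic:
    the fraction of (positive, negative) pairs ranked correctly,
    with ties counted as one half."""
    wins = 0.0
    for sp in scores_pos:          # one comparison per pair: O(n_pos * n_neg)
        for sn in scores_neg:
            if sp > sn:
                wins += 1.0
            elif sp == sn:
                wins += 0.5
    return wins / (len(scores_pos) * len(scores_neg))
```

A perfect ranker scores every positive example above every negative one and reaches AUC = 1, while random scoring yields AUC around 0.5.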
Criteria and bounds
Learning Bayesian Networks (BN) mixes non-parametric and parametric learning, as one must identify the structure of the network together with its weights (conditional dependency tables). S. Gelly and O. Teytaud have proposed a new complexity measure, accounting for the non-parametric complexity besides the standard number of weights.
Furthermore, a loss-based criterion has been proposed for the parametric learning task. While this criterion is more computationally demanding, it is more stable than the standard one, and should be preferred in particular when dealing with small datasets.
Empirical results demonstrate substantial improvements compared to the state of the art, even in the limit of large samples. Other results, combining the above with classical learning theory, include proofs of convergence to a minimal sufficient structure.
Selection of Features and Patterns
Feature Selection arises as a pre-processing selection task for ML, while Pattern/Hypothesis Selection is viewed as a post-processing selection task in ML or DM.
Irrelevant features severely hinder the learning task, in terms of computational cost as well as predictive accuracy, particularly when the number of examples is small and/or when the signal-to-noise ratio is low, as is the case for micro-array analysis. Two methods inspired by ensemble methods were proposed for feature selection in 2005. The theoretical study of these approaches is tackled by Romaric Gaudel (PhD student under M. Sebag's and A. Cornuéjols's supervision).
Theoretical studies about feature and pattern selection have been pursued, in relation with the PASCAL Network of Excellence. The general problem of type I and type II errors in simultaneous hypothesis testing (respectively corresponding to the selection of irrelevant hypotheses and the pruning of relevant ones) has been thoroughly investigated. One conference paper and one book chapter on the quality measures and statistical validation of pattern extraction have been accepted for publication; they involve bootstrap estimates for the selection of non-independent rules, as in rule extraction, together with surveys of standard quality measures.
A Pascal Theoretical Challenge was launched by O. Teytaud et al. (http://www.lri.fr/~teytaud/risq/ ). The Challenge workshop is scheduled for May 14-15 in Paris.
Resampling (e.g., bootstrap) is a well-known stochastic technique for building more robust estimates. Basically, the idea is to use several subsamples of the whole sample set in order to i) estimate confidence intervals; ii) reduce the learning bias; or iii) reduce the computational cost. In practice however, resampling must achieve some tradeoff between the stability improvement achieved and the overall computational cost.
An original approach was proposed, based on quasi-random resampling and inspired by low-discrepancy sequences. While quasi-random sequences are commonly used e.g. in [0,1]^d and can be defined for various continuous distributions (through the use of copula functions), they are not straightforward to define in discrete spaces. The goal is to build M subsamples of a sample of size N such that they are more uniformly distributed than if drawn independently and uniformly. The proposed approach is based on rewriting bootstrap laws using multinomial laws and cumulative distribution functions. The generality of the approach is demonstrated by its application to cross-validation, BSFD (a data-mining algorithm for simultaneous hypothesis testing), and bagging (an ensemble learning method), with stability improvements in each case.
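For reference, the plain independent bootstrap that quasi-random resampling improves upon can be sketched as follows (a hypothetical minimal example, not the proposed algorithm; all names are ours):

```python
import random

def bootstrap_ci(sample, stat, n_boot=2000, alpha=0.05, seed=0):
    """Plain bootstrap: draw n_boot resamples with replacement,
    compute the statistic on each, and return the empirical
    (alpha/2, 1 - alpha/2) confidence interval."""
    rng = random.Random(seed)
    n = len(sample)
    stats = sorted(stat([rng.choice(sample) for _ in range(n)])
                   for _ in range(n_boot))
    lo = stats[int(n_boot * alpha / 2)]
    hi = stats[int(n_boot * (1 - alpha / 2)) - 1]
    return lo, hi
```

Quasi-random resampling replaces the independent draws above by a low-discrepancy scheme over the multinomial resampling law, so that the M subsamples cover the sample space more evenly than independent draws do.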
Relational learning and phase transitions
Relational Learning, a.k.a. Inductive Logic Programming (ILP), is concerned with learning from relational examples such as chemical molecules (graphs), XML data (trees), and/or learning structured hypotheses such as toxicological patterns (graphs, sequences) or dimensional differential equations (mechanical models).
One additional difficulty of learning in structured domains is that the covering test, which checks whether a given hypothesis covers an example (theta-subsumption), is equivalent to an NP-hard constraint satisfaction problem (CSP). A highly efficient theta-subsumption algorithm, Django (still the best to date), based on the reformulation of theta-subsumption as a binary CSP and using specific data structures, has been devised by Jérôme Maloberti (see Section 5.6).
As for CSPs, the worst-case complexity framework proves exceedingly pessimistic and of little use for ILP. For this reason, the statistical complexity framework based on the use of order parameters, first developed for CSPs and referred to as the phase transition paradigm, has been transferred to ILP (coll. L. Saitta and A. Giordana, U. Piemonte, Italy), with many important results about the scalability of ILP. This work was extended to the grammatical inference framework in 2005 (Phase Transitions within Grammatical Inference. N. Pernot, A. Cornuéjols and M. Sebag. In Proc. Int. Conf. on Artificial Intelligence, IJCAI 2005, L. Kaelbling Ed, 2005. IOS Press, pp 811-816).
On-going work by Raymond Ros (PhD student under A. Cornuéjols' and M. Sebag's supervision) applies the above study to the screening of molecules for the ACCAMBA IMPBIO ACI, exploring the representation of molecules as SMILES sequences.
The so-called phase transition paradigm relies on the definition of order parameters (e.g. tightness and hardness of the constraints, size of clauses and examples in relational learning, alphabet size and number of states in finite state automata), and studies the empirical behavior of the algorithm at hand through extensive experiments on random problems built uniformly from the order parameters. The results of such studies can most conveniently be summarized through the algorithm's Competence Map; these competence maps in turn provide a principled way of selecting the algorithm most appropriate on average, conditioned on the position of the problem at hand. On-going experiments aim at building the competence maps of learning algorithms used as function value estimators in the OpenDP framework (see Section 5.7).
It must be emphasized that this approach significantly differs from an analytical algorithmic study; instead, it postulates that truly efficient algorithms pack together many heuristics, whose interaction is hardly amenable to analytical modeling. Therefore, an empirical framework originating from the natural and physical sciences is a useful tool for determining the regions of the problem space where an algorithm generally fails or succeeds.
Exploration vs Exploitation and Multi-armed Bandits
Many problems can be cast as an Exploration vs Exploitation dilemma, where one wants both to identify the best action (and must thus explore the set of actions) and to maximize the current reward (and thus wants to play the best action identified so far).
The maximization of the cumulated reward, referred to as the Multi-Armed Bandit problem, has been intensively studied in game theory and machine learning; an optimal algorithm dubbed UCB (Upper Confidence Bound) was proposed by (Auer et al. 2002). Its extension to tree-structured options, referred to as UCT, was proposed by (Kocsis and Szepesvári 2006).
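The UCB1 index of (Auer et al. 2002) plays, at each step, the arm maximizing the empirical mean reward plus an exploration bonus sqrt(2 ln t / n_i). A minimal sketch on Bernoulli arms (a simplified illustration, not the MoGo/UCT code; function and parameter names are ours):

```python
import math
import random

def run_ucb1(arm_means, horizon, rng):
    """UCB1 on Bernoulli arms: play each arm once, then always pull
    the arm maximizing (empirical mean + sqrt(2 ln t / n_i))."""
    k = len(arm_means)
    rewards, pulls = [0.0] * k, [0] * k
    for i in range(k):                      # initialization: one pull per arm
        rewards[i] += float(rng.random() < arm_means[i])
        pulls[i] += 1
    for t in range(k + 1, horizon + 1):
        i = max(range(k),
                key=lambda j: rewards[j] / pulls[j]
                + math.sqrt(2.0 * math.log(t) / pulls[j]))
        rewards[i] += float(rng.random() < arm_means[i])
        pulls[i] += 1
    return pulls                            # number of pulls per arm
```

On two arms with means 0.2 and 0.8, the pull counts concentrate on the better arm, while the exploration bonus guarantees that the weaker arm is still pulled a logarithmic number of times.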
During his internship, Yizao Wang, jointly supervised by S. Gelly and O. Teytaud, and by R. Munos and P.A. Coquelin from the Center for Applied Maths at Ecole Polytechnique, built a computer Go program based on UCT, named MoGo. MoGo has been extremely successful: it won the last four KGS computer-Go tournaments (http://www.weddslist.com/kgs/past/index.html ) and has been ranked first among the 142 programs in the championship since August 2006 (http://cgos.boardspace.net/9x9.html ).
MoGo was presented in the Demos session at NIPS 2006, with an oral presentation at the Online Trading of Exploration and Exploitation NIPS Workshop 2006. It must be emphasized that the game of Go has replaced the game of Chess as the touchstone of modern AI; the extreme difficulty of Go is due to i) the lack of a reliable evaluation function; ii) a huge branching factor.
The Exploration vs Exploitation dilemma has also been studied in fast dynamic environments, motivated by news recommendation. This application was explored through a Challenge of the Pascal Network of Excellence proposed by the Touch Clarity company (http://www.pascal-network.org/Challenges/EEC/ ). The Adapt-EVE algorithm, proposed by C. Hartland and S. Gelly (both PhD students under N. Bredèche's and M. Sebag's supervision), N. Baskiotis (PhD student under M. Sebag's supervision), O. Teytaud and M. Sebag, won the prize from the Touch Clarity company and was presented at the Online Trading of Exploration and Exploitation NIPS Workshop 2006. Adapt-EVE combines UCB with i) a standard change-point detection test based on the Page-Hinkley statistics; ii) a transient strategy, referred to as Meta-Bandit, handling the aftermath of a change-point detection; iii) a discount strategy, allowing for more forgetful bandits.
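The Page-Hinkley test used in step i) monitors the cumulative deviation of the observed rewards from their running mean and raises an alarm when this cumulative sum drifts too far above its running minimum. A minimal sketch (threshold and tolerance values are illustrative, not the Adapt-EVE settings):

```python
def page_hinkley(stream, delta=0.005, threshold=2.0):
    """Page-Hinkley change-point detection: accumulate the deviations of
    each observation from the running mean (minus a tolerance delta);
    signal a change when the sum exceeds its minimum by `threshold`."""
    mean = 0.0      # running mean of the stream
    m_t = 0.0       # cumulative deviation
    m_min = 0.0     # running minimum of m_t
    for t, x in enumerate(stream, 1):
        mean += (x - mean) / t
        m_t += x - mean - delta
        m_min = min(m_min, m_t)
        if m_t - m_min > threshold:
            return t            # change detected at step t
    return None                 # no change detected
```

On a stream whose mean jumps from 0 to 1 at step 50, the alarm fires a few steps after the jump; upon detection, Adapt-EVE hands control to its Meta-Bandit transient strategy.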
Last, the Exploration vs Exploitation dilemma was considered in the framework of statistical software testing. A previous approach, pioneered by A. Denise, M.-C. Gaudel and S. Gouraud, built test sets by uniformly sampling the paths in the control flow graph of the program. Its limitation is that, for large programs, a huge fraction of program paths are infeasible (no input values exercise the path). A generative approach, iteratively exploiting and updating a distribution on the program paths, was proposed by N. Baskiotis and M. Sebag; the gain is about two orders of magnitude compared to the previous approach (A Machine Learning Approach For Statistical Software Testing. Nicolas Baskiotis, Michèle Sebag, Marie-Claude Gaudel, Sandrine-Dominique Gouraud. 20th International Joint Conference on Artificial Intelligence, 2007, to appear) (invited keynote speech at Learning Dialogue, Barcelona 2006).