Section: New Results
Keywords: AUC-based Learning, Feature Selection, Human-Computer Interaction and Visual Data Mining, Methodological Aspects, Meta-learning and Competence Maps, Inductive Logic Programming, Constraint Satisfaction and Phase Transition, Bounded Relational Reasoning, Phase Transitions.
Fundamentals of Machine Learning, Knowledge Extraction and Data Mining
Abstract: This theme focuses on machine learning, knowledge discovery and data mining (ML/KDD/DM) considered as optimisation problems, and particularly on the key issues of the search space/hypothesis language, and the learning criteria.
EC-based Learning and Mining
TAO participates in the PASCAL (http://www.pascal-network.org) (Pattern Analysis, Statistical Modelling and Computational Learning, 2003-2007) Network of Excellence, which includes the major European research centers in Machine Learning; M. Sebag is responsible for the UPS site, which includes the TAO and SELECT projects, and is a member of the PASCAL Steering Committee as manager of the Challenge Programme.
While mainstream Machine Learning considers quadratic learning criteria and well-posed optimisation problems, e.g. the structural risk minimization underlying kernel methods, our expertise in evolutionary computation allows us to consider non-convex optimisation criteria such as the Wilcoxon statistic, a.k.a. the area under the Receiver Operating Characteristic (ROC) curve. The ROC curve, which describes the trade-off between the two types of error of a hypothesis, is well suited to imbalanced example distributions and cost-sensitive learning.
Our experiments with the evolutionary optimisation of the area under the ROC curve (AUC) have shown very good learning performance compared to prominent approaches such as Support Vector Machines. The ROGER algorithm (ROC-based GEnetic learneR) has been successfully applied to bio-informatics, more particularly to ranking candidate solutions for a protein docking problem, and to text mining, ranking candidate terms during the terminology extraction step.
Although the optimization of the AUC criterion has been formalized as a constrained quadratic optimization problem (Joachims 2005), it must be emphasized that this approach is quadratic in the number N of examples; in contrast, ROGER's complexity is in N log N.
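The complexity gap comes from the equivalence between the AUC and the normalized Wilcoxon-Mann-Whitney statistic: once the examples are sorted by score, the statistic follows from the rank sum of the positives. A minimal sketch of this computation (ours, not ROGER itself; ties are ignored for simplicity):

```python
# AUC as the normalized Wilcoxon-Mann-Whitney statistic, computed in
# O(N log N) time: one sort, then a linear pass (ties ignored for simplicity).

def auc_wilcoxon(scores, labels):
    """scores: hypothesis outputs; labels: 1 for positive, 0 for negative."""
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    # Sum the ranks (1-based, ascending score) of the positive examples.
    rank_sum = sum(rank for rank, i in enumerate(order, 1) if labels[i] == 1)
    # Wilcoxon U: number of (positive, negative) pairs ranked correctly.
    u = rank_sum - n_pos * (n_pos + 1) / 2
    return u / (n_pos * n_neg)

print(auc_wilcoxon([0.9, 0.8, 0.3, 0.1], [1, 1, 0, 0]))  # 1.0: perfect ranking
print(auc_wilcoxon([0.9, 0.2, 0.5, 0.1], [1, 0, 0, 1]))  # 0.5
```

A quadratic formulation, by contrast, must reason over all N² (positive, negative) pairs explicitly.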
Along the same lines, the search for stable patterns in spatio-temporal data mining was formalized as a multi-objective, multi-modal optimization problem, applied to functional brain imaging and presented at IJCAI 2005 (ACI NIM NeuroDyne contract, in collaboration with Hôpital La Pitié Salpétrière, LENA).
Feature Selection (FS)
Most available databases were not constructed with data mining in mind, and they usually involve many features that are irrelevant to the learning task at hand. Irrelevant features not only significantly increase the computational and memory resources needed; they may also mislead the search for a good hypothesis, ultimately resulting in poor predictive accuracy. Feature Selection (FS) is thus recognized as a central task for ML/KDD/DM applications, particularly in bio-informatics. In collaboration with the bio-informatics team of LRI and INSERM, a novel algorithm inferring the relevance of attributes from the structure and parameters of hypotheses was proposed; this work resulted in the discovery of a biological process likely to occur as a response to weak radiation exposure (G. Mercier, N. Berthault, J. Mary, A. Antoniadis, J-P. Comet, A. Cornuéjols, Ch. Froidevaux, and M. Dutreix. Biological detection of low radiation by combining results of two analysis methods. Nucleic Acids Research, 32:1, pp 1–8, 2004).
In response to the uncertainty that plagues feature ranking on data with a very low signal-to-noise ratio, as in genomics, a new method was developed. It rests on a new way of measuring the correlation between ranking techniques: relying on maximum likelihood, it combines the results of two or more ranking methods to obtain a high-precision estimate of the number of relevant features and their identity. The method has been applied to microarray data.
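The combination principle can be pictured on a toy example: two ranking methods that agree on their top features are unlikely to do so by chance, so their agreement carries information about which top-ranked features are genuinely relevant. The sketch below (ours; the actual maximum-likelihood procedure is more elaborate) only measures top-k agreement and takes the consensus:

```python
# Illustrative sketch (not the paper's maximum-likelihood procedure):
# agreement between two feature rankings, and the intersection of their
# top-k lists as a conservative estimate of the relevant features.

def topk_overlap(rank_a, rank_b, k):
    """Fraction of features shared by the top-k of two rankings."""
    return len(set(rank_a[:k]) & set(rank_b[:k])) / k

# Rankings of 8 features produced by two hypothetical scoring methods.
rank_a = [3, 5, 0, 7, 1, 2, 6, 4]
rank_b = [5, 3, 2, 0, 7, 6, 1, 4]

for k in (2, 4, 6):
    print(k, topk_overlap(rank_a, rank_b, k))

# Features both methods place in their top 4: candidate relevant features.
print(sorted(set(rank_a[:4]) & set(rank_b[:4])))  # [0, 3, 5]
```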
Theoretical studies of feature selection have been pursued, firstly in relation to the ACI MIST-R contract (Nouvelles Interfaces des Maths), and secondly within the PASCAL NoE framework.
The MIST-R contract, in collaboration with J.-M. Loubès (Lab. Math., UPS), is concerned with the modelling of road traffic; in this context, almost-sure convergence proofs for feature selection under a VC-penalized scheme were obtained by Merve Amil and Olivier Teytaud, as well as non-convergence proofs for some non-VC-penalized schemes.
Further, a theoretical PASCAL challenge was launched (http://www.lri.fr/~teytaud/risq/) regarding the type I and type II error rates when extracting sets of solutions; in the FS framework, these errors respectively correspond to the selection of irrelevant features and the pruning of relevant ones.
Bayesian networks, a now-classical framework for probability density estimation, involve the identification of both a structure (the dependency graph) and its parameters. Whereas many works have been devoted to the optimization of the structure, only a few papers deal with parameter learning, classically handled with a frequentist approach. A new approach to parameter learning has been proposed (one of the 10 papers suggested for publication in the RIA journal). The results are:
a new complexity measure showing that the number of parameters is not the only important term in the complexity of Bayesian networks; for a fixed number of parameters, the structure of the network has a strong influence, theoretically predicted by a result in the paper above (and confirmed by experimental results in the journal version);
a new criterion for parameter fitting, computationally harder than the usual one (which in general reduces to the frequentist approach), that leads to substantially better results, even in the large-sample limit, in the same way that loss minimization is preferable to likelihood maximization when the model is imperfect.
Some other results, obtained by coupling classical learning theory with the results above, include proofs of convergence to a minimal sufficient structure.
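For reference, the classical frequentist approach amounts to estimating each conditional probability table by counting. The sketch below (illustrative only, with a hypothetical two-node network A → B and made-up data) shows the baseline that the proposed criterion improves upon:

```python
# Frequentist parameter learning for a two-node Bayesian network A -> B:
# conditional probability tables are estimated by counting and normalizing.
from collections import Counter

# Hypothetical observations of the pair (A, B).
data = [(0, 0), (0, 1), (0, 0), (1, 1), (1, 1), (1, 0), (0, 0), (1, 1)]

# P(A): marginal frequencies of A.
count_a = Counter(a for a, _ in data)
p_a = {a: c / len(data) for a, c in count_a.items()}

# P(B | A): count (a, b) pairs and normalize within each value of a.
count_ab = Counter(data)
p_b_given_a = {(a, b): count_ab[(a, b)] / count_a[a] for (a, b) in count_ab}

print(p_a[0])               # 0.5
print(p_b_given_a[(1, 1)])  # 0.75
```

The criterion proposed in the paper replaces this pure likelihood fit with a computationally harder loss-based fit, which matters when the network structure is wrong.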
Relational Learning, a.k.a. Inductive Logic Programming (ILP), is about learning from structured examples, such as chemical molecules (graphs) or XML data (trees), and/or learning structured hypotheses such as toxicological patterns (graphs, sequences) or dimensional differential equations (mechanical models).
The TAO team has developed an internationally acknowledged competence in ILP and Relational Learning; Céline Rouveirol and Michèle Sebag co-chaired the 11th International Conference on Inductive Logic Programming in 2001. Compared to propositional learning and data mining, relational learning faces two additional difficulties. On the one hand, the covering test, which checks whether a given hypothesis covers an example, is equivalent to a constraint satisfaction problem (CSP), with exponential complexity; on the other hand, the search space is doubly exponential.
A principled and computationally effective approach to the covering test, based on the junction of ILP and CSP, was proposed in Jerome Maloberti's PhD. The resulting algorithm, termed Django, shows an improvement of several orders of magnitude over existing algorithms on artificial and real-world problems; this achievement has been cited among the major UPS ones for 2005. Django, available under the GPL (see section 5.4), has been widely used and cited in the literature (coll. with Yokohama University, Japan, U. of Tufts in Arizona, U. of Bari, Italy).
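The covering test in question is θ-subsumption: deciding whether some substitution maps every literal of the hypothesis clause onto a literal of the example, which is exactly a CSP (the clause variables are the CSP variables, the example constants their domains). A naive backtracking version, for illustration only (Django relies on far more sophisticated CSP techniques):

```python
# Theta-subsumption as a CSP, solved by naive backtracking (sketch).

def is_var(term):
    # Convention: uppercase-initial terms are variables, others constants.
    return term[:1].isupper()

def subsumes(clause, example, theta=None):
    """clause/example: lists of literals such as ('bond', 'X', 'Y')."""
    theta = theta or {}
    if not clause:
        return True  # every literal of the clause has been matched
    lit = clause[0]
    for fact in example:
        if fact[0] != lit[0] or len(fact) != len(lit):
            continue  # predicate or arity mismatch
        new, ok = dict(theta), True
        for term, const in zip(lit[1:], fact[1:]):
            if is_var(term):
                if new.setdefault(term, const) != const:
                    ok = False  # variable already bound to another constant
                    break
            elif term != const:
                ok = False  # a constant in the clause must match exactly
                break
        if ok and subsumes(clause[1:], example, new):
            return True
    return False  # backtrack: no consistent substitution found

h = [('bond', 'X', 'Y'), ('atom', 'Y', 'c')]
e = [('bond', 'a1', 'a2'), ('atom', 'a2', 'c'), ('atom', 'a1', 'o')]
print(subsumes(h, e))  # True: theta = {X: a1, Y: a2}
```

The worst case is exponential in the number of clause variables, which is why CSP heuristics (propagation, variable ordering) pay off so dramatically.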
Recent developments, carried out during Alexandre Termier's post-doc in Prof. Motoda's lab at Osaka University, after his PhD co-supervised by M.-C. Rousset and M. Sebag, concern XML mining and more generally tree mining.
Other developments are related to the ACI IMPBIO ACCAMBA project, concerned with molecule screening, using to this aim the relational learner STILL, based on stochastic sampling of the substitution space and initially designed by M. Sebag.
The above-mentioned relationship between relational learning and CSPs led us to extend the phase transition paradigm (Hogg et al., 96) to the ML framework. This study, initiated in collaboration with L. Saitta and A. Giordana from the University of Alessandria, Italy, was exceptionally fruitful: negative results were obtained concerning the scalability of ILP learners (M. Botta, A. Giordana, L. Saitta, and M. Sebag. Relational Learning as Search in a Critical Region. Journal of Machine Learning Research, Vol. 4, pp 431–463, 2003), and the failure region of the prominent FOIL algorithm was identified in 2003. It was shown in 2004 that phase transitions can also be observed in propositional learning (Nicholas Baskiotis and Michèle Sebag. C4.5 Competence Map: a Phase Transition-inspired Approach. Proc. ICML 2004), and the failure region of the widely known C4.5 algorithm was identified.
In 2005, we similarly investigated the grammatical inference framework, whose complexity is intermediate between relational and propositional learning. Unexpected results regarding the behavior of the prominent RedBlue and RPNI grammatical inference algorithms, showing the shortcomings of their stopping criterion, were found and presented at IJCAI 2005; this paper was proposed for submission to the International Journal of Intelligent Information Systems.
This study was continued during Raymond Ros's Master's thesis. Contrary to all expectations, the results obtained from extensive experiments show striking discontinuities in the exploration of the search space in the case of deterministic finite-state automata, and sharp variations in the case of non-deterministic finite-state automata.
The proposed phase transition paradigm considers computational complexity, coverage rate, or learning success as random variables conditioned on some order parameters (e.g. tightness and hardness of the constraints, size of clauses and examples in relational learning, alphabet size and number of states in finite-state automata). The average behavior of the learning operators is observed through extensive experiments, using a problem-sampling mechanism relying on the order parameters. Such studies accordingly allow one to draw the Competence Map of the learners in the landscape defined by the order parameters.
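The sampling methodology can be illustrated on the random binary CSP model itself: take the constraint tightness as the order parameter, sample problems at each value, and record the empirical solubility rate; the sharp drop as tightness grows is the phase transition. A toy, brute-force version (all parameter values below are arbitrary, chosen only to keep the run small):

```python
# Problem sampling along an order parameter (constraint tightness) and
# empirical measurement of the solubility rate: a toy phase-transition study.
import itertools
import random

def random_csp(n_vars, domain, n_constraints, tightness, rng):
    """Random binary CSP: each constraint forbids `tightness` of the d*d pairs."""
    pairs = list(itertools.product(range(domain), repeat=2))
    constraints = []
    for _ in range(n_constraints):
        i, j = rng.sample(range(n_vars), 2)
        forbidden = set(rng.sample(pairs, int(tightness * len(pairs))))
        constraints.append((i, j, forbidden))
    return constraints

def satisfiable(n_vars, domain, constraints):
    # Brute force: feasible only at this toy scale (domain**n_vars assignments).
    for assign in itertools.product(range(domain), repeat=n_vars):
        if all((assign[i], assign[j]) not in forb for i, j, forb in constraints):
            return True
    return False

rng = random.Random(1)
for tightness in (0.2, 0.5, 0.8):  # the order parameter
    sat = sum(satisfiable(6, 3, random_csp(6, 3, 10, tightness, rng))
              for _ in range(20))
    print(tightness, sat / 20)
```

Plotting such empirical rates against the order parameters, for a learner's success instead of plain solubility, is exactly what yields a competence map.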
Such competence maps partially address the critical Meta-Learning problem, widely acknowledged as the main bottleneck for Machine Learning and concerned with determining a priori the best algorithm for a given dataset: indeed, competence maps estimate the average error rate of an algorithm given the characteristics of the dataset. Relatedly, the phase transition approach offers a methodology for the principled validation and verification of learning algorithms (invited talk by M. Sebag at the NIPS 2004 Workshop on Verification and Validation of Learning Algorithms). As an example of the competence-map principle, a computationally expensive, parallel test of the competence maps of learning algorithms used as function value estimators is in progress in the OpenDP framework (see section 5.5).
It must be emphasized that this approach differs significantly from an analytical algorithmic study; instead, it postulates that really efficient algorithms pack together many heuristics, whose interaction is hardly amenable to analytical modeling. An empirical framework originating from the natural and physical sciences is therefore a useful tool for determining the regions of the problem space where an algorithm generally fails or succeeds.
Data Mining on the Grid
Cécile Germain, formerly a member of the Parall group at LRI, joined the TAO project in 2005. Her strong expertise in grid computing, as a member of the EGEE (Enabling Grid for E-Science in Europe) Network of Excellence, opens new and strategic research avenues for Grid-Aware Mining Algorithms.
Cécile Germain currently chairs the ACI MD AGIR contract (http://www.aci-agir.org), started in Sept. 2004, concerned with medical data mining and more precisely with medical imaging through grid computing. Our specific interests in AGIR concern the interactive exploration and retrieval of relevant images and medical exams; these tasks require specific grid services and intelligent prefetch mechanisms supporting the physician's search. A multi-disciplinary project, AGIR gathers researchers in computer science, physics and medicine from CNRS, universities, INRIA, INSERM and hospitals.
A general architecture for interactive grid access has been designed and is being integrated into the EGEE software (http://egee-na4.ct.infn.it/wiki/index.php/ShortJobs). The application to volume reconstruction, which grid-enables the PTM3D software developed at LIMSI, has been part of the first EGEE review and of the HealthGrid demonstrations at SC'05.
An on-going project termed DEMAIN (Des DonnéEs MAssives Aux InterpretatioNs, i.e. from massive data to interpretations), centered on e-Science (http://www.lri.fr/~cecile/DEMAIN/DemainSc.htm), is a PPF proposal to Paris-Sud University (approval pending). DEMAIN, which gathers computer scientists, mathematicians (Lab. de Mathématiques), physicists (Lab. Accélérateur Linéaire), and biologists (IBBMC), aims to develop principled and efficient algorithms for dealing with huge amounts of data describing complex natural or artificial systems, in the absence of ground truth. Among the first applications considered is the control of the EGEE system itself (a result-checking algorithm based on Wald's sequential test for a typical class of grid computations, namely Monte-Carlo simulations), and the analysis of its failures from the grid logs (Samuel Deberles's master's thesis); another application concerns the analysis of the Auger experiment, in collaboration with A. Cordier (LAL). This junction between LAL and the TAO group was instrumental in the recruitment of Balazs Kegl, CNRS CR1, a researcher in machine learning at LAL.
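Wald's sequential test, the basis of the result-checking algorithm mentioned above, can be sketched as follows for a Bernoulli error rate: spot-check results one by one and stop as soon as the accumulated log-likelihood ratio crosses a decision threshold. The hypotheses and risk levels below are hypothetical, not those of the actual EGEE checker:

```python
# Wald's sequential probability ratio test (SPRT) for a Bernoulli error rate:
# H0: p = p0 (results trustworthy) vs H1: p = p1 (node unreliable).
import math

def sprt(observations, p0=0.01, p1=0.10, alpha=0.01, beta=0.01):
    """observations: 1 if a spot-checked result was wrong, else 0.
    Returns 'accept H0', 'accept H1', or 'continue' (need more checks)."""
    a = math.log(beta / (1 - alpha))   # lower decision threshold
    b = math.log((1 - beta) / alpha)   # upper decision threshold
    llr = 0.0
    for x in observations:
        # Add the log-likelihood ratio of this observation under H1 vs H0.
        llr += math.log((p1 if x else 1 - p1) / (p0 if x else 1 - p0))
        if llr <= a:
            return 'accept H0'  # error rate acceptably low: trust the results
        if llr >= b:
            return 'accept H1'  # error rate too high: recompute or flag
    return 'continue'

print(sprt([0] * 60))  # accept H0: a long run of correct spot-checks
print(sprt([1] * 5))   # accept H1: errors accumulate evidence very fast
```

The appeal for Monte-Carlo grid jobs is that the test stops as early as the evidence permits, so honest nodes are cleared with few redundant recomputations.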