Section: New Results
Data Reduction and Classification
Data reduction and classification are needed to cluster large data sets concisely. We use two formalisms for clustering data: grid-based conceptual hierarchies for database summarization, and parametric probabilistic models for the continuous multivariate spaces typically encountered with multimedia data. To deal with distributed data sources, we have addressed the problem of integrating (possibly hierarchical) structures. Our focus is on the integration of data descriptions, without resorting to raw data. We have also addressed the problem of efficiently querying database summaries.
Database Summaries
Participants : Guillaume Raschia, Mounir Bechchi, Quang-Khai Pham.
Our database summarization system DBSum provides multi-level summaries of tabular data stored in a centralized database. Summaries are computed online by means of a grid-based conceptual hierarchical clustering algorithm. Within this research area, we pursued two distinct directions: (i) approximate answering in large and distributed databases, and (ii) summarization of transaction databases.
In [13], we first proposed an efficient and effective algorithm, coined the Explore-Select-Rearrange Algorithm (ESRA) and based on the DBSum model, to quickly provide users with hierarchical clustering schemas of their query results. Each node (or summary) of the hierarchy produced by ESRA describes a subset of the result set in a user-friendly form based on domain knowledge. The user then navigates through this hierarchy in a top-down fashion, exploring the summaries of interest while ignoring the rest, as sketched below. Experimental results show that ESRA is efficient and provides well-formed (tight and clearly separated) and well-organized clusters of query results [47].
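As an illustration only, the following minimal sketch shows what top-down navigation of such a summary hierarchy might look like; the SummaryNode type and the explore function are hypothetical, not the actual DBSum/ESRA implementation.

    # Illustrative sketch only: a minimal summary-hierarchy node and a
    # top-down navigation loop, assuming each node stores a user-friendly
    # description and the subset of result tuples it covers.
    from dataclasses import dataclass, field
    from typing import Callable, List

    @dataclass
    class SummaryNode:
        description: str            # label built from domain knowledge
        tuple_ids: List[int]        # subset of the query result it covers
        children: List["SummaryNode"] = field(default_factory=list)

    def explore(node: SummaryNode,
                is_interesting: Callable[[str], bool]) -> List[int]:
        """Drill into interesting summaries, ignore the rest."""
        if not node.children:            # leaf summary: return its tuples
            return node.tuple_ids
        selected: List[int] = []
        for child in node.children:
            if is_interesting(child.description):
                selected.extend(explore(child, is_interesting))
        return selected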
The ESRA algorithm assumes that the summary hierarchy of the queried data is already built using DBSum and available as input. However, DBSum requires full access to the data to be summarized. This requirement severely limits the applicability of the ESRA algorithm in a distributed environment, where data is spread across many sites and transmitting it to a central site is not feasible or even desirable. Therefore, we proposed a solution for summarizing distributed data without a prior 'unification' of the data sources. We assume that the sources maintain their own summary hierarchies (local models), and we propose new algorithms for merging them into a single final one (global model). An experimental study shows that our merging algorithms produce high-quality clustering schemas of the entire distributed data and are very efficient in terms of computational time.
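The actual merging algorithms are described in the publication; the sketch below (reusing the hypothetical SummaryNode type above) is only a naive illustration of the idea of building a global model from two local hierarchies, with an equality test on descriptions standing in for a real summary-matching criterion.

    # Naive illustrative sketch: merge two local summary hierarchies into a
    # global one by pairing children with matching descriptions and
    # recursing; unmatched local summaries are kept as-is.
    def merge(a: SummaryNode, b: SummaryNode) -> SummaryNode:
        root = SummaryNode(a.description, a.tuple_ids + b.tuple_ids)
        matched = set()
        for ca in a.children:
            partner = next((cb for cb in b.children
                            if cb.description == ca.description
                            and id(cb) not in matched), None)
            if partner is not None:
                matched.add(id(partner))
                root.children.append(merge(ca, partner))  # merge matches
            else:
                root.children.append(ca)        # keep unmatched local node
        root.children.extend(cb for cb in b.children if id(cb) not in matched)
        return root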
As a second contribution to data reduction, we went one step further [39] and defined Time Sequence Summarization to support chronology-dependent applications on massive data sources. Time sequence summarization takes as input a sequence of events, where each event is described by a set of descriptors, and produces a concise time sequence that can be substituted for the original one in chronology-dependent applications. We proposed an algorithm that achieves time sequence summarization through a generalization, grouping and concept formation process. Generalization expresses event descriptors at higher levels of abstraction using taxonomies, while grouping gathers similar events. Concept formation reduces the size of the input time sequence by representing each group with a single concept. The process is performed in such a way that the overall chronology of events is preserved. The algorithm computes the summary incrementally and has reduced algorithmic complexity. The resulting output is a concise representation, yet informative enough to directly support chronology-dependent applications. We validated our approach by summarizing one year of financial news provided by Reuters.
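As a toy illustration of the generalization/grouping/concept-formation idea, the sketch below summarizes an event sequence incrementally; the taxonomy, the events and the merging rule (consecutive events with identical generalized descriptors) are simplifying assumptions, not the algorithm of [39].

    # Illustrative sketch: generalize descriptors through a toy taxonomy,
    # then merge consecutive events sharing the same generalized
    # descriptors into one concept, preserving chronology.
    from typing import Dict, FrozenSet, List, Tuple

    TAXONOMY: Dict[str, str] = {"Airbus": "aerospace", "Boeing": "aerospace",
                                "oil": "energy", "gas": "energy"}

    def generalize(descriptors: FrozenSet[str]) -> FrozenSet[str]:
        return frozenset(TAXONOMY.get(d, d) for d in descriptors)

    def summarize(events: List[Tuple[int, FrozenSet[str]]]):
        """Events are (timestamp, descriptors); returns a list of concepts
        (start, end, generalized descriptors) in chronological order."""
        summary = []
        for ts, desc in events:
            g = generalize(desc)
            if summary and summary[-1][2] == g:
                start, _, _ = summary[-1]
                summary[-1] = (start, ts, g)   # extend the current concept
            else:
                summary.append((ts, ts, g))    # open a new concept
        return summary

    # Four news events collapse into two concepts:
    # aerospace over [1,2] and energy over [3,4].
    events = [(1, frozenset({"Airbus"})), (2, frozenset({"Boeing"})),
              (3, frozenset({"oil"})), (4, frozenset({"gas"}))]
    print(summarize(events))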
Distributed Learning of Probabilistic Class Models
Participants : Pierrick Bruneau, Ali El Attar, Marc Gelgon.
Learning a probabilistic model that describes the distribution of numerical features in a multidimensional continuous space, for supervised or unsupervised classification, is a fundamental and widely studied task. When data sources are distributed and dynamic, existing solutions must be reconsidered. Indeed, we are witnessing strongly rising attention to classification and recognition from distributed data, supplied for instance by sensor networks or social networks.
Our proposal focuses on mixture aggregation, based on probabilistic modelling of the parameters of the aggregated model and a variational Bayesian estimation procedure. To improve the model, we introduced a prior, based on a Poisson distribution, that favours grouping components coming from distinct models. We showed that this generally improves the quality of the result and significantly speeds up computation, but it required the design of a new optimization scheme for the variational-Bayes EM algorithm [18], [25], [44]. We have recently extended this work to handle aggregation of models on different manifolds, i.e., variational-Bayes aggregation, at the parameter level, of mixtures of probabilistic PCA (paper submitted).
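For intuition only, the sketch below shows the elementary operation underlying component grouping: a standard moment-preserving merge of two Gaussian mixture components. This is textbook aggregation, not the variational-Bayes scheme of [18], [25], [44], which additionally estimates which components should be grouped.

    # Illustrative sketch: merge two Gaussian components (weight, mean,
    # covariance) so that the merged Gaussian matches the first two
    # moments of their weighted combination.
    import numpy as np

    def merge_components(w1, m1, C1, w2, m2, C2):
        w = w1 + w2
        m = (w1 * m1 + w2 * m2) / w
        # weighted covariances plus the spread of the two means
        d1, d2 = m1 - m, m2 - m
        C = (w1 * (C1 + np.outer(d1, d1))
             + w2 * (C2 + np.outer(d2, d2))) / w
        return w, m, C

    w, m, C = merge_components(0.6, np.array([0.0, 0.0]), np.eye(2),
                               0.4, np.array([2.0, 0.0]), np.eye(2))
    print(w, m)  # 1.0 [0.8 0.]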
We are also extending this study to handle statistically robust clustering of distributed data. For this purpose, we have considered the counterpart of the above scheme in its Student mixture model version. Student distributions may indeed be viewed as gamma-weighted infinite mixtures of Gaussians, thanks to which mixture model estimation can be made insensitive to a moderate amount of outlier data [31].
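For reference, the scale-mixture identity behind this view (a standard textbook fact, not specific to [31]) writes a Student distribution with mean \mu, precision \Lambda and \nu degrees of freedom as

    \mathrm{St}(x \mid \mu, \Lambda, \nu) = \int_0^\infty \mathcal{N}\!\left(x \mid \mu, (u\Lambda)^{-1}\right) \, \mathrm{Gam}\!\left(u \mid \tfrac{\nu}{2}, \tfrac{\nu}{2}\right) du

so an outlier can be explained by a small scale u rather than by distorting the mean and precision estimates, which is what makes the estimation robust.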
Finally, in cooperation with the COD team at LINA, we have proposed a scheme for interactive, semi-supervised clustering of a set of mixture models [19] and explored connections with biomimetic techniques for distributed clustering [26] .