Team AxIS

Section: New Results

Discovering Informative Feature Set over High-dimensions

Participants: Chongsheng Zhang, Florent Masseglia.

This work takes place in the context of Chongsheng Zhang's Ph.D. thesis. Part of it is funded by MIDAS (Mining Data Streams), an ANR project (cf. 8.2.2).

In this work, we tackle the problem of selecting an informative feature set from unlabeled high-dimensional data. Unlike frequent pattern mining, which only counts how often the features of a pattern appear together in the transactions, informative feature set selection must also account for the other cases that occur, for instance transactions in which some features of the set appear while the others do not. Selecting the most informative feature set of size k in high-dimensional data is difficult for two reasons: first, there are many candidate sets of k features, and for each candidate we have to estimate the probability of every case that occurs; second, high dimensionality makes the number of candidates to check very large. To tackle the problem, we propose a heuristic theory that reduces the candidate features for the informative feature set to a small subset. We then build a forward selection algorithm that discovers the most informative feature set from these carefully selected features. We also design a data structure that quickly computes the entropy of the features, and we introduce a pruning strategy at each forward extension to minimize the number of candidates to evaluate. This work has not been published yet, but our experiments on real-world data sets demonstrate the efficiency and effectiveness of the heuristic theory.
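
As an illustration only, the forward selection step described above could be sketched as follows in Python. The sketch assumes that informativeness is measured by the empirical joint entropy of the selected features over the transactions, which is one plausible reading of the text rather than a confirmed definition. The names `joint_entropy` and `forward_select` are hypothetical, and the heuristic candidate reduction, the dedicated data structure, and the pruning strategy mentioned above are not reproduced here.

```python
from collections import Counter
from math import log2


def joint_entropy(rows, features):
    """Empirical joint entropy, in bits, of the given feature columns.

    Every observed combination of values counts, including rows where
    only some of the features are present (value 1) and others absent.
    """
    n = len(rows)
    counts = Counter(tuple(row[f] for f in features) for row in rows)
    return -sum((c / n) * log2(c / n) for c in counts.values())


def forward_select(rows, candidates, k):
    """Greedy forward selection (assumed criterion: joint entropy).

    Starts from an empty set and, at each extension, adds the candidate
    feature that maximizes the joint entropy of the set built so far.
    Ties are broken in favor of the earliest candidate.
    """
    selected = []
    remaining = list(candidates)
    while remaining and len(selected) < k:
        best = max(remaining, key=lambda f: joint_entropy(rows, selected + [f]))
        selected.append(best)
        remaining.remove(best)
    return selected


# Toy transactions over three binary features (1 = feature present).
rows = [
    (1, 1, 0),
    (1, 1, 1),
    (0, 0, 0),
    (0, 0, 1),
]
print(forward_select(rows, [0, 1, 2], 2))
```

In this toy data, features 0 and 1 always take the same value, so adding feature 1 to feature 0 contributes no extra information, while feature 2 varies independently of both. A greedy pass like this evaluates every remaining candidate at each extension, which is exactly the cost that a candidate reduction and a pruning strategy would aim to cut down.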

