Section: New Results
Discovering Informative Feature Sets in Data Streams
Participants: Chongsheng Zhang, Florent Masseglia.
One particular case of data streams is that of categorical data, where each object is a set of features. Unfortunately, in the real world, objects do not arrive as complete sets of features. Consider, for instance, the usage of a Web site. Each user is considered as an object and each page requested by that user as a feature. Users make their requests one after the other, which means that, at each step, the set of features of a given user evolves. Analyzing such data streams, made of evolving objects, is an important topic today and raises numerous challenges.
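To make the notion of an evolving object concrete, here is a minimal Python sketch of the Web usage example. The function name `replay_requests`, the user identifiers and the page paths are hypothetical and chosen only for illustration; the sketch simply assumes the stream arrives as (user, page) pairs.

```python
from collections import defaultdict


def replay_requests(requests):
    """Yield (user, current feature set) after each request in the stream.

    Each user is an object; each requested page is a feature.
    The input format, an iterable of (user, page) pairs, is an assumption.
    """
    features = defaultdict(set)  # user -> set of pages requested so far
    for user, page in requests:
        features[user].add(page)
        # Emit an immutable snapshot so later requests do not alter it.
        yield user, frozenset(features[user])


# Hypothetical stream: user u1 requests three pages, one of them twice.
stream = [("u1", "/home"), ("u2", "/home"), ("u1", "/news"), ("u1", "/home")]
snapshots = list(replay_requests(stream))
for user, feature_set in snapshots:
    print(user, sorted(feature_set))
```

The feature set of u1 grows from one page to two and then stays unchanged when a page is requested again, which is the evolution that a stream mining method has to track.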
Chongsheng Zhang is studying these challenges in his Ph.D. thesis. Part of this thesis is funded by MIDAS (Mining Data Streams), an ANR project (cf. section 8.2.3).
This year, Chongsheng has particularly studied feature extraction from such data streams. Feature selection is the task of selecting interesting or important features and removing irrelevant or redundant ones. There is a large body of existing work on feature selection [75]. Depending on the application and its needs, different interestingness measures may be used to weight features or feature sets [77]. Information-based feature selection methods require computing the probabilities of all features and of all possible feature subsets. This exhaustive computation is extremely time-consuming, and the situation is much worse for streaming data, because the probabilities of features and feature sets change constantly. To deal with this problem, we introduce the heuristic algorithm StreamHI. It is based on a candidate generation principle and a pruning principle: the former keeps and monitors as many likely candidates as possible, while the latter removes redundant and hopeless candidates. Our contributions are i) a definition of the problem of online informative feature set selection from data streams and ii) StreamHI, a heuristic method for mining informative feature sets in real time. We compared StreamHI with naive methods in a series of experiments, which demonstrate its efficiency and effectiveness:
- Execution times of StreamHI are one order of magnitude lower than those of a naive approach based on a method from the literature designed for static data.
- The itemset extracted by StreamHI is usually the same as, and sometimes better (in terms of entropy) than, the itemset extracted by the naive approach.
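The cost of the exhaustive computation mentioned above can be illustrated with a short Python sketch. It is not StreamHI, whose details are not given here; it only shows a naive search over a fixed window of objects, under the assumption that the informativeness of a feature set is measured by the joint entropy of its presence/absence patterns. The names `joint_entropy` and `naive_best_set` and the sample window are hypothetical.

```python
from collections import Counter
from itertools import combinations
from math import log2


def joint_entropy(objects, feature_set):
    """Entropy, in bits, of the presence/absence pattern of feature_set.

    Assumption: informativeness is measured by joint entropy.
    """
    feats = sorted(feature_set)
    # Count how often each presence/absence combination occurs.
    patterns = Counter(tuple(f in obj for f in feats) for obj in objects)
    n = len(objects)
    return -sum((c / n) * log2(c / n) for c in patterns.values())


def naive_best_set(objects, k):
    """Exhaustively score every feature set of size k (naive baseline)."""
    all_features = sorted(set().union(*objects))
    return max(combinations(all_features, k),
               key=lambda fs: joint_entropy(objects, fs))


# Hypothetical window: feature "c" is present everywhere, so it carries no information.
window = [{"a", "b", "c"}, {"a", "c"}, {"b", "c"}, {"c"}]
best = naive_best_set(window, 2)
print(best, joint_entropy(window, best))
```

With m distinct features, this baseline scores every one of the C(m, k) candidate sets and has to be rerun whenever the window changes, which is why a heuristic that keeps and prunes candidates incrementally is attractive for streams.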
This work has been accepted as a long paper at a national conference (EGC 2010).
For 2010, Chongsheng will work on discovering different schemas from data streams where objects are sets of features. Today, in usage data streams, most objects consist of only one feature. Extracting relevant and useful knowledge from these data streams is a challenge, since a single feature is not very informative. Our goal is to improve that knowledge despite the rare occurrences of objects having more than one feature.