Section: New Results
Keywords : Data streams, Outlier, Multi-resolution, Wavelets, Self-adjusting.
Detecting outliers in data streams: a self-adjusting method
Participants : Alice Marascu, Florent Masseglia, Yves Lechevallier.
Outlyingness is a subjective concept that relies on the level of isolation of a record or set of records. Clustering-based outlier detection clusters the data and then detects outliers according to the characteristics of the clusters (small, tight and/or dense clusters might be considered as outliers). To separate common behaviours from outliers, existing methods require a parameter, such as the percentage of small clusters to be considered as outliers, or the number of top-n outliers to report. However, setting such a parameter is not always possible in a data stream environment.
Starting from this observation, we propose WOD (Wavelet-based Outlier Detection), a parameterless method that automatically extracts outliers from a dataset. In contrast to previous work, our goal is to find the best division of a distribution and to automatically separate its values into two sets: clusters on the one hand and outliers on the other. The tail of the distribution is found with a wavelet technique and does not depend on a user threshold. The method fits any distribution built on a characteristic such as distances between objects, object density or cluster size.
The key idea of WOD is to use a wavelet transform to cut such a distribution. Given prior knowledge of the number of plateaux (we want two: the first stands for small groups, or outliers, and the second for big groups, or clusters), the distribution can be cut very effectively. WOD has two advantages: i) it adjusts automatically when the shape of the distribution changes, and ii) it gives a relevant and accurate detection of outliers, with very natural results. Our experiments, performed on real data, confirm this separation ability of WOD compared with well-known outlier detection principles such as the top-k outliers or the percentage filter.
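The two-plateaux idea can be illustrated with a minimal sketch. It is not the WOD implementation: it assumes the distribution is a list of cluster sizes, uses an unnormalised Haar approximation (repeated pairwise averaging) reduced to two coefficients, pads the sorted values to a power-of-two length by repeating the largest value, and assigns each value to the nearest plateau. The function names `haar_two_plateaux` and `split_outliers` are hypothetical.

```python
def haar_two_plateaux(values):
    """Reduce a sorted distribution to two Haar approximation coefficients.

    Illustrative assumption: the two coefficients left after repeated
    pairwise averaging are taken as the "small" and "big" plateaux.
    """
    if not values:
        raise ValueError("values must not be empty")
    data = sorted(values)
    # Pad to a power-of-two length (at least 2) by repeating the largest
    # value. This padding rule is an assumption of the sketch.
    size = 2
    while size < len(data):
        size *= 2
    data = data + [data[-1]] * (size - len(data))
    # Each pass halves the resolution: pairwise averages are the
    # unnormalised Haar approximation coefficients of the previous level.
    while len(data) > 2:
        data = [(data[i] + data[i + 1]) / 2 for i in range(0, len(data), 2)]
    return data[0], data[1]


def split_outliers(values):
    """Split values into (outliers, clusters) without a user threshold."""
    low, high = haar_two_plateaux(values)
    # A value is an outlier when it is strictly closer to the low plateau.
    outliers = [v for v in values if abs(v - low) < abs(v - high)]
    clusters = [v for v in values if abs(v - low) >= abs(v - high)]
    return outliers, clusters


if __name__ == "__main__":
    cluster_sizes = [1, 2, 1, 3, 2, 40, 45, 50]  # made-up sizes
    outliers, clusters = split_outliers(cluster_sizes)
    print("outliers:", outliers)
    print("clusters:", clusters)
```

On the made-up sizes above, the two plateaux are 1.5 and 34.5, so the five small groups fall on the outlier side and the three large groups on the cluster side. Because the plateaux are recomputed from the data, the cut moves with the distribution and no percentage or top-n value has to be supplied.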
Two papers (PAKDD, SAC) and one poster (EGC) were accepted at conferences in 2009.