Keywords : web usage mining, pre-processing, HTTP logs.
Web Log Preprocessing and Sequential Pattern Extraction
Participants : Brigitte Trousse [ co-correspondent ] , Yves Lechevallier [ co-correspondent ] , Anli Abdouroihamane, Celine Fiot, Cristina Isai.
AxISLogMiner is a software application that implements our preprocessing methodology for Web Usage Mining (WUM) and our work on sequential pattern extraction with low support.
We implemented the application in Java, which brings benefits both in functionality and in ease of implementation. The application uses Perl modules for the operations carried out on the log files: joining log files, cleaning the logs, filtering robot requests and identifying sessions, visits and episodes. The preprocessed log is stored in our relational model through JDBC. The result of this preprocessing is then used in a data mining tool to extract, for instance, sequential patterns, that is, sequences of Web pages frequently requested by users. The software can also record the keywords that users entered in search engines to reach the pages they browsed.
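Two of the preprocessing steps mentioned above, visit identification and search keyword recording, can be illustrated with a minimal Java sketch. It is not the AxISLogMiner code (the text states that these operations are done in Perl modules). The class and method names, the 30-minute inactivity threshold and the assumption that search engines pass the terms in a query parameter named `q` are all hypothetical choices made for the illustration.

```java
import java.net.URI;
import java.net.URLDecoder;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

class PreprocessSketch {

    /** One cleaned request: who asked, when (epoch seconds) and which page. */
    static final class Request {
        final String user;
        final long epochSeconds;
        final String url;

        Request(String user, long epochSeconds, String url) {
            this.user = user;
            this.epochSeconds = epochSeconds;
            this.url = url;
        }
    }

    /** Assumed inactivity threshold (30 minutes); not a value taken from the text. */
    static final long TIMEOUT_SECONDS = 30 * 60;

    /**
     * Splits the time-ordered requests of a single user into visits: a new
     * visit starts whenever the gap since the previous request exceeds the
     * timeout.
     */
    static List<List<Request>> splitIntoVisits(List<Request> requestsOfOneUser) {
        List<List<Request>> visits = new ArrayList<>();
        List<Request> current = new ArrayList<>();
        for (Request r : requestsOfOneUser) {
            if (!current.isEmpty()) {
                long previous = current.get(current.size() - 1).epochSeconds;
                if (r.epochSeconds - previous > TIMEOUT_SECONDS) {
                    visits.add(current);
                    current = new ArrayList<>();
                }
            }
            current.add(r);
        }
        if (!current.isEmpty()) {
            visits.add(current);
        }
        return visits;
    }

    /**
     * Returns the search terms carried by a referrer URL, or "" if none are
     * found. Assumes the terms are in a query parameter named "q".
     */
    static String searchKeywords(String referrer) {
        if (referrer == null) {
            return "";
        }
        try {
            String query = URI.create(referrer).getRawQuery();
            if (query == null) {
                return "";
            }
            for (String pair : query.split("&")) {
                if (pair.startsWith("q=")) {
                    return URLDecoder.decode(pair.substring(2), StandardCharsets.UTF_8);
                }
            }
        } catch (IllegalArgumentException e) {
            // Malformed referrer: treat it as carrying no keywords.
        }
        return "";
    }

    public static void main(String[] args) {
        List<Request> requests = List.of(
                new Request("u1", 0, "/index.html"),
                new Request("u1", 120, "/docs.html"),
                new Request("u1", 120 + 3600, "/index.html"));
        System.out.println(splitIntoVisits(requests).size() + " visits");
        System.out.println(searchKeywords("http://www.google.com/search?q=web+usage+mining&hl=en"));
    }
}
```

A real preprocessing chain would first group requests per user (for instance by host and user agent) and would need one parameter name per supported search engine; both points are left out here to keep the sketch short.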
This year, in the context of the Eiffel project and building on the AxISLogMiner preprocessing tool, we isolated and redesigned its core, called AWLH, which is composed of a set of tools for preprocessing Web log files. It can extract and structure log files from one or several Web servers that use different input formats. As usual, the Web log files are cleaned before being used by the data mining tool, because they contain many noisy entries; for example, robots introduce a lot of noise into the analysis of user behaviour, so it is important to identify robot requests. The data are stored in a database whose model has been improved. The features of the current version of AWLH are:
Processing of several log files from several servers (different formats);
Support of several input formats (CLF, ECLF, IIS, custom, ...);
Java API to ease the integration of AWLH into external applications.
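As an illustration of what handling several input formats and flagging robot requests can look like, here is a minimal Java sketch that parses a log line in either CLF or ECLF (CLF followed by quoted referrer and user-agent fields) and applies a simple robot heuristic. It is not the AWLH API: the class, the regular expression and the heuristic (a request for `/robots.txt`, or a user agent containing "bot", "crawler" or "spider") are assumptions made for the example.

```java
import java.util.Locale;
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LogLineSketch {

    /** Fields kept from one entry; referrer and userAgent are null for plain CLF lines. */
    static final class Entry {
        final String host;
        final String timestamp;
        final String method;
        final String url;
        final int status;
        final String referrer;
        final String userAgent;

        Entry(String host, String timestamp, String method, String url,
              int status, String referrer, String userAgent) {
            this.host = host;
            this.timestamp = timestamp;
            this.method = method;
            this.url = url;
            this.status = status;
            this.referrer = referrer;
            this.userAgent = userAgent;
        }
    }

    // CLF:  host ident authuser [date] "method url protocol" status bytes
    // ECLF: the same, followed by "referrer" "user-agent" (optional group below)
    private static final Pattern LINE = Pattern.compile(
            "^(\\S+) \\S+ \\S+ \\[([^\\]]+)\\] \"(\\S+) (\\S+)[^\"]*\" (\\d{3}) \\S+"
            + "(?: \"([^\"]*)\" \"([^\"]*)\")?\\s*$");

    /** Parses one line; returns empty when the line matches neither format. */
    static Optional<Entry> parse(String line) {
        Matcher m = LINE.matcher(line);
        if (!m.matches()) {
            return Optional.empty();
        }
        return Optional.of(new Entry(m.group(1), m.group(2), m.group(3), m.group(4),
                Integer.parseInt(m.group(5)), m.group(6), m.group(7)));
    }

    /** Assumed heuristic: a robots.txt request or a tell-tale user-agent substring. */
    static boolean looksLikeRobot(Entry e) {
        if (e.url.equals("/robots.txt")) {
            return true;
        }
        if (e.userAgent == null) {
            return false;
        }
        String ua = e.userAgent.toLowerCase(Locale.ROOT);
        return ua.contains("bot") || ua.contains("crawler") || ua.contains("spider");
    }

    public static void main(String[] args) {
        String line = "192.0.2.1 - - [10/Oct/2006:13:55:36 +0200] "
                + "\"GET /index.html HTTP/1.0\" 200 2326 "
                + "\"http://www.example.org/\" \"Googlebot/2.1\"";
        parse(line).ifPresent(e ->
                System.out.println(e.host + " " + e.url + " robot=" + looksLikeRobot(e)));
    }
}
```

This sketch judges each entry on its own. A fuller cleaning step would typically mark every request from a host once that host has asked for `/robots.txt`, and would rely on a maintained list of known robot user agents.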
We also developed a tool based on an open source project called "OpenSymphony ClickStream". With OpenSymphony ClickStream, we record in real time the click actions made by a user. During the capture process, we create a table that the AWLH tool then uses to populate the tables required for the preprocessing and processing phases of the WUM process.