Section: Scientific Foundations
Information Systems Data Mining
There are two main motivations for usage mining in the context of ISs or search engines:
supporting the re-design process of ISs or search engines by better understanding the user practices and by comparing the IS structure with usage analysis results;
supporting information retrieval by reusing user groups' practices, which is called “collaborative filtering,” via the design of adaptive recommender systems or ISs (cf. section 3.4).
Usage mining corresponds to data mining (or more generally KDD) applied to usage data. By usage data, we mean the traces of user behaviours in log files.
Let us consider the KDD process represented by Fig. 2.
This process involves four main steps:
data selection aims at extracting, from the database or data warehouse, the information required by the data mining step.
data transformation will then use parsers in order to create data tables usable by the data mining algorithms.
data mining applies techniques ranging from sequential patterns to association rules or cluster discovery.
finally, the last step supports re-using previous results in a usage analysis process.
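The first three steps above can be sketched as a toy pipeline. This is a minimal illustration, not the project's actual tooling; the record fields and the trivial "pattern" (per-page visit counts) are assumptions made for the example.

```python
# Hypothetical sketch of the KDD steps on toy log records; all field
# names are illustrative, not taken from the text.

raw_log = [
    {"user": "u1", "url": "/home", "status": 200},
    {"user": "u1", "url": "/docs", "status": 200},
    {"user": "u2", "url": "/home", "status": 404},
]

def select(records):
    # Step 1: keep only the rows the data mining step needs.
    return [r for r in records if r["status"] == 200]

def transform(records):
    # Step 2: parse into a tabular form usable by mining algorithms.
    return [(r["user"], r["url"]) for r in records]

def mine(table):
    # Step 3: a trivial "pattern" -- page visit counts.
    counts = {}
    for user, url in table:
        counts[url] = counts.get(url, 0) + 1
    return counts

patterns = mine(transform(select(raw_log)))
print(patterns)  # {'/home': 1, '/docs': 1}
```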
More precisely, the first three steps involve five important research directions:
Data selection and transformation
We insist on the importance of the pre-processing step in the KDD process. This step can be decomposed into selection and transformation sub-steps.
The considered KDD methods applied on usage data rely on the notion of user session, represented through a tabular model (items), an association rules model (itemsets) or a graph model. This notion of session enables us to act at the appropriate level in the knowledge extraction process from log files. Our goal is to build summaries and generate statistics on these summaries. At this level of formalization we can consider rules and graphs, define hierarchical structures on variables, extract sequences and thus build new types of data by using KDD methods.
Then, as the analysis methods come from various research fields (data analysis, statistics, data mining, AI, etc.), data transformations may be required and will be managed by appropriate parsers. Input data will come from intermediary databases, standard formatted files (XML) or a proprietary format.
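As an illustration of such a parser, the sketch below turns Apache-style access-log lines into per-host page sequences (a crude stand-in for user sessions). The log lines and the regular expression are assumptions for the example; a real sessionizer would also split a host's hits on a time-out.

```python
# A minimal log-parsing sketch, assuming Apache-style access-log lines.
import re
from collections import defaultdict

LINE = re.compile(r'(\S+) - - \[([^\]]+)\] "GET (\S+) HTTP/1\.\d"')

def parse(line):
    # Extract (host, requested page) from one log line, or None.
    m = LINE.match(line)
    return (m.group(1), m.group(3)) if m else None

def sessions(lines):
    # Group pages by client host, preserving request order.
    by_host = defaultdict(list)
    for line in lines:
        hit = parse(line)
        if hit:
            by_host[hit[0]].append(hit[1])
    return dict(by_host)

log = [
    '10.0.0.1 - - [01/Jan/2006:10:00:00] "GET /index.html HTTP/1.1"',
    '10.0.0.1 - - [01/Jan/2006:10:00:05] "GET /papers.html HTTP/1.1"',
    '10.0.0.2 - - [01/Jan/2006:10:01:00] "GET /index.html HTTP/1.0"',
]
print(sessions(log))
```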
Data mining: extracting association rules
Our preprocessing tools (or generalization operators), introduced in the previous paragraph, were designed to build summaries and to generate statistics on these summaries. At this level of formalization we can consider rules and graphs, define hierarchical structures on variables, extract sequences, and thus build new types of data by using methods for extracting frequent itemsets or association rules.
These methods were first proposed in 1993 by R. Agrawal, T. Imielinski and A. Swami (database researchers at the IBM Almaden research center). They are available in commercial data mining software (IBM's Intelligent Miner or SAS's Enterprise Miner).
Our approach will rely on work from the field of generalization operators and data aggregation. These summaries can be integrated in a recommendation mechanism for helping the user. We propose to adapt frequent itemset research methods or association rule discovery methods to the Web Usage Mining problem. We may take inspiration from methods coming from genomics (a field which presents common characteristics with ours). If the goal of the analysis can be expressed in a decisional framework, then clustering methods will identify usage groups based on the extracted rules.
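The frequent-itemset discovery mentioned above can be sketched with a compact Apriori-style level-wise search, in the spirit of Agrawal, Imielinski and Swami (1993). The toy sessions and the support threshold are illustrative assumptions, and the sketch omits the usual candidate-pruning optimizations.

```python
# An Apriori-style frequent-itemset sketch (illustrative, unoptimized).
from itertools import combinations

def frequent_itemsets(transactions, minsup):
    items = sorted({i for t in transactions for i in t})
    result, k = {}, 1
    candidates = [frozenset([i]) for i in items]
    while candidates:
        # Count the support of each candidate itemset.
        counts = {c: sum(c <= t for t in transactions) for c in candidates}
        frequent = {c: n for c, n in counts.items() if n >= minsup}
        result.update(frequent)
        k += 1
        # Generate size-k candidates by joining frequent (k-1)-itemsets.
        keys = list(frequent)
        candidates = list({a | b for a, b in combinations(keys, 2)
                           if len(a | b) == k})
    return result

sessions = [frozenset(s) for s in (["home", "docs"],
                                   ["home", "faq"],
                                   ["home", "docs", "faq"])]
freq = frequent_itemsets(sessions, minsup=2)
print(freq[frozenset(["home", "docs"])])  # 2
```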
Data mining: discovering sequential patterns
Knowledge about the user can be extracted based on sequential pattern discovery (sequential patterns are inter-transaction patterns).
Sequential patterns offer a strong correlation with Web Usage Mining purposes (and more generally with usage analysis problems). Our goal is to provide extraction methods which are as efficient as possible, and also to improve the relevance of their results. For this purpose, we plan to improve sequential pattern extraction methods by taking into account the context where those methods are involved. This can be done:
by analyzing the causes of sequential pattern extraction failures on large access logs. It is necessary to understand and incorporate the great variety of potential behaviours on a Web site. This variety is mainly due to the large size of the trees representing the Web sites and the very large number of combinations of navigations on those sites.
by incorporating all the available information related to usage. Taking into account several information sources in a single sequential pattern extraction process is challenging and can lead to numerous opportunities.
finally, sequential pattern mining methods will be adapted to a new and growing domain: data streams. In fact, in many practical cases, data cannot be stored for more than a specific period of time (and possibly not at all). We need to develop good solutions for adapting data mining methods to the specific constraints related to this domain (no multiple scans over the data, no blocking actions, etc.).
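The core notion behind the above, the support of a sequential pattern, can be sketched as follows: a pattern ⟨a, b⟩ is supported by a session if a occurs before b in it. The toy sessions are illustrative assumptions, and real miners enumerate candidate patterns rather than checking a given one.

```python
# A minimal sketch of sequential-pattern support counting (illustrative).

def is_subsequence(pattern, sequence):
    # True if pattern's items appear in sequence, in order.
    it = iter(sequence)
    return all(item in it for item in pattern)

def support(pattern, sessions):
    # Number of sessions supporting the pattern.
    return sum(is_subsequence(pattern, s) for s in sessions)

sessions = [
    ["home", "search", "paper", "download"],
    ["home", "paper", "home", "download"],
    ["search", "home"],
]
print(support(["home", "download"], sessions))  # 2
```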
Data mining: clustering approach to reduce the volume of data in data warehouses
Clustering is one of the most popular techniques in knowledge acquisition and it is applied in various fields including data mining and statistical data analysis. Clustering involves organizing a set of individuals into clusters in such a way that individuals within a given cluster have a high degree of similarity, while individuals belonging to different clusters have a high degree of dissimilarity.
The definition of a 'homogeneous' cluster depends on the particular algorithm: it is indeed a simple structure which, in the absence of prior knowledge about the multidimensional shape of the data, may be a reasonable starting point towards the discovery of richer and more complex structures.
Clustering methods reduce the volume of data in data warehouses, preserving the possibility to perform needed analysis. The rapid accumulation of large databases of increasing complexity poses a number of new problems that traditional algorithms are not equipped to address. One important feature of modern data collection is the ever increasing size of a typical database: it is not so unusual to work with databases containing data from a few thousands to a few million individuals and hundreds or thousands of variables. Currently, most clustering algorithms of the traditional type are severely limited regarding the number of individuals they can comfortably handle.
Cluster analysis may be divided into hierarchical and partitioning methods. Hierarchical methods yield a complete hierarchy, i.e., a nested sequence of partitions of the input data. Hierarchical methods can be agglomerative or divisive. Agglomerative methods yield a sequence of nested partitions starting with the trivial clustering in which each individual is in a unique cluster and ending with the trivial clustering in which all individuals are in the same cluster. A divisive method starts with all individuals in a single cluster and performs divisions until a stopping criterion is met. Partitioning methods aim at obtaining a partition of the set of individuals into a fixed number of clusters. These methods identify the partition that optimizes (usually locally) an adequacy criterion.
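The agglomerative idea can be sketched on 1-D toy data: start from singleton clusters and merge neighbours until no gap is below a distance threshold. The data, the threshold, and the restriction to sorted 1-D points are all illustrative assumptions.

```python
# An agglomerative (single-link) clustering sketch on sorted 1-D data.

def single_link(points, threshold):
    clusters = [[p] for p in sorted(points)]
    merged = True
    while merged:
        merged = False
        for i in range(len(clusters) - 1):
            # Single-link distance between adjacent sorted clusters.
            if clusters[i + 1][0] - clusters[i][-1] <= threshold:
                clusters[i] = clusters[i] + clusters.pop(i + 1)
                merged = True
                break
    return clusters

print(single_link([1.0, 1.2, 5.0, 5.1, 9.0], threshold=0.5))
# [[1.0, 1.2], [5.0, 5.1], [9.0]]
```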
Data mining: reusing usage analysis experiences
This work aims at re-using previous analysis results in current analyses. In the short term we will start with an incremental approach to the discovery of sequential patterns; in the longer term, we intend to experiment with a case-based reasoning approach. Very fast algorithms able to efficiently search for dependences between attributes (e.g. algorithms for mining association rules) or dependences between behaviours (algorithms for mining sequential patterns) within large databases already exist.
Unfortunately, even though these algorithms are very efficient, depending on the size of the database it can take up to several days to retrieve relevant and useful information. Furthermore, varying the parameters available to the user requires re-starting the algorithms without taking previous results into account. Similarly, when data is added to or removed from the base, it is often necessary to re-start the retrieval process to maintain the extracted knowledge.
Considering the size of the handled data, it is essential to propose an approach that is both interactive (parameter variation) and incremental (data variation in the base) in order to rapidly meet the needs of the end user.
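A minimal sketch of the incremental idea, restricted to single-item supports for brevity: stored counts are updated as new sessions arrive, and the frequency threshold can be varied afterwards without re-scanning the base. The class and its names are illustrative assumptions, not the project's design.

```python
# Incremental support maintenance sketch (single items only, illustrative).

class ItemCounter:
    def __init__(self):
        self.counts = {}
        self.n = 0

    def add_session(self, session):
        # Incremental step: only the new session is scanned.
        self.n += 1
        for item in set(session):
            self.counts[item] = self.counts.get(item, 0) + 1

    def frequent(self, minsup):
        # Interactive step: minsup (a fraction) can vary freely,
        # reusing the stored counts.
        return {i for i, c in self.counts.items() if c / self.n >= minsup}

c = ItemCounter()
for s in (["home", "docs"], ["home", "faq"]):
    c.add_session(s)
print(sorted(c.frequent(0.6)))  # ['home']
```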
This problem is currently regarded as an open research problem within the framework of data mining; existing solutions only address it partially.
Content and Structure Document Mining
With the increasing amount of available information, sophisticated tools for supporting users in finding useful information are needed. In addition to tools for retrieving relevant documents, there is a need for tools that synthesize and exhibit information that is not explicitly contained in the document collection, using document mining techniques. Document mining objectives also include extracting structured information from rough text.
The involved techniques are mainly clustering and classification. Our goal is to explore the possibilities of those techniques for document mining.
Classification aims at associating documents to one or several predefined categories, while the objective of clustering is to identify emerging classes that are not known in advance. Traditional approaches for document classification and clustering rely on various statistical models, and representation of documents are mostly based on bags of words.
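The bag-of-words representation mentioned above can be sketched as follows: each document becomes a vector of term counts over a shared vocabulary, which is the input to the statistical classifiers and clustering methods. The toy documents are illustrative assumptions.

```python
# A bag-of-words sketch: documents as term-count vectors (illustrative).
from collections import Counter

docs = {
    "d1": "xml retrieval xml clustering",
    "d2": "usage mining web usage",
}
# Shared vocabulary, sorted so every vector uses the same coordinates.
vocab = sorted({w for text in docs.values() for w in text.split()})

def bow(text):
    # Map a document to its vector of per-term counts.
    counts = Counter(text.split())
    return [counts.get(w, 0) for w in vocab]

print(vocab)
print(bow(docs["d1"]))
```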
Recently, much attention has been drawn towards using the structure of XML documents to improve information retrieval, classification and clustering, and more generally information mining. In the last four years, the INEX (INitiative for the Evaluation of XML retrieval) campaigns have focused on system performance in retrieving elements of documents rather than full documents, and have evaluated the benefits for end users. Other works are interested in clustering large collections of documents using representations that involve both the structure and the content of documents, or the structure only.
Approaches for combining structure and text range from adding a flat representation of the structure to the classical vector space model, or combining different classifiers for different tags or media, to defining a more complex structured vector model, possibly involving attributes and links.
When using the structure only, the objective is generally to organize large and heterogeneous collections of documents into smaller collections (clusters) that can be stored and searched more effectively. Part of the objective is to identify substructures that characterize the documents in a cluster and to build a representative of the cluster, possibly a schema or a DTD.
Since XML documents are represented as trees, the problem of clustering XML documents is the same as clustering trees. However, it is well known that algorithms working on trees have complexity issues. Therefore some models replace the original trees by structural summaries or s-graphs that only retain the intrinsic structure of the tree: for example, reducing a list of elements to a single element, flattening recursive structures, etc.
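The reduction to structural summaries can be sketched on trees represented as (tag, children) tuples: keeping one child per tag collapses a list of repeated elements to a single element. The tuple encoding and the example document are illustrative assumptions.

```python
# A structural-summary sketch: keep one child per tag, so repeated
# sibling elements collapse to a single element (illustrative).

def summarize(tag, children):
    seen = {}
    for child_tag, child_children in children:
        if child_tag not in seen:
            # First occurrence of this tag: recurse; later ones are dropped.
            seen[child_tag] = summarize(child_tag, child_children)
    return (tag, list(seen.values()))

# An <article> with one <title> and three <section> children.
doc = ("article", [("title", []), ("section", []),
                   ("section", []), ("section", [])])
print(summarize(*doc))
# ('article', [('title', []), ('section', [])])
```

Note that this is exactly the information loss discussed next: the summary no longer records how many `section` elements the original list contained.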
A common drawback of the approaches above is that they reduce documents to their intrinsic patterns (sub-patterns or summaries) and do not take into account an important characteristic of XML documents: the notion of a list of elements and, more precisely, the number of elements in those lists. While this may be fine for clustering heterogeneous collections, suppressing lists of elements may result in losing document properties that could be interesting for other types of XML mining.