Section: New Results

Database summaries

DBMS technology has become mature and ubiquitous in information systems. Over time, its extensive use has had major consequences in large organizations: the production of very large databases, the production of heterogeneous databases, and a growing number of diverse applications that need to access those very large, heterogeneous databases. This creates difficult technical problems, which get worse as DBMS technology improves and makes it easier to produce very large, heterogeneous databases. The SaintEtiQ system provides a novel solution for representing, querying and accessing large databases. We recently completed our work on summary querying techniques and on decision support systems. We also pursued our work on summary management over P2P systems.

Summary query evaluation

Participants: Noureddine Mouaddib, Guillaume Raschia, Amenel Voglozin.

We proposed a querying mechanism that lets users efficiently exploit the hierarchical summaries produced by SaintEtiQ. The first idea is to query the summaries with their own vocabulary, taking advantage of their hierarchical organization [24]. Query evaluation matches the summaries in the tree against the fuzzy selection predicates of the query. The algorithm performs boolean set comparisons and uses the tree structure to cut branches and prune the search space. This leads to significant gains in response time, in particular in the case of null answers (i.e., an empty result set), since only a small part of the summary hierarchy has to be explored rather than the entire database.
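The pruning idea described above can be illustrated with a minimal sketch. It is not the SaintEtiQ algorithm itself: the `Summary` class, the `evaluate` function and the example descriptors are hypothetical, and the sketch assumes that each node stores, for every attribute, the set of descriptors covering its records.

```python
from dataclasses import dataclass, field


@dataclass
class Summary:
    """Hypothetical summary node: attribute name -> set of descriptors."""
    intent: dict
    children: list = field(default_factory=list)


def evaluate(node, query, visited):
    """Return the most general summaries whose records all satisfy the query.

    `query` maps an attribute to the set of accepted descriptors
    (disjunction inside an attribute, conjunction across attributes).
    `visited` records every node that was inspected.
    """
    visited.append(node)
    # No overlap on some attribute: nothing below can match, cut the branch.
    for attr, wanted in query.items():
        if not (node.intent[attr] & wanted):
            return []
    # Every descriptor of the node is accepted: the whole subtree matches.
    if all(node.intent[attr] <= wanted for attr, wanted in query.items()):
        return [node]
    # Partial overlap: the answer, if any, lies deeper in the hierarchy.
    matches = []
    for child in node.children:
        matches.extend(evaluate(child, query, visited))
    return matches


# Small hand-built hierarchy (assumed data, for illustration only).
a1 = Summary({"age": {"young"}, "income": {"low"}})
a2 = Summary({"age": {"young"}, "income": {"high"}})
a = Summary({"age": {"young"}, "income": {"low", "high"}}, [a1, a2])
b = Summary({"age": {"adult", "old"}, "income": {"high"}})
root = Summary({"age": {"young", "adult", "old"}, "income": {"low", "high"}}, [a, b])
```

In this toy hierarchy, a query on `{"age": {"young"}}` stops at node `a` without visiting its children, and a query containing a descriptor that no record carries is rejected at the root, which is the null-answer case mentioned above.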

As an extension of this work, we proposed to formulate the query predicates with a free user vocabulary rather than with the summary descriptors. We studied query evaluation in this setting, including the mapping between user concepts and summaries, using the symbolic-numerical interface of fuzzy set theory [59].
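One possible reading of this mapping is sketched below, under stated assumptions: user concepts and summary descriptors are both modelled as trapezoidal fuzzy sets over the same numeric domain, and a user concept is rewritten into the descriptors it overlaps, weighted by a possibility degree (maximum of the pointwise minimum, sampled on a grid). The names `trapezoid`, `possibility`, `rewrite` and the descriptor definitions are hypothetical and are not taken from [59].

```python
def trapezoid(a, b, c, d):
    """Membership function of a trapezoidal fuzzy set with support [a, d] and core [b, c]."""
    def mu(x):
        if b <= x <= c:
            return 1.0
        if x <= a or x >= d:
            return 0.0
        if x < b:
            return (x - a) / (b - a)
        return (d - x) / (d - c)
    return mu


def possibility(mu_user, mu_descriptor, grid):
    """Degree of overlap between two fuzzy sets, sampled on `grid`."""
    return max(min(mu_user(x), mu_descriptor(x)) for x in grid)


def rewrite(mu_user, descriptors, grid):
    """Map a user concept to the summary descriptors it overlaps."""
    degrees = {}
    for label, mu in descriptors.items():
        degree = possibility(mu_user, mu, grid)
        if degree > 0.0:
            degrees[label] = degree
    return degrees


# Assumed summary vocabulary on the attribute "age" (illustration only).
AGE_DESCRIPTORS = {
    "young": trapezoid(0, 0, 25, 35),
    "adult": trapezoid(25, 35, 55, 65),
    "old": trapezoid(55, 65, 120, 120),
}
AGE_GRID = range(0, 121)

# A user concept that is not part of the summary vocabulary.
thirty_something = trapezoid(28, 30, 39, 42)
mapping = rewrite(thirty_something, AGE_DESCRIPTORS, AGE_GRID)
```

The resulting descriptors could then be used as a predicate in the summary vocabulary, with the degrees available to rank or filter the answers.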

Querying summaries: multidimensional indexing

Participants: Noureddine Mouaddib, Guillaume Raschia, Amenel Voglozin.

We investigated the area of multidimensional indexing from the point of view of space-partitioning. Through its architectural aspects, a summary hierarchy shares many features with multidimensional indexes (R-Tree, UB-Tree, X-Tree, ...). Current work on flexible querying uses the hierarchy as an index to select the appropriate database records, since in multidimensional indexing, each selection criterion reduces the search space for the other criteria.

Thus, we proposed to use summary hierarchies from the SaintEtiQ system as an index structure for a PostgreSQL access method. The objective of this work is to study the feasibility of using summaries as indexes and to determine the parameters that affect the access method's performance. The study is limited to searching, because defining a fully functional access method is a tedious task: updates and inserts are not yet supported. The index file is a binary version of the XML file produced by the SaintEtiQ prototype. The tree structure is left unmodified so as to evaluate the prototype's output as faithfully as possible. Although a summary hierarchy is intended for a different purpose and is not optimized for querying, it provides acceptable response times for queries other than one-column queries. However, explaining the response times remains difficult. The immediate next step is to use larger data sets so as to make the influencing factors more distinct. Since no benchmark data set exists for evaluating multidimensional indexing techniques, we are working on generating random data with a variable search space occupation ratio. Tuning that ratio will help simulate real data. Once the performance factors are known, it will be possible to adapt the construction of summaries for the purpose of using them as an index structure.
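As an illustration of what such a generator might look like, the sketch below assumes that the search space is the grid formed by the descriptors of each attribute, and that the occupation ratio is the fraction of grid cells containing at least one record. This definition, and the names `generate` and `occupation_ratio`, are assumptions made for the example and do not come from the report.

```python
import itertools
import random


def generate(n_records, labels_per_dim, ratio, seed=0):
    """Draw records (tuples of label indexes) so that a fraction `ratio`
    of the grid cells is occupied."""
    rng = random.Random(seed)
    cells = list(itertools.product(*(range(k) for k in labels_per_dim)))
    n_occupied = max(1, round(ratio * len(cells)))
    if n_records < n_occupied:
        raise ValueError("not enough records to occupy the requested cells")
    occupied = rng.sample(cells, n_occupied)
    # One record per chosen cell guarantees the ratio; the remaining
    # records are spread at random over the same cells.
    records = list(occupied)
    records += rng.choices(occupied, k=n_records - n_occupied)
    rng.shuffle(records)
    return records


def occupation_ratio(records, labels_per_dim):
    """Fraction of grid cells that contain at least one record."""
    total_cells = 1
    for k in labels_per_dim:
        total_cells *= k
    return len(set(records)) / total_cells


# 1000 records over a 4 x 5 x 5 grid, a quarter of which is occupied.
data = generate(1000, (4, 5, 5), ratio=0.25)
```

Varying `ratio` between sparse and dense settings while keeping the number of records fixed would make it possible to observe how response time depends on the occupation of the search space.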

On-Line Analytical Processing of summaries

Participants: Lamiaa Naoum, Noureddine Mouaddib, Guillaume Raschia.

We proposed a general framework to explore and analyze database summaries built from massive data sets. Summaries are self-descriptive, higher-level views of groups of raw data. The overall on-line summarization process is intended to support a new approach to On-Line Analytical Processing of large data sets [12]. It aims at providing an effective and rich tool for visualizing, querying and accessing summaries, considered as compressed semantic views of raw data.

Our contributions are as follows. First, we defined a logical data model called summary partitions, by analogy with OLAP datacubes. The aim is to give the end-user an effective way of presenting a reduced version of the data set and to support analysis. Pre-built and ordered partitions are obtained from a process dedicated to the generation of summaries at different levels of granularity. Second, we defined a collection of algebraic operators over the space of summary partitions: relational, granularity and structuring operators are designed for on-line analytical processing of summarized versions of the data [50]. Third, we addressed the issue of representing the summary partitions, in particular to make the summaries as simple and informative as possible for the end-user. To achieve this, we tried to build fuzzy prototypes for the summaries, as a pre-visualization mechanism [49].

Summaries over a P2P architecture

Participants: Rabab Hayek, Noureddine Mouaddib, Guillaume Raschia, Patrick Valduriez.

We started to study the integration of a new service for managing summaries in P2P systems. In such a context, summaries have two main virtues. First, they can be directly queried and used to approximately answer a query without exploring the original data. Second, as semantic indexes, they support locating relevant nodes based on data content.

The first idea was to incrementally construct a global summary that describes all the data shared in the network. Distributed storage of such a global summary is, for instance, managed by a dedicated service, which peers call with the right global summary key. For a given query, the global summary is first used to determine the set of nodes having relevant data. Then, those nodes are contacted directly. Simulation results have shown that the cost of query routing is significantly reduced compared to flooding approaches. However, converging to and maintaining such a global summary is hard and costly in a P2P environment. Current work aims to identify a natural partitioning of unstructured networks into peer domains, each managing its own global summary. Our approach relies only on scale-free network properties such as the power-law degree distribution and the associated clustering coefficient distribution. The intra-domain links will be used as summary links (i.e., index links) to maintain the global summary, while the inter-domain links will be used as search links to propagate the query among domains. We aim at finding the number of domains that minimizes the total cost of query routing and summary maintenance.
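The routing principle can be sketched as follows, under strong simplifying assumptions: each domain summary is reduced to a flat index from descriptors to the peers that published them, and the query is propagated to every domain. The `DomainSummary` class, the `route` function and the peer names are hypothetical. The real summaries are hierarchical and their maintenance is not modelled here.

```python
from collections import defaultdict


class DomainSummary:
    """Hypothetical per-domain summary used as a semantic index."""

    def __init__(self):
        # (attribute, descriptor) -> peers of the domain holding such data
        self.index = defaultdict(set)

    def publish(self, peer, descriptors):
        """Merge the descriptors of a peer's local data into the summary."""
        for descriptor in descriptors:
            self.index[descriptor].add(peer)

    def relevant_peers(self, query):
        """Peers holding data for every descriptor of the query.

        This is a superset of the peers with matching records, so no
        relevant peer is missed; some contacted peers may return nothing.
        """
        peer_sets = [self.index.get(descriptor, set()) for descriptor in query]
        return set.intersection(*peer_sets) if peer_sets else set()


def route(query, domains):
    """Return, for each domain with relevant data, the peers to contact."""
    targets = {}
    for name, summary in domains.items():
        peers = summary.relevant_peers(query)
        if peers:
            targets[name] = peers
    return targets


# Two small domains (assumed data, for illustration only).
d1, d2 = DomainSummary(), DomainSummary()
d1.publish("p1", [("age", "young"), ("income", "low")])
d1.publish("p2", [("age", "old"), ("income", "high")])
d2.publish("p3", [("age", "young"), ("income", "high")])
domains = {"d1": d1, "d2": d2}

targets = route([("age", "young"), ("income", "high")], domains)
```

Here only `p3` is contacted, whereas flooding would reach all three peers. The trade-off studied in the text is that more domains make each summary cheaper to maintain but increase the number of summaries a query has to consult.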

