Section: Contracts and Grants with Industry
Interactive data-intensive workflows for scientific applications
Participants: Jean-Daniel Fekete [correspondent], Ioana Manolescu [INRIA GEMO Project Team], Véronique Benzaken.
Today's scientific data management applications involve huge and increasing data volumes. Data can be numeric (e.g., the output of measuring instruments); textual (e.g., corpora studied by social scientists, which may consist of news archives spanning several years); structured, as is the case for astronomy or physics data; or highly unstructured, as is the case for medical patient files. Data in all its forms is growing in volume, both because computers capture more and more of the work scientists used to do on paper and because automatic data gathering tools are becoming better and more powerful, e.g. space telescopes, focused crawlers, and archived experimental data (mandatory in some types of government-funded research programs).
The availability of such large data volumes is a gold mine for scientists, who can base their research on this data. More often than not, however, today's scientists rely on proprietary, ad-hoc information systems, consisting perhaps of a directory structure organized by hand, a few specialized data processing applications, and a few scripts.
For example, social scientists are interested in analyzing online social networks such as Wikipedia, where new forms of group organization emerge. Visualizing the hypertext network that connects articles requires accessing the hypertext data, computing some “shape” with which to visualize the network, and using visualization tools to navigate the representation effectively. We have designed the Zoomable Adjacency Matrix Explorer (ZAME [22]), which supports this exploration by computing a linear ordering of the articles contained in Wikipedia using a fast and complex dimension reduction algorithm (see figure 1). However, all the steps required to access the data, compute the ordering, store it for reuse, visualize it, and navigate the representation are carried out using ad-hoc methods that are very tedious to implement and out of reach of the sociologists interested in the study.
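To make the notion of a linear ordering for an adjacency matrix concrete, the following minimal sketch orders the nodes of a small link graph so that linked nodes end up close together. It deliberately swaps in a simple Cuthill-McKee-style breadth-first ordering; it is not the dimension reduction algorithm used by ZAME. The function names and the toy `links` graph are assumptions made for illustration only.

```python
from collections import deque


def bfs_ordering(adjacency):
    """Return a linear ordering of nodes via a Cuthill-McKee-style BFS.

    `adjacency` maps each node to the set of its neighbours (undirected graph).
    This is an illustrative stand-in, not ZAME's ordering algorithm.
    """
    def by_degree(node):
        # Sort key: degree first, then node name to keep the result deterministic.
        return (len(adjacency[node]), node)

    order, seen = [], set()
    # Start each connected component from its lowest-degree unvisited node.
    for start in sorted(adjacency, key=by_degree):
        if start in seen:
            continue
        seen.add(start)
        queue = deque([start])
        while queue:
            node = queue.popleft()
            order.append(node)
            # Enqueue unseen neighbours, lowest degree first.
            for neighbour in sorted(adjacency[node] - seen, key=by_degree):
                seen.add(neighbour)
                queue.append(neighbour)
    return order


def bandwidth(adjacency, order):
    """Largest gap between the positions of two linked nodes in `order`."""
    pos = {node: i for i, node in enumerate(order)}
    return max(
        (abs(pos[a] - pos[b]) for a in adjacency for b in adjacency[a]),
        default=0,
    )


# Hypothetical toy link graph: the chain a-d-c-b-e, stored in alphabetical order.
links = {"a": {"d"}, "b": {"c", "e"}, "c": {"b", "d"}, "d": {"a", "c"}, "e": {"b"}}
order = bfs_ordering(links)
```

On this toy graph, `order` is `['a', 'd', 'c', 'b', 'e']`, and the bandwidth drops from 3 (alphabetical order) to 1, meaning every link sits next to the diagonal of the reordered matrix. Even this small example involves the separate steps of loading, ordering, and evaluating that the paragraph above describes as tedious to assemble by hand.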
Off-the-shelf databases are not well adapted for scientific data management for several reasons.
First, database systems are not very flexible: changing the schema in a relational database management system (RDBMS) is very difficult, whereas exploratory usage of data routinely requires adding new dimensions to it, e.g., building summary categories to help the user tame the complexity and volume of the data. More flexible formats, such as XML or RDF, bring their own problems, which for the time being are mostly performance problems.
Second, database systems are tuned towards specific declarative search operations, typically expressed using a query language. In contrast, exploring scientific data involves operations such as clustering and finding interesting data orders, which cannot be specified based on stored attributes, but have to be discovered by complex, possibly iterative computations.
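The contrast can be illustrated with a minimal sketch, under the assumption of a hypothetical list of scalar measurements. The first statement is a declarative-style selection on a stored value; the grouping, by contrast, is not stored anywhere and has to be discovered by iterating to a fixed point. Lloyd's k-means algorithm is used here only as a familiar example of such an iterative computation; the function name, data, and initial centers are all illustrative.

```python
def kmeans_1d(values, centers, max_iter=100):
    """Lloyd's algorithm on scalar values; returns (centers, labels)."""
    centers = list(centers)
    labels = []
    for _ in range(max_iter):
        # Assignment step: attach each value to its nearest center.
        labels = [
            min(range(len(centers)), key=lambda k: abs(v - centers[k]))
            for v in values
        ]
        # Update step: move each center to the mean of its members
        # (a center with no members stays where it is).
        new_centers = []
        for k, center in enumerate(centers):
            members = [v for v, label in zip(values, labels) if label == k]
            new_centers.append(sum(members) / len(members) if members else center)
        if new_centers == centers:  # fixed point reached
            break
        centers = new_centers
    return centers, labels


# Hypothetical measurements (illustrative data only).
measurements = [1.0, 2.0, 3.0, 10.0, 11.0, 12.0]

# Declarative-style selection: expressible directly on the stored values.
large = [v for v in measurements if v > 5.0]

# Exploratory grouping: the clusters emerge from an iterative computation.
centers, labels = kmeans_1d(measurements, centers=[0.0, 8.0])
```

Here `centers` converges to `[2.0, 11.0]` with `labels` equal to `[0, 0, 0, 1, 1, 1]`. The threshold of the selection had to be known in advance, whereas the two groups were found by the computation itself.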
Finally, databases support query-based interactions but lack friendlier interfaces that would allow the user to inspect a large data set at varying levels of detail for different, dynamically specified subsets [30].
The purpose of the project is to investigate models and algorithms, and to propose an architecture for a system that helps scientists organize and make the most of their data. The research work spans three related, yet distinct areas, between which we expect it to build bridges: workflow modeling; database execution and optimization; and information visualization.