Participants: Laurent Amsaleg, Mathieu Ben, Sébastien Campion [correspondent], Patrick Gros, Pascale Sébillot.
Until 2005, we used various computers to store our data and to carry out our experiments. In 2005, we began work to specify and set up dedicated equipment for experiments on very large collections of data. During 2006 and 2007, we specified, bought and installed our first complete platform. It is organized around a very large storage capacity (up to 70 TB), and contains 4 acquisition devices (for Digital Terrestrial TV), 3 video servers, and 15 computing servers partially included in the local cluster architecture. In 2008, we acquired a new server with 96 GB of memory, which speeds up the building of indexes and language models. A memory upgrade was also carried out on the servers.
A dedicated website was developed in 2009 to provide user support. It contains useful information such as references to the ready-to-use software available on the cluster, the list of corpora stored on the platform, pages for monitoring disk space consumption and cluster load, tutorials on best practices, and cookbooks for processing large datasets.
The platform will be completed with dedicated software to manage all the metadata associated with the data.
In 2008, we built a corpus of multimedia data. It consists of a continuous six-month recording of two TV channels and three radio stations. It also includes web pages related to this content, captured from the broadcasters' websites. This corpus is to be used for various studies, such as the treatment of news over time, and to provide sub-corpora, such as TV news, within the Quaero project (see below). The manual annotation of all the TV programs is in progress.
This platform is funded by a joint effort of INRIA, INSA Rennes and the University of Rennes 1.