Team flowers

Section: New Results

Multimodal self-supervised acquisition of language from unsegmented audio-video streams

Participants: Louis ten Bosch, David Filliat, Pierre-Yves Oudeyer.

Through the visit of invited researcher Louis ten Bosch, from Radboud University, the Netherlands, we have developed a computational framework that allows a machine to learn to recognize new acoustic words and new visual objects, as well as their associations, from an initially unsegmented flow of audio and video, with no prior knowledge of phonetics or global object shapes. This makes it possible to reproduce some properties of language acquisition by human infants in their first years. The approach relies on two ingredients: non-negative matrix factorization, and audio and video encodings based on bag-of-words local representations (histograms of local spectral descriptors for audio and of SIFT descriptors for video). The training data are unsegmented and unlabelled pairings of acoustic waves, each potentially comprising several words in a sentence, with images, each potentially comprising several visual objects. From these observations, non-negative matrix factorization finds sparse decompositions of the audio-video flow. These decompositions later allow the machine to retrieve an image containing the object denoted by an acoustic wave or, conversely, to reconstruct the acoustic features of the word for an object given an image of it. Furthermore, we have been able to use incremental versions of non-negative matrix factorization in this setup. An article describing these techniques and experiments is being written.
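
The cross-modal retrieval described above can be illustrated with a minimal sketch. It is not the implementation used in this work: it assumes synthetic histograms in place of real spectral and SIFT bag-of-words features, a batch (not incremental) factorization, and the standard Lee-Seung multiplicative updates for the Frobenius loss, which may differ from the cost function actually used. All names (nmf, infer_activations, audio_proto, video_proto) are hypothetical. Each training column stacks the audio histogram of an utterance on top of the visual histogram of the accompanying image; after factorization, the audio rows of the dictionary are used to infer activations for a new utterance, and the video rows turn those activations into a predicted visual histogram.

```python
import numpy as np

def nmf(V, rank, n_iter=1000, seed=0, eps=1e-9):
    """Factorize V ~= W @ H (all non-negative) with multiplicative updates."""
    rng = np.random.default_rng(seed)
    W = rng.random((V.shape[0], rank)) + eps
    H = rng.random((rank, V.shape[1])) + eps
    for _ in range(n_iter):
        H *= (W.T @ V) / (W.T @ W @ H + eps)
        W *= (V @ H.T) / (W @ H @ H.T + eps)
    return W, H

def infer_activations(W_part, v_part, n_iter=1000, eps=1e-9):
    """Estimate non-negative activations for one observation, dictionary fixed."""
    h = np.full(W_part.shape[1], 1.0 / W_part.shape[1])
    for _ in range(n_iter):
        h *= (W_part.T @ v_part) / (W_part.T @ W_part @ h + eps)
    return h

# Synthetic stand-in data (assumption): each "concept" is a word/object pair
# with one audio histogram prototype and one visual histogram prototype.
rng = np.random.default_rng(1)
n_concepts, d_audio, d_video, n_scenes = 4, 30, 40, 200
audio_proto = rng.random((d_audio, n_concepts)) ** 4
video_proto = rng.random((d_video, n_concepts)) ** 4

# Each scene contains one or two concepts; no segmentation or labels are kept.
present = np.zeros((n_concepts, n_scenes))
for j in range(n_scenes):
    k = rng.integers(1, 3)
    present[rng.choice(n_concepts, size=k, replace=False), j] = 1.0

# Stack audio and video histograms so each column is one audio-video observation.
V = np.vstack([audio_proto @ present, video_proto @ present])
W, H = nmf(V, rank=n_concepts)
W_audio, W_video = W[:d_audio], W[d_audio:]

# Cross-modal retrieval: hear one word alone, predict what its object looks like.
query = 2
h = infer_activations(W_audio, audio_proto[:, query])
predicted_video = W_video @ h

# Rank candidate objects by cosine similarity with the predicted histogram.
norms = np.linalg.norm(video_proto, axis=0) * np.linalg.norm(predicted_video)
scores = (video_proto.T @ predicted_video) / (norms + 1e-9)
best_match = int(np.argmax(scores))
print("queried word:", query, "retrieved object:", best_match)
```

The same procedure run in the other direction (inferring activations from the video rows and reconstructing with the audio rows) gives the acoustic features of a word from an image of its object.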

