Section: New Results
Scene and camera reconstruction
Participants : Marie-Odile Berger, Srikrishna Bhat, Evren Imre, Nicolas Noury, Gilles Simon, Frédéric Sur.
On the theme of scene and camera reconstruction, we investigate both fully automatic methods and learning-based techniques for pose and structure recovery. Interactive techniques are also considered in order to obtain a well-structured description of the scene and to achieve the required robustness with the help of the user.
Structure from motion via a contrario models
Structure from motion problems call for probabilistic frameworks to meet robustness requirements. Features (e.g. points of interest) are extracted from images, then matched under projective constraints; this determines both the structure of the scene and the position of the camera. The problem is difficult since the positions of the features may not be accurately known, and matching may introduce false correspondences which can derail the reconstruction process. We aim at developing new probabilistic methods to tackle these problems. This year we focused on two directions. On the one hand, we studied a probabilistic a contrario model to incorporate point location uncertainty in a Ransac-like robust matching algorithm. On the other hand, we brought to completion our previous work on point of interest matching based on epipolar constraints and photometric consistency. The proposed algorithm gives good results in the presence of repeated patterns and strong viewpoint changes  .
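The a contrario principle can be illustrated on a toy robust-fitting problem. Instead of a fixed inlier-count threshold, a candidate model is validated only if its Number of False Alarms (NFA), the expected number of equally good models under a pure-chance background model, falls below 1. The sketch below (line fitting among 2D points in the unit square, not our point-matching model) is purely illustrative; the background inlier probability `p` is a rough assumption of this sketch.

```python
import math
import random

def nfa(n, k, p, n_tests):
    """A contrario Number of False Alarms: expected number of tested models
    with at least k inliers among n points if every point were an inlier
    purely by chance, with probability p (binomial tail bound)."""
    tail = sum(math.comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))
    return n_tests * tail

def ransac_line_a_contrario(points, tol=0.1, iters=500, seed=0):
    """RANSAC-like line fitting where a model is accepted only if NFA < 1,
    i.e. it is unlikely to arise from random points. Returns the
    (nfa, inliers) pair of the most meaningful model, or None."""
    rng = random.Random(seed)
    n = len(points)
    # rough chance that a uniform point in the unit square falls
    # within tol of a given line (strip of width 2*tol)
    p = 2 * tol
    best = None
    for _ in range(iters):
        (x1, y1), (x2, y2) = rng.sample(points, 2)
        if (x1, y1) == (x2, y2):
            continue
        # normalized line a*x + b*y + c = 0 through the two points
        a, b = y2 - y1, x1 - x2
        norm = math.hypot(a, b)
        a, b = a / norm, b / norm
        c = -(a * x1 + b * y1)
        inliers = [q for q in points if abs(a * q[0] + b * q[1] + c) <= tol]
        score = nfa(n, len(inliers), p, iters)
        if score < 1 and (best is None or score < best[0]):
            best = (score, inliers)
    return best
```

The appeal of the criterion is that the acceptance threshold (NFA below 1) has a direct probabilistic meaning, so no data-dependent inlier-count threshold needs to be tuned.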
Improved inverse-depth parameterization for SLAM
The monocular simultaneous localization and mapping (SLAM) problem involves estimating the locations of a set of landmarks in an unknown environment (mapping) as well as the camera pose, via the photometric measurements of these landmarks by a camera. Since the computational complexity of structure from motion techniques is deemed prohibitively high, the literature is dominated by extended Kalman filter (EKF) and particle filter (PF) based approaches. However, the non-Gaussianity of the depth estimate uncertainty degrades the performance of EKF-SLAM systems that use a 3D Cartesian landmark parameterization, especially in low-parallax configurations. The inverse depth parameterization (IDP) proposed in  alleviates this problem through a redundant representation. In addition, this approach successfully deals with the feature initialization problem in monocular SLAM. However, it is computationally expensive, and when a set of landmarks is initialized from the same image, it fails to enforce the common origin constraint. We thus proposed in  two improvements of the classical inverse-depth parameterization. The key idea is to factor out the common pose parameters when several landmarks are initialized from the same image. In the first extension (IDP1), only the position is factored out, whereas in the second (IDP2), both the position and the orientation are factored out. Experiments showed that IDP2 is superior to the classical IDP both in computational cost and in performance, whereas IDP1 delivers a similar performance at a much lower computational cost. This approach is also useful in particle filter based SLAM systems, as the landmarks are estimated with a Kalman filter  .
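For reference, the classical inverse depth parameterization encodes a landmark by the camera centre at first observation, a viewing direction (azimuth/elevation) and the inverse depth; the common-origin idea behind the proposed variants is then to store that camera centre once for all landmarks initialized from the same image. The sketch below illustrates this; the angle convention is an assumption of the sketch, not a transcription of our implementation.

```python
import math

def idp_to_euclidean(x0, y0, z0, theta, phi, rho):
    """Convert an inverse-depth landmark (camera centre (x0,y0,z0) at first
    observation, azimuth theta, elevation phi, inverse depth rho) to a
    Euclidean 3D point:  X = c + (1/rho) * m(theta, phi)."""
    m = (math.cos(phi) * math.sin(theta),
         -math.sin(phi),
         math.cos(phi) * math.cos(theta))
    return (x0 + m[0] / rho, y0 + m[1] / rho, z0 + m[2] / rho)

class SharedOriginLandmarks:
    """Landmarks initialized from the same frame share one origin (the idea
    behind the IDP1/IDP2 variants): the state stores the common camera
    centre once, plus (theta, phi, rho) per landmark, instead of repeating
    the 3 origin coordinates in every 6-parameter landmark."""
    def __init__(self, origin):
        self.origin = origin   # shared (x0, y0, z0)
        self.params = []       # per-landmark (theta, phi, rho)

    def add(self, theta, phi, rho):
        self.params.append((theta, phi, rho))

    def points(self):
        return [idp_to_euclidean(*self.origin, t, p, r)
                for (t, p, r) in self.params]
```

With k landmarks initialized from one image, the shared representation uses 3 + 3k state entries instead of 6k, which is where the computational saving of the factored variants comes from.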
Learning-based techniques for pose computation
Recent advances in object recognition based on local descriptors have shown the possibility of efficient image matching and retrieval from a database  and pave the way towards more robust methods for pose computation. Most approaches attempt to quantize the SIFT descriptors extracted from a set of images of the environment into clusters, called visual words, which are likely to represent a unique feature of the world. This year, we started to investigate the joint use of tracking-based methods and recognition methods with a view to handling large environments in AR applications.
Because Euclidean distance between SIFT descriptors fails to provide a good dissimilarity measure, we have devised a different way of forming visual words using transitive closure relationships: two SIFT features belong to the same visual word if their Euclidean distance is less than a given threshold. An object is then represented as a set of feature vectors instead of a single feature vector. Our experiments showed that this representation yields a noticeable improvement in detection robustness. Unfortunately, representing a word with a list of vectors is not scalable. We are thus investigating methods to find a suitable distance measure from the visual words obtained on a short video of the environment where the AR application has to take place. Our objective is to obtain specific representations which can be efficiently matched.
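The transitive-closure grouping described above amounts to computing connected components of the "distance below threshold" graph, which a union-find structure handles directly. The sketch below is illustrative (a real system would use an index rather than the O(n²) pairwise scan, and SIFT descriptors are 128-dimensional rather than 2-dimensional).

```python
def visual_words(descriptors, threshold):
    """Group descriptors into visual words by the transitive closure of
    'Euclidean distance below threshold': two features land in the same
    word if they are linked by a chain of below-threshold distances.
    Illustrative union-find sketch with an O(n^2) pairwise scan."""
    n = len(descriptors)
    parent = list(range(n))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    def dist2(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    t2 = threshold * threshold
    for i in range(n):
        for j in range(i + 1, n):
            if dist2(descriptors[i], descriptors[j]) <= t2:
                parent[find(i)] = find(j)  # union the two components

    words = {}
    for i in range(n):
        words.setdefault(find(i), []).append(i)
    return list(words.values())
```

Note how the chaining effect works both ways: it merges features whose pairwise distance exceeds the threshold (the robustness gain), but it is also why a word can grow into a long, unscalable list of vectors.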
Online reconstruction for AR tasks
Acquiring the 3D geometry of arbitrary scenes has been a primary objective of both the computer vision and graphics communities for many decades. Applications are numerous in various domains such as construction, GIS and 3D maps, virtual tours, visual effects and AR. Existing modeling methods usually rely on two separate stages. First, some data about the scene (photographs, videos, laser measurements, ...) are acquired on-site. Then these data are processed off-line using specific manipulations and algorithms. Unfortunately, this process can be time-consuming and tedious. Moreover, there is no guarantee after the first stage that the required model is fully extractable from the acquired data, and additional acquisitions are sometimes needed to supplement the missing parts. We thus propose to bridge the gap between data acquisition and its exploitation. A purely image-based system has been developed, which allows a user to interactively capture the 3D geometry of a polyhedral scene with the aid of its physical presence  .
This system can be seen as an immersive version of the widely used 3D drawing software Google SketchUp (http://sketchup.google.com ). This software combines some of the features of pencil-and-paper sketching and some of the features of CAD systems to provide a lightweight, gesture-based interface for 3D polyhedral modeling. By indicating two orthogonal vanishing points, the user is able to align the world axes to match a photo perspective. With this done, he can create models using the photo as a direct reference; mouse strokes are converted into 3D space using inverse ray intersection with the previously defined geometry, or with the ground plane by default. These principles have been taken up in our implementation, but with the crucial difference that we consider dynamic video images instead of static ones. The system alternates between two operating modes: (i) a modeling mode where the scene geometry is defined by applying pure rotations to the camera and using an eye cursor to perform point-and-click operations, and (ii) a tracking mode, where 6 degrees-of-freedom camera tracking is performed based on the available geometry, enabling the user to move closer to some parts of the scene or make new faces visible before continuing to model. Switches between these two modes are done automatically using Akaike's model selection. As a result, we get a user-friendly interface which is particularly suitable for mobile devices such as PDAs and mobile phones.
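The default case of the inverse ray intersection mentioned above, snapping a mouse click onto the ground plane, reduces to a simple ray/plane intersection. The sketch below assumes a ray already expressed in world coordinates (camera origin plus viewing direction); computing that ray from pixel coordinates and camera intrinsics is omitted.

```python
def ray_ground_intersection(origin, direction, ground_z=0.0):
    """Back-project a mouse click: intersect the viewing ray
    origin + t * direction with the ground plane z = ground_z,
    the default geometry a click is snapped to in SketchUp-style
    modeling. Returns None if no valid intersection exists."""
    ox, oy, oz = origin
    dx, dy, dz = direction
    if abs(dz) < 1e-12:
        return None  # ray parallel to the ground plane
    t = (ground_z - oz) / dz
    if t <= 0:
        return None  # intersection behind the camera
    return (ox + t * dx, oy + t * dy, ground_z)
```

Once faces have been modeled, the same back-projection is run against those faces first, falling back to the ground plane only when no face is hit.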
Participants : René Anxionnat, Marie-Odile Berger, Erwan Kerrien, Nicolas Padoy, Pierre-Frédéric Villard.
Simulation for planning the embolization of intracranial aneurisms
The endovascular treatment for an intracranial aneurism consists in filling the aneurismal cavity by placing coils. These are long, thin platinum springs that, once deployed, wind into a compact ball. Considering the location of the lesion, close to the brain, and its small size, a few millimeters, the interventional gesture requires careful planning and can only be performed by a very experienced surgeon. A simulation tool for the intervention, available in the operating room, reliable, and adapted to the patient's anatomy and physiology, would help to plan the coil placement, rehearse the procedure, and improve training in the technique.
Our research activity is carried out in collaboration with the Alcove project-team at INRIA Lille-Nord Europe and the Department of Interventional Neuroradiology at the University Hospital of Nancy. It started in 2007 with the SIMPLE project (an INRIA collaborative research initiative (ARC)) and was pursued this year in the context of the SOFA-InterMedS initiative (an INRIA large-scale initiative action (AE)).
Our task consists in providing precise in-vivo data about the patient, in particular an accurate geometric model of the patient's arterial wall. Despite the very high quality of the available 3D images (3D rotational angiography), tomographic reconstruction artefacts perturb the isosurface that should correspond to the arterial wall. Taking this isosurface as an initialization, we proposed to improve it within an active surface framework, in which the arterial wall is deformed until its X-ray projection fits a set of registered 2D angiographic images of the patient.
This year, we first addressed the validation of our model on both silicone phantoms and actual patient data. Our models were used in conjunction with a first prototype developed by the Alcove project-team on the SOFA software platform (http://www.sofa-framework.org ) to simulate coil deployment under various conditions on real patient data. The methodology we followed, the clinical metrics we designed and the results of our investigations were presented in major conferences, both in medicine  and in medical imaging  .
However, our algorithm produces a triangulated mesh model of the arterial wall, which forces a difficult compromise on the simulation side between real-time processing and the physical realism of the coil behavior. We therefore started to investigate the implicit modeling of blood vessels in 3D, in particular with radial basis functions (RBF), during Pierre Glanc's engineering internship. This work will be pursued by a PhD student, under the joint supervision of the Magrit and Alcove teams, whom we shall welcome at the end of this year. The major axes of research concern the design of the RBF profile function, the fitting of the model to the data, and the compactness of the model, in order to ensure both real-time simulation and geometric accuracy.
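As background, an RBF implicit model represents the vessel wall as the zero level set of f(x) = Σ w_i φ(|x − c_i|), with weights obtained by solving a linear interpolation system over on-surface (f = 0) and signed off-surface (f = ±1) constraints. The sketch below uses a Gaussian profile in 2D for brevity; the profile function, the constraint placement and any regularization or polynomial term are exactly the design choices under study, so nothing here reflects our final model.

```python
import math

def fit_rbf(centers, values, eps=1.0):
    """Fit an implicit function f(x) = sum_i w_i * phi(|x - c_i|) with a
    Gaussian profile phi(r) = exp(-(eps*r)^2), interpolating the given
    signed values at the centres. Pure-Python sketch: the dense system
    A w = values is solved by Gaussian elimination with pivoting."""
    phi = lambda r: math.exp(-(eps * r) ** 2)
    n = len(centers)
    # augmented interpolation matrix [A | values]
    A = [[phi(math.dist(centers[i], centers[j])) for j in range(n)]
         + [values[i]] for i in range(n)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(A[r][col]))
        A[col], A[piv] = A[piv], A[col]
        for r in range(col + 1, n):
            f = A[r][col] / A[col][col]
            for c in range(col, n + 1):
                A[r][c] -= f * A[col][c]
    w = [0.0] * n
    for r in range(n - 1, -1, -1):
        w[r] = (A[r][n] - sum(A[r][c] * w[c] for c in range(r + 1, n))) / A[r][r]

    def f(x):
        return sum(w[i] * phi(math.dist(centers[i], x)) for i in range(n))
    return f
```

The trade-off the text mentions is visible here: the number and placement of centres directly controls both the evaluation cost at simulation time (compactness) and the geometric fidelity of the zero level set.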
Surgical workflow analysis
The focus of this work is the development of statistical methods for modeling and monitoring surgical processes, based on signals available in the operating room. Previous work in the domain of activity recognition has addressed problems such as the identification of isolated actions or of well-defined interactions among objects in a scene. In this work, we address the activity recognition problem in the context of a workflow: activities follow a well-defined structure over a long period of time and can be semantically grouped into relevant phases. The main characteristics of the phase recognition problem are the temporal dependencies between phases and their highly varying durations. We have addressed the problem of recognizing phases based on example recordings. We have proposed to use Workflow-HMMs, a form of HMMs augmented with phase probability variables that model the complete workflow process  . This model takes into account the full temporal context, which improves on-line recognition of the phases, especially in the case of partial labeling. Targeted applications are workflow monitoring in hospitals and factories, where common action recognition approaches are difficult to apply. To avoid interfering with the normal workflow, we capture the activity of a room with a multiple-camera system. Additionally, we rely on real-time low-level features (3D motion flow) to keep the approach generic. Our method has been successfully demonstrated on sequences of medical procedures performed in a mock-up operating room. The sequences followed a complex workflow containing various alternatives.
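The on-line part of phase recognition rests on HMM filtering: at each frame, the posterior over phases is updated from the observation likelihoods and the phase transition structure. The sketch below shows a plain forward pass with a left-to-right transition matrix; it illustrates the filtering step only and is not the Workflow-HMM itself, which additionally augments the state with phase probability variables.

```python
def online_phase_probs(obs_lik, trans, init):
    """Forward (filtering) pass of an HMM for on-line phase recognition:
    obs_lik[t][i] is the likelihood of frame t's features under phase i,
    trans is the phase transition matrix, init the initial distribution.
    Returns the normalized phase posterior at each frame."""
    n = len(init)
    out = []
    alpha = None
    for t, lik in enumerate(obs_lik):
        if t == 0:
            alpha = [init[i] * lik[i] for i in range(n)]
        else:
            # predict through the transition model, then weight by evidence
            alpha = [sum(alpha[j] * trans[j][i] for j in range(n)) * lik[i]
                     for i in range(n)]
        s = sum(alpha)
        alpha = [a / s for a in alpha]
        out.append(alpha[:])
    return out
```

A left-to-right transition matrix encodes the workflow ordering (a phase can persist or advance, never go back), which is what lets the temporal context disambiguate visually similar frames.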
Modeling face and vocal tract dynamics
Participants : Michael Aron, Marie-Odile Berger, Erwan Kerrien, Ting Peng, Blaise Potard, Brigitte Wrobel-Dautcourt.
Being able to produce realistic facial animation is crucial for many speech applications in language learning technologies. In order to achieve realism, it is necessary to acquire 3D models of the face and of the internal articulators (tongue, palate, ...) from various image modalities.
A shape-based variational framework for curve segmentation
MRI provides a convenient and powerful tool for observing the internal articulators involved in speech production. In this study, we acquired 3D MRI data for a set of articulations from different speakers. With the help of the tongue model of a reference speaker, we aim to extract tongue contours from mid-sagittal images of a new speaker, and then to build his/her tongue model. This will enable us, in the future, to compare tongue models between speakers and to explore how to adapt the reference speaker's tongue model to the new speaker. To this end, we have proposed a shape-based variational framework for curve evolution, applied to the segmentation of tongue contours in MRI mid-sagittal images. The method starts with the construction of a PCA model of the tongue contours of different articulations of a reference speaker. Tongue contours for a new speaker are constrained to belong to this shape space. An objective function is defined which integrates both global and local image information: the global term roughly extracts the object over the whole image domain, while the local term improves precision in a small neighborhood around the contour. Promising results on several speakers' MRI data and comparisons with other approaches demonstrated the efficiency of our new model  .
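Constraining a contour to a PCA shape space amounts to projecting it onto the model's modes and bounding the mode coefficients (commonly to a few standard deviations of the training set). The sketch below shows that projection step on flattened contour vectors; the mode limits and the orthonormality assumption are conventions of this sketch.

```python
def project_to_shape_space(contour, mean, modes, limits):
    """Constrain a contour (flattened coordinate vector) to a PCA shape
    space: project the deviation from the mean onto each orthonormal
    mode, clamp the coefficient b_k to +/- limits[k] (e.g. 3 standard
    deviations), and reconstruct mean + sum_k b_k * mode_k."""
    n = len(mean)
    diff = [contour[i] - mean[i] for i in range(n)]
    b = []
    for k, mode in enumerate(modes):
        bk = sum(diff[i] * mode[i] for i in range(n))  # projection coefficient
        b.append(max(-limits[k], min(limits[k], bk)))  # stay in the shape space
    return [mean[i] + sum(b[k] * modes[k][i] for k in range(len(modes)))
            for i in range(n)]
```

In the segmentation loop, the evolving curve is re-projected this way after each update, so the result always remains a plausible tongue shape from the reference speaker's model.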
Modeling the vocal tract
Our long term objective is to provide intuitive and near-automatic tools for building a dynamic 3D model of the vocal tract from various image and sensor modalities (MRI, ultrasound (US), video, magnetic sensors ...).
Combining several modalities requires all geometrical and temporal data to be mutually consistent. It also requires defining appropriate image processing techniques to extract the articulators (tongue, palate, lips, ...) from the data. A fast, low-cost and easily reproducible acquisition system had been designed in previous work to temporally align the data  . This year, we focused on the problem of fusing image modalities  . As 3D measurements of the face can be extracted from both MRI and stereoscopic images, the MRI and video sequences were registered with an iterative closest point algorithm. All the modalities were then registered using EM sensors glued on the US probe and placed under the speaker's ears. As a result, dynamic articulatory data including points on the lips, the tongue and the palate are now available. These data were used very recently to perform articulatory inversion. To the best of our knowledge, this is the first work that demonstrates the potential of fusing static and dynamic data in the construction of articulatory databases.
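The iterative closest point (ICP) registration mentioned above alternates two steps: match each source point to its closest target point, then solve for the rigid transform that best aligns the matched pairs. The sketch below is restricted to the translation component for brevity (the full algorithm also estimates rotation, typically via an SVD of the cross-covariance of the matched pairs), so it illustrates the alternation rather than our registration pipeline.

```python
def icp_translation(source, target, iters=20):
    """Minimal translation-only ICP sketch: alternate (1) closest-point
    matching against the target set and (2) the least-squares translation
    update, which is simply the mean residual of the matched pairs."""
    src = [list(p) for p in source]
    dim = len(src[0])
    for _ in range(iters):
        # step 1: closest target point for each source point
        matches = [min(target, key=lambda q: sum((a - b) ** 2
                                                 for a, b in zip(p, q)))
                   for p in src]
        # step 2: optimal translation = mean of the residuals
        t = [sum(m[d] - p[d] for p, m in zip(src, matches)) / len(src)
             for d in range(dim)]
        src = [[p[d] + t[d] for d in range(dim)] for p in src]
    return src
```

As with full ICP, convergence is only local: the initial alignment (here provided by the temporal synchronization and the EM sensors) must be good enough for the closest-point matches to be mostly correct.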
We also addressed this year the problem of assessing the quality of the fused data. This amounts to evaluating the uncertainty of each transformation used to align the data in a common coordinate system. Monte Carlo statistical methods were used to estimate the uncertainty of this complex registration process: starting from the uncertainty on the sensors and on the image features, we are able to estimate the accuracy on each articulator through exhaustive sampling and propagation techniques  . This study enabled us to isolate the major sources of error in the registration process: not surprisingly, the EM sensors are an important factor, but the US resolution was also found to be critical.
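The Monte Carlo propagation scheme can be illustrated on a single transformation: sample its parameters from Gaussians around their nominal values, push a point of interest through each sample, and read off the spread of the results. The sketch below uses a 2D rigid transform for brevity; the registration chain in our system composes several such transformations.

```python
import math
import random

def apply_transform(params, p):
    """2D rigid transform: rotation by theta, then translation (tx, ty)."""
    theta, tx, ty = params
    c, s = math.cos(theta), math.sin(theta)
    return (c * p[0] - s * p[1] + tx, s * p[0] + c * p[1] + ty)

def monte_carlo_spread(params, sigmas, p, n=5000, seed=0):
    """Monte Carlo uncertainty propagation sketch: sample the transform
    parameters with the given standard deviations around their nominal
    values, map the point p, and return the per-axis standard deviation
    of the mapped point."""
    rng = random.Random(seed)
    xs, ys = [], []
    for _ in range(n):
        noisy = [m + rng.gauss(0.0, s) for m, s in zip(params, sigmas)]
        x, y = apply_transform(noisy, p)
        xs.append(x)
        ys.append(y)

    def std(v):
        mu = sum(v) / len(v)
        return (sum((a - mu) ** 2 for a in v) / len(v)) ** 0.5
    return std(xs), std(ys)
```

Repeating this per error source (sensor noise, image feature localization) and per articulator is what allows the dominant contributions, such as the EM sensors and the US resolution, to be ranked.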
Realistic face animation
Achieving realism in facial animation requires acquiring and animating dense 3D models of the face, which are often obtained with 3D scanners. However, capturing the dynamics of speech from 3D scans is difficult, as the acquisition time generally allows only sustained sounds to be recorded. On the contrary, acquiring the speech dynamics on a sparse set of points is easy using a stereovision system recording a speaker with markers painted on his/her face. We have proposed in  an approach to animate a very realistic dense talking head which makes use of a reduced set of dense 3D meshes acquired for sustained sounds, as well as the speech dynamics learned on a speaker equipped with painted markers. Our contributions are twofold. We first proposed an appropriate principal component analysis (PCA) with missing data in order to compute the basic modes of the speech dynamics despite possibly unobservable points in the sparse meshes obtained by the stereovision system. A densification method was then proposed to compute dense modes for spatial animation from the sparse modes learned with the stereovision system.
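The core difficulty of PCA with missing data is that occluded markers leave holes in the data matrix, so the modes cannot be obtained from a plain eigendecomposition. A common workaround, shown below for a single mode on already-centred data, is alternating least squares restricted to the observed entries; this is an illustrative stand-in for the technique, not our algorithm.

```python
def rank1_missing(X, iters=100):
    """Rank-1 factorization X ~ u v^T with missing entries (None):
    alternating least squares over the observed entries only, assuming
    already-centred data. Missing entries are then predicted by the
    reconstruction u[i] * v[j]."""
    m, n = len(X), len(X[0])
    u = [1.0] * m
    v = [1.0] * n
    for _ in range(iters):
        # update u with v fixed (per-row least squares on observed entries)
        for i in range(m):
            num = sum(X[i][j] * v[j] for j in range(n) if X[i][j] is not None)
            den = sum(v[j] ** 2 for j in range(n) if X[i][j] is not None)
            u[i] = num / den
        # update v with u fixed (per-column least squares)
        for j in range(n):
            num = sum(X[i][j] * u[i] for i in range(m) if X[i][j] is not None)
            den = sum(u[i] ** 2 for i in range(m) if X[i][j] is not None)
            v[j] = num / den
    return [[u[i] * v[j] for j in range(n)] for i in range(m)]
```

Stacking several such modes (fitting each one on the residual of the previous) yields a sparse-mode basis; the densification step then lifts these sparse modes to the dense meshes.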