Section: New Results
3D representation and compression of video sequences
3DTV and Free Viewpoint Video (FVV) are emerging video formats expected to offer an enhanced user experience. The 3D experience consists either of 3D relief rendering, called 3DTV, or of interactive navigation inside the content, called FTV (Free Viewpoint TV). 3D information can easily be computed for synthetic movies. However, no professional or consumer video camera capable of capturing the 3D structure of the scene is available so far (except, of course, for Z-camera prototypes). As a consequence, 3D information representing real content has to be estimated from acquired videos, using computer vision-based algorithms. This is the scope of the first research axis described below, which focuses on depth map extraction. Once the depth information has been extracted, the resulting videos and their associated depth maps must be coded to be stored or transmitted to the rendering device. 3D representations of the scene, as well as associated compression techniques, must therefore be developed. The choice of the 3D representation is of central importance. On the one hand, it sets the requirements for acquisition and signal processing. On the other hand, it determines the rendering algorithms, the degree and mode of interactivity, and the need and means for compression and transmission. This is the scope of the next two research axes below.
Depth map extraction for multi-view sequences
In this study, supported by the Futurim@ges project, we focus both on extending our 3D model-based video codec to multi-view image sequences and on post-processing the estimated 3D information to provide attractive videos free of the usual 3D artifacts (badly modelled occlusions, texture stretching, etc.).
For general multi-view image sequences, the scene structure can be estimated not only over time, if the camera is moving, but also over space. At time t, the N acquired views can be used to compute the 3D structure of the scene more easily, since the calibration of the camera bank is assumed to be known. We are now designing algorithms to integrate this space-based reconstruction with the classical time-based reconstruction, in order to produce higher quality 3D models.
Moreover, in the highly constrained context of registered fronto-parallel cameras, we proposed a depth map extraction algorithm based on state-of-the-art studies in stereo matching (depth map estimation from a pair of stereoscopic images). Here the camera poses do not need to be estimated. The depth of each pixel depends only on the horizontal motion field measured between neighbouring images (in the spatial sense). Our algorithm seeks this motion field by local matching, using gradient and intensity differences between images. This initial flow is then globally refined using a loopy Belief Propagation optimization and robust plane fitting, both constrained by a Mean-Shift segmentation of the input images in order to preserve the depth discontinuities.
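The local matching step and the disparity-to-depth relation can be illustrated with a minimal sketch. It is not the algorithm described above: it handles a single rectified scanline, uses a winner-takes-all choice, and omits the Belief Propagation refinement, plane fitting and Mean-Shift segmentation. The function names, the weight `alpha` and the cost formula are assumptions made for illustration; the relation Z = f * B / d is the standard one for rectified fronto-parallel cameras.

```python
def horizontal_gradient(row):
    """Central-difference gradient of a 1D intensity row (borders replicated)."""
    n = len(row)
    return [(row[min(i + 1, n - 1)] - row[max(i - 1, 0)]) / 2.0 for i in range(n)]


def match_scanline(left, right, max_disp, alpha=0.5):
    """Winner-takes-all disparity for each pixel of `left` on one scanline.

    Hypothetical cost (an assumption, not the authors' exact formula):
    cost = (1 - alpha) * |intensity difference| + alpha * |gradient difference|
    """
    grad_left = horizontal_gradient(left)
    grad_right = horizontal_gradient(right)
    disparities = []
    for x in range(len(left)):
        best_d, best_cost = 0, float("inf")
        # A pixel at x in the left view is searched at x - d in the right view.
        for d in range(0, min(max_disp, x) + 1):
            cost = ((1 - alpha) * abs(left[x] - right[x - d])
                    + alpha * abs(grad_left[x] - grad_right[x - d]))
            if cost < best_cost:
                best_d, best_cost = d, cost
        disparities.append(best_d)
    return disparities


def disparity_to_depth(d, focal_px, baseline):
    """Depth for rectified fronto-parallel cameras: Z = f * B / d."""
    return float("inf") if d == 0 else focal_px * baseline / d
```

Such a purely local estimate is noisy in textureless regions, which is why a global refinement step is needed afterwards.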
Once the 3D structure of the scene has been estimated (and stored as a pool of depth maps), it is post-processed using our Java-based software M3dEncoder2. One goal of this post-processing step is to generate auto-stereoscopic-ready videos, so that they can be displayed on available auto-stereoscopic screens (at present we target the Newsight™ and Philips™ 3D screens).
3D representation for multi-view video sequences based on a soup of polygons
This study is carried out in collaboration with France Telecom under a CIFRE contract. The aim is to study a new data representation for multi-view sequences. The data representation must allow real-time rendering of good quality images for different viewpoints, possibly virtual (i.e. non-acquired) viewpoints. Moreover, the representation should be compact and efficiently compressed, so as to limit the data overload compared with encoding each of the N video sequences using a traditional 2D video codec. In 2008, a new representation was proposed that takes multi-view video plus depth (MVD) as input and produces a polygon soup. A polygon soup is a set of polygons that are not necessarily connected to each other. Each polygon is defined with image plus depth data and rendered as a 3D primitive by a graphics processing unit (GPU). The polygons are actually quadrilaterals (quads), extracted with a quadtree decomposition of the depth maps. The advantages of using polygonal primitives instead of the point primitives traditionally used in other representations were shown. Many redundancies across the viewpoints were also removed, so as to obtain a compact representation.
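The quadtree decomposition of a depth map can be sketched as follows. This is a minimal illustration under stated assumptions, not the actual construction: the splitting criterion (depth range within a block compared with a hypothetical threshold `max_range`), the function name and the requirement that the block size be a power of two are all chosen here for the example.

```python
def quadtree_quads(depth, x, y, size, max_range, min_size=1):
    """Recursively split a square block of a depth map into quads.

    A block is kept as one quad when its depth values span at most
    `max_range` (assumed criterion), otherwise it is split into four
    sub-blocks. `size` is assumed to be a power of two.
    Returns a list of (x, y, size) tuples.
    """
    block = [depth[j][i] for j in range(y, y + size) for i in range(x, x + size)]
    if size <= min_size or max(block) - min(block) <= max_range:
        return [(x, y, size)]
    half = size // 2
    quads = []
    for dy in (0, half):
        for dx in (0, half):
            quads += quadtree_quads(depth, x + dx, y + dy, half, max_range, min_size)
    return quads
```

Flat regions of the depth map are covered by a few large quads, while depth discontinuities produce many small ones, which is consistent with the observation that small quads dominate the data load.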
In 2009, the construction of the representation was first improved in order to reduce the number of quads and eliminate corona artifacts around depth discontinuities. The main idea of this improvement is to discard as many small quads as possible from all viewpoints, since these quads considerably increase the data load and potentially contain corona artifacts. Second, the rendering step was improved. When several quads overlap, a strategy is needed to determine which colour should be drawn. It was decided to adaptively merge the overlapping quads according to the distance between the desired viewpoint and the viewpoint associated with the quad being merged. This kind of merging is usually referred to as view-dependent texture mapping. Finally, a coding scheme was proposed for encoding the extracted polygon soup representation. This coding scheme takes advantage of the quadtree structure, performs a predictive coding of the depth information, and encodes all information using Context Adaptive Binary Arithmetic Coding (CABAC). The performance of the coding scheme was compared with the encoding of the depth maps using JPEG2000.
The results show that three viewpoints can be represented with an increase of only 10
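The view-dependent merging of overlapping quads can be illustrated with a small sketch. The text only states that the merge depends on the distance between the desired viewpoint and the viewpoint of each quad; the inverse-distance weighting used here, as well as the function names and the `eps` parameter, are assumptions made for illustration.

```python
import math


def blend_weights(target_pos, source_positions, eps=1e-9):
    """Normalized weights, larger for source viewpoints closer to the target.

    Inverse-distance weighting is an assumed choice, not the authors' rule.
    `eps` avoids a division by zero when the target coincides with a source.
    """
    inverse = [1.0 / (math.dist(target_pos, p) + eps) for p in source_positions]
    total = sum(inverse)
    return [w / total for w in inverse]


def blend_colors(colors, weights):
    """Weighted average of RGB colours coming from overlapping quads."""
    return tuple(sum(w * c[k] for c, w in zip(colors, weights)) for k in range(3))
```

With this choice, a virtual viewpoint located at an acquired camera position reproduces that camera's colours almost exactly, and the contribution of the other views grows smoothly as the viewpoint moves away.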
3D representation for multi-view video sequences based on LDI representations
This study is carried out in collaboration with INSA/IETR (Luce Morin). A multi-view video is a collection of video sequences of the same scene, captured synchronously by several cameras at different locations. Associated with a view synthesis method, a multi-view video allows the generation of virtual views of the scene from any viewpoint. This property can be used in a wide range of applications, including Three-Dimensional TV (3DTV), Free Viewpoint Video (FVV), security monitoring, tracking and 3D reconstruction. The huge amount of data contained in a multi-view sequence motivates the design of efficient compression algorithms.
The compression algorithm strongly depends on the data representation, which in turn depends heavily on the view synthesis method. View synthesis approaches fall into two classes: Geometry-based rendering (GBR) and Image-based rendering (IBR). GBR methods use a detailed 3D model of the scene. These methods are useful with synthetic video data, but they become inadequate with real multi-view videos, for which 3D models are difficult to estimate. IBR approaches are an attractive alternative to GBR, since they allow the generation of photo-realistic virtual views. The Layered Depth Image (LDI) representation is one of these IBR approaches. In this representation, a pixel is no longer composed of a single color and a single depth value, but can contain several colors and associated depth values. This representation efficiently reduces the multi-view video size and offers fast photo-realistic rendering, even with complex scene geometry. Various approaches to LDI compression have been proposed, based on classical LDI layer constructions. The problem is that the generated layers are still correlated, and some pixels are redundant between layers. In 2009, we developed an Incremental LDI construction (I-LDI) method to reduce the inter-layer correlation. The number of layers is significantly reduced for an equivalent final rendering quality.
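The idea that an LDI pixel holds several colors with associated depths can be sketched with a small data structure. This is a generic illustration only, not the I-LDI construction: the class name `LayeredPixel`, the tolerance `tol` and the rule of rejecting a sample whose depth is close to an existing one are assumptions chosen to show how redundant samples could be kept out of the layers.

```python
from dataclasses import dataclass, field


@dataclass
class LayeredPixel:
    """One LDI pixel: a list of (depth, color) samples, sorted front to back."""
    samples: list = field(default_factory=list)

    def insert(self, depth, color, tol=0.1):
        """Add a sample unless one at (almost) the same depth already exists.

        Returns True if the sample was stored, False if it was considered
        redundant (assumed redundancy test: |depth difference| <= tol).
        """
        for existing_depth, _ in self.samples:
            if abs(existing_depth - depth) <= tol:
                return False
        self.samples.append((depth, color))
        self.samples.sort(key=lambda sample: sample[0])
        return True
```

The first sample of each pixel forms the first layer; additional samples, which describe surfaces occluded in the reference view, populate the extra layers.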
Techniques have also been designed to overcome visual artifacts such as sampling holes resulting from disocclusion, cracking effects and ghosting artifacts. The LDI representation itself is one solution to disocclusion: information about occluded texture is stored in extra layers and used during the rendering stage to fill disocclusion holes. Many cracking effects can be observed in the rendered texture, due to sampling and pixelization. The implemented solution makes use of interpolation (inpainting) techniques: detected cracks are filled with the median color estimated over a sliding window. Ghosting effects may appear around depth discontinuities, because pixels along object boundaries receive information from both foreground and background colors. An edge detector applied to the depth map, followed by a local foreground/background classification, makes it possible to isolate potentially blended pixels and reduce the ghosting effects. Tests on the Breakdancers and Ballet MVD data sets show that extra layers in I-LDI contain only 10% of first layer pixels, compared to 50% for LDI. I-LDI layers are also more compact, with a less spread pixel distribution, and thus easier to compress than LDI layers (see figure above). Visual rendering is of similar quality with I-LDI and LDI.
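The crack-filling step can be sketched as a median filter applied only at detected crack pixels. This is a minimal sketch under stated assumptions: it works on a single-channel (grayscale) image, takes the crack mask as given rather than detecting it, and the function name and the `radius` parameter are hypothetical.

```python
from statistics import median


def fill_cracks(image, is_crack, radius=1):
    """Replace each crack pixel by the median of valid pixels in its window.

    `image` is a list of rows of intensities, `is_crack` a mask of the same
    shape. The window is (2 * radius + 1) pixels wide, clipped at the borders.
    Crack pixels with no valid neighbour are left unchanged.
    """
    height, width = len(image), len(image[0])
    out = [row[:] for row in image]
    for y in range(height):
        for x in range(width):
            if not is_crack[y][x]:
                continue
            neighbours = [
                image[j][i]
                for j in range(max(0, y - radius), min(height, y + radius + 1))
                for i in range(max(0, x - radius), min(width, x + radius + 1))
                if not is_crack[j][i]
            ]
            if neighbours:
                out[y][x] = median(neighbours)
    return out
```

For color images, the same procedure could be applied per channel, although a per-channel median may produce colors that are not present in the neighbourhood.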