Section: New Results
Learning image and object models
Non-local Sparse Models for Image Restoration (J. Mairal, F. Bach, J. Ponce, joint work with G. Sapiro, University of Minnesota)
In this work, we unify two different approaches to image restoration. On the one hand, learning a basis set (dictionary) adapted to sparse signal descriptions has proven to be very effective in image reconstruction and classification tasks. On the other hand, explicitly exploiting the self-similarities of natural images has led to the successful non-local means approach to image restoration. We propose simultaneous sparse coding as a framework for combining these two approaches in a natural manner. This is achieved by jointly decomposing groups of similar signals on subsets of the learned dictionary. Experimental results in image denoising and demosaicking tasks with synthetic and real noise show that the proposed method outperforms the state of the art, making it possible to effectively restore raw images from digital cameras at a reasonable speed and memory cost.
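As an illustration of what "jointly decomposing groups of similar signals" can mean, simultaneous sparse coding is commonly written with a row-wise (grouped) sparsity penalty. The notation below is an assumption made for this sketch and is not taken from the text: X holds n similar patches as columns, D is the learned dictionary with k atoms, A is the coefficient matrix, and \alpha^{j} is the j-th row of A.

```latex
\min_{A \in \mathbb{R}^{k \times n}} \;
  \frac{1}{2} \lVert X - D A \rVert_F^2
  \; + \; \lambda \sum_{j=1}^{k} \lVert \alpha^{j} \rVert_2
```

Because the penalty acts on whole rows of A, a row is either used by all patches in the group or by none, so similar patches end up sharing the same small subset of dictionary atoms.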
Learning mid-level features for recognition (Y.-L. Boureau and J. Ponce, joint work with Y. LeCun, New York University)
Powerful handcrafted image descriptors developed in recent years have led to tremendous progress in recognition performance. They share common characteristics that can be described in a unified framework, as a three-stage pipeline of local feature extraction, pointwise non-linear transformation, and pooling of local features over some larger neighborhood. By contrast, the intermediate transformations that are then applied to the descriptors to form a suitable input for image classification are often cruder, typically variations of vector quantization. Using SIFT descriptors as input, we show that generalizing the three-stage pipeline to mid-level feature learning leads to state-of-the-art performance or better on several recognition benchmarks. Performance increases when switching from hard to soft vector quantization, to unsupervised sparse coding, and finally to discriminative sparse coding. Moreover, we compare average and max pooling empirically and theoretically. Finally, we show that jointly representing small neighborhoods of SIFT descriptors by a single feature improves performance. Overall, lifting restrictions on the form of intermediate features to keep the same flexibility as when learning low-level image descriptors leads to high gains in recognition performance.
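The coding and pooling choices compared above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the pipeline used in the work: the function names, the toy 2-D "descriptors", the two-word codebook, and the `beta` sharpness parameter are all hypothetical.

```python
import math

def sq_dists(x, codebook):
    """Squared Euclidean distance from descriptor x to every codeword."""
    return [sum((a - b) ** 2 for a, b in zip(x, c)) for c in codebook]

def hard_assign(x, codebook):
    """Hard vector quantization: one-hot code at the nearest codeword."""
    d = sq_dists(x, codebook)
    k = d.index(min(d))
    return [1.0 if j == k else 0.0 for j in range(len(codebook))]

def soft_assign(x, codebook, beta=1.0):
    """Soft vector quantization: softmax over negative squared distances.
    beta is a hypothetical sharpness parameter chosen for this sketch."""
    d = sq_dists(x, codebook)
    m = min(d)  # subtract the minimum for numerical stability
    w = [math.exp(-beta * (di - m)) for di in d]
    s = sum(w)
    return [wi / s for wi in w]

def pool(codes, mode="average"):
    """Pool a list of codes over a neighborhood, one value per codeword."""
    columns = list(zip(*codes))
    if mode == "max":
        return [max(col) for col in columns]
    return [sum(col) / len(col) for col in columns]

# Toy data (hypothetical): three 2-D descriptors and a two-word codebook.
codebook = [[0.0, 0.0], [1.0, 1.0]]
descriptors = [[0.1, 0.0], [0.9, 1.0], [0.2, 0.1]]

hard_codes = [hard_assign(x, codebook) for x in descriptors]
soft_codes = [soft_assign(x, codebook) for x in descriptors]
avg_feature = pool(hard_codes, "average")  # fraction of descriptors per codeword
max_feature = pool(hard_codes, "max")      # whether each codeword occurs at all
```

With hard codes, average pooling yields a normalized histogram of codeword counts, while max pooling only records whether a codeword was activated anywhere in the neighborhood.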
Reasoning About Object Relationships (A. Efros, with T. Malisiewicz, CMU)
The use of context is critical for scene understanding in computer vision, where the recognition of an object is driven by both local appearance and the object's relationship to other elements of the scene (context). Most current approaches rely on modeling the relationships between object categories as a source of context. In this paper we seek to move beyond categories to provide a richer appearance-based model of context. We present an exemplar-based model of objects and their relationships, the Visual Memex, that encodes both local appearance and 2D spatial context between object instances. We evaluate our model on Torralba's proposed Context Challenge against a baseline category-based system. Our experiments suggest that moving beyond categories for context modeling is quite beneficial, and may be the critical missing ingredient in scene understanding systems.
Non-uniform Deblurring for Shaken Images (O. Whyte, J. Sivic, A. Zisserman and J. Ponce)
We argue that blur resulting from camera shake is mostly due to the 3D rotation of the camera, causing a blur that can be significantly non-uniform across the image. However, most current deblurring methods model the observed image as a convolution of a sharp image with a uniform blur kernel. We propose a new parametrized geometric model of the blurring process in terms of the rotational velocity of the camera during exposure. We apply this model in the context of two different algorithms for camera shake removal: the first uses a single blurry image (blind deblurring), while the second uses both a blurry image and a sharp but noisy image of the same scene. We show that our approach makes it possible to model and remove a wider class of blurs than previous approaches, and demonstrate its effectiveness with experiments on real images.
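To see why camera rotation produces non-uniform blur, recall that a pure rotation R of a pinhole camera with calibration matrix K moves image points by the homography K R K^{-1}. The sketch below applies this to a small in-plane rotation (one special case of 3D rotation) and measures how far two pixels move. The focal length, principal point, rotation angle, and function names are hypothetical values chosen for illustration and are not taken from the work described above.

```python
import math

def rot_z(theta):
    """Rotation matrix about the optical (z) axis by theta radians."""
    c, s = math.cos(theta), math.sin(theta)
    return [[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]]

def warp(point, R, f, cx, cy):
    """Apply the homography K R K^{-1} to a pixel, for a pinhole camera
    with focal length f (in pixels) and principal point (cx, cy)."""
    x, y = point
    ray = [(x - cx) / f, (y - cy) / f, 1.0]  # back-project: K^{-1} p
    r = [sum(R[i][k] * ray[k] for k in range(3)) for i in range(3)]
    return (f * r[0] / r[2] + cx, f * r[1] / r[2] + cy)  # re-project: K r

def displacement(point, R, f, cx, cy):
    """Distance in pixels that a point moves under the rotation."""
    u, v = warp(point, R, f, cx, cy)
    return math.hypot(u - point[0], v - point[1])

# Hypothetical 640x480 camera and a 0.01 rad in-plane rotation.
f, cx, cy = 1000.0, 320.0, 240.0
R = rot_z(0.01)
center_shift = displacement((320.0, 240.0), R, f, cx, cy)
corner_shift = displacement((0.0, 0.0), R, f, cx, cy)
```

The image center stays fixed while the corner moves by roughly four pixels, so no single convolution kernel can describe the blur at both locations. Rotations about the other two axes go through the same `warp` function with a different matrix R.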
Segmenting Scenes by Matching Image Composites (B. Russell, A. Efros, J. Sivic and A. Zisserman, joint work with W. T. Freeman, MIT)
In this paper, we investigate how, given an image, similar images sharing the same global description can help with unsupervised scene segmentation. In contrast to recent work in semantic alignment of scenes, we allow an input image to be explained by partial matches of similar scenes. This allows for a better explanation of the input scenes. We perform MRF-based segmentation that optimizes over matches, while respecting boundary information. The recovered segments are then used to re-query a large database of images to retrieve better matches for the target regions. We show improved performance in detecting the principal occluding and contact boundaries for the scene over previous methods on data gathered from the LabelMe database.
Learning from Ambiguously Labeled Images (T. Cour, B. Sapp, C. Jordan and B. Taskar)
In many image and video collections, we have access only to partially labeled data. For example, personal photo collections often contain several faces per image and a caption that only specifies who is in the picture, but not which name matches which face. Similarly, movie screenplays can tell us who is in the scene, but not when and where they are on the screen (see figure 6). We formulate the learning problem in this setting as partially supervised multiclass classification where each instance is labeled ambiguously with more than one label. We show theoretically that effective learning is possible under reasonable assumptions even when all the data is weakly labeled. We apply our framework to identifying faces culled from web news sources and to naming characters in TV series and movies, achieving 6% error for character naming on 16 episodes of LOST.
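One way to turn ambiguous labels into a training objective is a loss that rewards a high average score over the candidate labels and penalizes high scores on all other labels. The sketch below is an illustrative surrogate of this kind, with a hinge function as an assumed choice; it is not claimed to be the exact loss used in the work above, and the label names and scores are made up.

```python
def hinge(z):
    """Standard hinge: zero once the margin z reaches 1."""
    return max(0.0, 1.0 - z)

def ambiguous_loss(scores, candidates):
    """Loss for one instance whose true label is only known to lie in
    `candidates`. `scores` maps every label to a real-valued score."""
    inside = [scores[a] for a in candidates]
    outside = [s for a, s in scores.items() if a not in candidates]
    # Push up the mean score of the candidate labels ...
    loss = hinge(sum(inside) / len(inside))
    # ... and push down the score of every label that cannot be correct.
    loss += sum(hinge(-s) for s in outside)
    return loss

# Hypothetical classifier scores for one face track.
scores = {"jack": 2.0, "kate": 0.0, "locke": -1.5}

# The caption says the face belongs to jack or kate: consistent with the scores.
loss_consistent = ambiguous_loss(scores, {"jack", "kate"})
# The caption says kate or locke: the high score for jack is penalized.
loss_inconsistent = ambiguous_loss(scores, {"kate", "locke"})
```

The loss never needs to know which candidate is the true label; it only uses the candidate set, which is what makes learning from captions or screenplays possible in this formulation.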
Learning discriminative part-based object models from weakly annotated images (T. Cour and F. Bach)
In many image recognition tasks, one key difficulty is the cost of annotating training examples with a bounding box or a segmentation. On the other hand, weakly annotated datasets composed of images and a set of attached labels are plentiful. Learning precise part-based object models from such weak supervision is very challenging, due to the lack of correspondences in the training set. We propose a discriminative method for learning object parts, based on a novel boosting formulation for multiple instance learning. We show that our algorithm produces a consistent labeling under certain restrictive but plausible conditions. We demonstrate our approach on a variety of weakly annotated image datasets.