Section: New Results
Supervised methods for visual object recognition and localization
Learning metrics for visual identification
Participants : Matthieu Guillaumin, Cordelia Schmid, Jakob Verbeek.
Face identification, determining whether two face images depict the same person or not, is difficult due to variations in scale, pose, lighting, background, expression, hairstyle, and glasses. We have introduced two methods for learning robust distance measures: (a) a logistic discriminant approach which learns a Mahalanobis metric from a set of labelled image pairs, and (b) a nearest-neighbour approach which computes the probability that two face images belong to the same person. We evaluated our approaches on the Labeled Faces in the Wild data set, a large and very challenging data set of faces from Yahoo! News. The evaluation protocol for this data set defines a restricted setting, where a fixed set of positive and negative image pairs is given, as well as an unrestricted one, where faces are labelled by their identity. At the time of submission, we were the first to present results for the unrestricted setting, and showed that our methods benefit from this richer training data, much more so than the current state-of-the-art method. Our results of 79.3% and 87.5% correct for the restricted and unrestricted settings, respectively, significantly improve over the current state-of-the-art result of 78.5%. The confidence scores obtained for face identification were also used for applications such as clustering and recognition from a single training example, and we showed that our learned metrics improve performance for these tasks as well. See Figure 8 for two example face clusters.
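The logistic discriminant idea can be sketched as follows: the probability that a pair depicts the same person is modelled as a sigmoid of a bias minus the Mahalanobis distance between the two feature vectors, and the metric is fitted by gradient ascent on the pair log-likelihood. The code below is a minimal illustration, not the published implementation; the function names, the fixed learning rate, and the omission of a positive-semidefinite projection of the metric are all simplifications.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def ldml_fit(X1, X2, y, n_iter=500, lr=0.1):
    """Toy logistic-discriminant metric learning on labelled pairs.

    X1, X2 : (n, d) arrays of paired face feature vectors
    y      : (n,) array, 1 if the pair shows the same person, else 0
    Returns (M, b) such that sigmoid(b - dM) estimates P(same),
    where dM is the squared Mahalanobis distance under M.
    """
    n, d = X1.shape
    M = np.eye(d)          # start from the Euclidean metric
    b = 0.0
    diff = X1 - X2         # (n, d) pair differences
    for _ in range(n_iter):
        dist = np.einsum('ni,ij,nj->n', diff, M, diff)
        p = sigmoid(b - dist)
        err = y - p        # gradient of the pair log-likelihood
        # ascend the likelihood; a PSD projection of M is omitted here
        M += lr / n * np.einsum('n,ni,nj->ij', p - y, diff, diff)
        b += lr * err.mean()
    return M, b
```

A pair is then declared "same person" when sigmoid(b - dM) exceeds 0.5, i.e. when its learned distance falls below the bias b.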
Combining efficient object localization and image classification
Participants : Hedi Harzallah, Frédéric Jurie, Cordelia Schmid.
We have developed a unified approach for object localization and classification. Objects are localized with an efficient two-stage sliding window method that combines the speed of a linear classifier with the robustness of a sophisticated non-linear one. The first stage generates object hypotheses with a sliding window and a linear support vector machine (SVM) classifier, which rapidly rejects most negative windows. The remaining ones are then re-scored with a non-linear SVM classifier with a χ² radial basis function (RBF) kernel, which significantly improves the localization results; see Figure 9 for example detections.
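The two-stage cascade can be sketched as follows. Here `rescore` is a placeholder callable standing in for the expensive non-linear χ²-RBF SVM, and the names and the `keep_frac` pruning rule are illustrative assumptions, not the exact mechanism of our system.

```python
import numpy as np

def cascade_detect(windows, w_lin, b_lin, rescore, keep_frac=0.05):
    """Two-stage detection sketch: a fast linear SVM prunes candidate
    windows, then an expensive non-linear scorer re-ranks the survivors.

    windows      : (n, d) feature vectors, one per sliding window
    w_lin, b_lin : linear SVM weight vector and bias (stage 1)
    rescore      : callable mapping an (m, d) array to (m,) scores (stage 2)
    Returns window indices and scores, best first.
    """
    lin_scores = windows @ w_lin + b_lin          # cheap: one dot product each
    k = max(1, int(keep_frac * len(windows)))
    survivors = np.argsort(lin_scores)[-k:]       # keep only top-scoring windows
    final = rescore(windows[survivors])           # costly, but on few windows
    rank = np.argsort(final)[::-1]
    return survivors[rank], final[rank]
```

The point of the design is that the non-linear classifier, whose cost grows with the number of support vectors, is only ever evaluated on the small fraction of windows the linear stage could not reject.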
Given object detections, we show that a contextual combination with image classification can further improve detection results. Even though the two tasks are different, they are clearly related: knowing the class of an image can help to detect barely visible objects. Indeed, if an object is only partially visible it will be hard to find with the detector, while the classifier may still have enough information (context, object parts) to decide on the presence of the object. Experimental results show that our combined object localization and classification method outperforms the state of the art on the PASCAL VOC 2007 and 2008 datasets. Our approach also gives very good results on a dataset of infrared images provided by MBDA.
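A minimal sketch of such a contextual combination, assuming a simple linear fusion of sigmoid-normalised scores; the weight `alpha` and the fusion rule itself are illustrative assumptions, not the exact combination used in our system.

```python
import numpy as np

def contextual_rescore(det_scores, cls_score, alpha=0.7):
    """Fuse per-window detection scores with the image-level classifier
    score for the same class. A window in an image the classifier believes
    contains the class is boosted; one in an unlikely image is suppressed.

    det_scores : iterable of raw detector scores for candidate windows
    cls_score  : raw image-level classifier score for the class
    """
    p_det = 1.0 / (1.0 + np.exp(-np.asarray(det_scores, dtype=float)))
    p_img = 1.0 / (1.0 + np.exp(-cls_score))
    return alpha * p_det + (1 - alpha) * p_img   # illustrative linear fusion
```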
Learning shape models for object matching
Participants : Tingting Jiang, Frédéric Jurie, Cordelia Schmid.
The aim of this work is to learn an a priori shape model for an object class and to improve shape matching with the learned shape prior. Given images of example instances, we learn a mean shape of the object class, as well as the variations of non-affine and affine transformations separately, based on the thin plate spline (TPS) parametrization.
Unlike previous methods, we represent shapes for learning by vector fields instead of features, which makes our learning approach general. During shape matching, we inject the prior shape knowledge to make the matching result consistent with the training examples. This is achieved by an extension of the TPS-RPM algorithm which finds a closed-form solution for the TPS transformation coherent with the learned transformations.
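For reference, a plain interpolating 2-D thin plate spline is fitted by solving a small linear system over the control points. The sketch below shows only this basic TPS step; the TPS-RPM extension with learned shape priors and soft point correspondences adds regularisation terms that are not reproduced here, and the function names are our own.

```python
import numpy as np

def tps_fit(src, dst):
    """Fit a 2-D thin plate spline mapping control points src -> dst.

    src, dst : (n, 2) float arrays of corresponding points
    Returns warp(pts), a function applying the interpolating TPS
    f(x) = a0 + A x + sum_i w_i U(|x - src_i|) to an (m, 2) array.
    """
    n = len(src)

    def U(r2):
        # radial basis r^2 log r^2 (proportional to the usual r^2 log r)
        with np.errstate(divide='ignore', invalid='ignore'):
            return np.where(r2 > 0.0, r2 * np.log(r2), 0.0)

    d2 = ((src[:, None, :] - src[None, :, :]) ** 2).sum(-1)
    K = U(d2)
    P = np.hstack([np.ones((n, 1)), src])
    # standard TPS interpolation system [[K, P], [P^T, 0]] [W; A] = [dst; 0]
    L = np.zeros((n + 3, n + 3))
    L[:n, :n] = K
    L[:n, n:] = P
    L[n:, :n] = P.T
    rhs = np.vstack([dst, np.zeros((3, 2))])
    params = np.linalg.solve(L, rhs)
    W, A = params[:n], params[n:]

    def warp(pts):
        d2p = ((pts[:, None, :] - src[None, :, :]) ** 2).sum(-1)
        return U(d2p) @ W + np.hstack([np.ones((len(pts), 1)), pts]) @ A
    return warp
```

By construction the fitted warp passes exactly through the control points; the affine part A captures the global pose while the weights W encode the non-affine deformation, which is the split our shape model learns separately.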
We tested our approach by using it to learn shape prior models for all five object classes in the ETHZ Shape Classes data set. The results show that more accurate shape models are learned than in previous work, and that the learned shape models improve object classification.
Region-based image segmentation
Participants : Tingting Jiang, Frédéric Jurie, Cordelia Schmid.
The aim of this work is to combine state-of-the-art image segmentation and object detection methods. We first developed a region classifier based on color and texture features. For color, we applied k-means to the RGB values of pixels to generate a codebook of 200 words; for texture, we used SIFT descriptors quantized into a codebook of 4000 words.
During training, for each manually annotated object region, we concatenated the histograms of color and texture words. For each class, we trained a non-linear SVM classifier with an intersection kernel. During testing, we first generated an over-segmentation using the Berkeley Segmentation Engine and then used the learned classifiers to estimate the class membership probabilities of each region. After this preliminary segmentation, we combined the segmentation results with those of the object detector.
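The region descriptor and kernel can be sketched as follows. The codebook sizes and descriptor types above are the real ones; the function names here are illustrative, and the resulting kernel matrix would be fed to an off-the-shelf SVM trained on a precomputed kernel.

```python
import numpy as np

def bow_histogram(features, codebook):
    """Quantise local features (n, d) onto their nearest codebook word
    (k, d) and return an L1-normalised bag-of-words histogram of length k.
    For a region descriptor, the colour and SIFT histograms built this
    way are simply concatenated."""
    d2 = ((features[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
    words = d2.argmin(axis=1)
    hist = np.bincount(words, minlength=len(codebook)).astype(float)
    return hist / max(hist.sum(), 1.0)

def intersection_kernel(H1, H2):
    """Histogram intersection kernel K(h, g) = sum_i min(h_i, g_i),
    computed for all pairs of rows in H1 (n, k) and H2 (m, k)."""
    return np.minimum(H1[:, None, :], H2[None, :, :]).sum(-1)
```

The intersection kernel is positive definite on histograms, so the Gram matrix it produces is a valid precomputed kernel for SVM training.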
Our results were among the top ones in the PASCAL VOC 2009 image segmentation challenge.
A 3D geometric model for multi-view object class detection
Participants : Jörg Liebelt, Cordelia Schmid.
We developed a new approach for multi-view object class detection. Appearance and geometry are treated as separate learning tasks with different training data.
Our approach uses a part model which discriminatively learns the object appearance with spatial pyramids from a database of real images, and encodes the 3D geometry of the object class with a generative representation built from a database of synthetic models. The geometric information is related to the 2D training data through viewpoint annotations and makes it possible to perform an approximate 3D pose estimation for generic object classes. The pose estimation provides an efficient way to evaluate the likelihood of groups of 2D part detections with respect to a full 3D geometry model, in order to disambiguate and prune 2D detections and to handle occlusions.
In contrast to other methods, neither tedious manual part annotation of training images nor explicit appearance matching between synthetic and real training data is required, which results in high geometric fidelity and increased flexibility.
On the Stanford 3D car and bicycle databases, the current state-of-the-art benchmarks for 3D object detection, our approach outperforms previously published results for viewpoint estimation. See Figure 10 for an illustration.