Section: New Results
Retrieval and modeling of objects and scenes in large image collections
Geometric Latent Dirichlet Allocation on a Matching Graph for Large-Scale Image Datasets (J. Sivic and A. Zisserman, joint work with J. Philbin, Oxford University)
Given a large-scale collection of images we would like to be able to conceptually group together images taken of the same place, of the same thing, or of the same person.
To achieve this, we introduce the Geometric Latent Dirichlet Allocation (gLDA) model for unsupervised discovery of particular objects in unordered image collections. The model explicitly represents images as mixtures of particular objects or facades, and builds rich latent topic models which incorporate the identity and locations of the visual words specific to each topic in a geometrically consistent way. Applying standard inference techniques to this model enables images likely to contain the same object to be probabilistically grouped and ranked.
Additionally, to reduce the computational cost of applying our model to large datasets, we describe a scalable method that first computes a matching graph over all the images in a dataset. This matching graph connects images that contain the same object, and rough image groups can be mined from it using standard clustering techniques. The gLDA model can then be applied to generate a more nuanced representation of the data. We also discuss how “hub images” (images representative of an object or landmark) can easily be extracted from our matching graph representation.
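The matching-graph stage described above can be sketched as follows. This is a minimal illustration, not the actual implementation: the pairwise match counts, the threshold, and the use of connected components as the "standard clustering technique" are all simplifying assumptions, and hub images are approximated here as the most-connected nodes.

```python
# Sketch: mine rough image groups and hub images from a matching graph.
# `pair_matches` stands in for the (hypothetical) output of spatially
# verified feature matching between image pairs.

from collections import defaultdict, deque

def build_matching_graph(pair_matches, min_matches=20):
    """Connect two images if they share at least `min_matches` verified matches."""
    graph = defaultdict(set)
    for (i, j), n in pair_matches.items():
        if n >= min_matches:
            graph[i].add(j)
            graph[j].add(i)
    return graph

def connected_components(graph):
    """Rough image groups = connected components of the matching graph."""
    seen, groups = set(), []
    for start in graph:
        if start in seen:
            continue
        comp, queue = {start}, deque([start])
        seen.add(start)
        while queue:
            u = queue.popleft()
            for v in graph[u]:
                if v not in seen:
                    seen.add(v)
                    comp.add(v)
                    queue.append(v)
        groups.append(comp)
    return groups

def hub_images(graph, k=1):
    """Hub images approximated as the k most-connected nodes."""
    return sorted(graph, key=lambda u: len(graph[u]), reverse=True)[:k]
```

With a toy input such as `{(0, 1): 30, (1, 2): 25, (3, 4): 40, (0, 2): 5}`, the weak (0, 2) pair is dropped and two groups emerge, with image 1 as the hub of the first.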
We evaluate our techniques on the publicly available Oxford buildings dataset (5K images) and show examples of objects automatically mined from this dataset. The methods are evaluated quantitatively on this dataset using a ground truth labelling for a number of Oxford landmarks. To demonstrate the scalability of the matching graph method, we show qualitative results on two larger datasets of images taken of the Statue of Liberty (37K images) and Rome (1M+ images).
Infinite Images: Creating and Exploring a Large Photorealistic Virtual Space (J. Sivic, joint work with B. Kaneva, MIT, A. Torralba, MIT, S. Avidan, Adobe Research, W.T. Freeman, MIT)
We present a system for generating “infinite” images from large collections of photos by means of transformed image retrieval. Given a query image, we first transform it to simulate how it would look if the camera moved sideways, and then perform image retrieval based on the transformed image. We then blend the query and retrieved images to create a larger panorama. Repeating this process produces an “infinite” image. The transformed image retrieval model is not limited to simple 2D left/right image translation, however, and we show how to approximate other camera motions, such as rotation and forward motion/zoom-in, using simple 2D image transforms. We represent the images in the database as a graph where each node is an image and different types of edges correspond to different geometric transformations simulating different camera motions. Generating infinite images is thus reduced to following paths in the image graph. Given this data structure, we can also generate a panorama that connects two query images, simply by finding the shortest path between the two in the image graph. We call this option the “image taxi”. Our approach does not assume the photographs depict a single real 3D location, nor that they were taken at the same time. Instead, we organize the photos into themes, such as city streets or skylines, and synthesize new virtual scenes by combining images from distinct but visually similar locations. There are a number of potential applications of this technology. It can be used to generate long panoramas as well as content-aware transitions between reference images or video shots. Finally, the image graph allows users to interactively explore large photo collections for ideation, games, social interaction and artistic purposes.
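The “image taxi” reduction described above can be sketched as a shortest-path search over the image graph. The sketch below is illustrative only: the edge data is invented, and breadth-first search (shortest path in hops) stands in for whatever path cost the actual system uses.

```python
# Sketch: the "image taxi" as shortest-path search in the image graph.
# Nodes are images; typed edges record simulated camera motions.

from collections import deque

def image_taxi(edges, start, goal):
    """Breadth-first shortest path (in hops) through the image graph.
    `edges` maps an image id to a list of (neighbour, motion) pairs."""
    queue = deque([(start, [start])])
    visited = {start}
    while queue:
        node, path = queue.popleft()
        if node == goal:
            return path  # sequence of images to blend into a panorama
        for nbr, _motion in edges.get(node, []):
            if nbr not in visited:
                visited.add(nbr)
                queue.append((nbr, path + [nbr]))
    return None  # no panorama connects the two queries
```

The returned path is the sequence of images whose pairwise blends form the panorama connecting the two queries.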
Get out of my picture! Internet based inpainting (O. Whyte, J. Sivic and A. Zisserman)
We present a method to replace a user-specified target region of a photograph using other photographs of the same scene downloaded from the Internet via viewpoint-invariant image search. Each of the retrieved images is first geometrically and then photometrically registered with the query photograph. Geometric registration is achieved using multiple homographies, and photometric registration is performed using a global affine transformation on image intensities. Each retrieved image proposes a possible solution for the target region. In the final step we combine these proposals into a single visually consistent result, using a Markov random field optimisation to choose seams between proposals, followed by gradient-domain fusion. We demonstrate the removal of objects and people from challenging photographs of Oxford landmarks containing complex image structures. An example result is shown in Figure 4.
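The photometric registration step, a global affine transformation on image intensities, can be sketched as a least-squares fit of a gain and bias over the overlap region. The function names and the closed-form normal-equation solution below are illustrative assumptions, not the paper's exact formulation.

```python
# Sketch: global affine photometric registration, fitting gain a and
# bias b so that a * src + b approximates dst over corresponding pixels.

def fit_affine_intensity(src_px, dst_px):
    """Least-squares a, b minimising sum((a * s + b - d)^2)."""
    n = len(src_px)
    ss = sum(src_px)
    sd = sum(dst_px)
    sss = sum(s * s for s in src_px)
    ssd = sum(s * d for s, d in zip(src_px, dst_px))
    a = (n * ssd - ss * sd) / (n * sss - ss * ss)
    b = (sd - a * ss) / n
    return a, b

def apply_affine_intensity(pixels, a, b):
    """Photometrically register an image to the query's exposure."""
    return [a * p + b for p in pixels]
```

Each retrieved image would be corrected this way before proposing content for the target region, so that the Markov random field seam selection compares proposals under consistent exposure.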
Avoiding confusing features in place recognition (J. Sivic, in collaboration with J. Knopp, CTU Prague / KU Lueven, and T. Pajdla, CTU Prague)
We seek to recognize the place depicted in a query image using a database of “street side” images annotated with geolocation information. This is a challenging task due to changes in scale, viewpoint and lighting between the query and the images in the database. The image database may also contain objects, such as trees or road markings, which frequently occur and hence can cause significant confusion between different places. We employ the efficient bag-of-features representation previously used for object retrieval in large image collections. As the main contribution, we show how to avoid features leading to confusion of particular places by using geotags attached to database images as a form of supervision. We develop a method for automatic detection of image-specific and spatially-localized groups of confusing features, and demonstrate that suppressing them significantly improves place recognition performance while reducing the database size. As a second contribution, we demonstrate that enhancing street side imagery with images downloaded from community photo-collections can lead to improved place recognition performance. Results are shown on a geotagged database of over 17K images of Paris downloaded from Google Street View.
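The geotag-supervised detection of confusing features can be sketched as follows: a visual word in a given image that is matched mostly by images geotagged far away is likely a confuser (a tree, a road marking) and can be suppressed from the index. The scoring rule, radius, and threshold below are illustrative assumptions rather than the method's actual criteria.

```python
# Sketch: flag visual words whose matches come mostly from far-away places.

import math

def gps_distance_km(p, q):
    """Approximate great-circle distance (haversine) between (lat, lon) pairs."""
    lat1, lon1, lat2, lon2 = map(math.radians, (*p, *q))
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * 6371.0 * math.asin(math.sqrt(h))

def confusing_words(word_hits, geotags, image_id, radius_km=0.2, min_far=3):
    """Return visual words in `image_id` matched by at least `min_far`
    images geotagged farther than `radius_km` from it.
    `word_hits[w]` lists the images containing visual word w."""
    here = geotags[image_id]
    flagged = set()
    for w, images in word_hits.items():
        far = sum(1 for im in images
                  if im != image_id and gps_distance_km(geotags[im], here) > radius_km)
        if far >= min_far:
            flagged.add(w)
    return flagged
```

Suppressing the flagged words shrinks the index while removing exactly the matches that pull retrieval toward the wrong place.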
Looking beyond image boundaries (J. Sivic, in collaboration with B. Kaneva, MIT, S. Avidan, Adobe, W. T. Freeman, MIT, and A. Torralba, MIT)
As we navigate through the world we constantly plan our next move based on what we would expect to see just around the corner or at the end of the hallway. Even in an unfamiliar environment we still make predictions about what we might encounter using all of our prior experience. Here, we study a system to predict what is just beyond the image boundaries. The input to the system is a single image from a new environment and a large photo collection of images of the same class, but not of the exact same 3D location. The output is the image beyond the field of view of the query image according to different camera motions. To simulate a single motion (rotate or zoom-out), we first transform the input image, approximating what the camera would have seen under that motion, and then use the valid portion of the transformed image to perform the image retrieval. The key question we ask is: can we predict what lies beyond the image boundaries (the prediction problem)? As it turns out, this is related to our ability to match the transformed query image to images in the database (the matching problem). We quantify the quality of the system by comparing the retrieval results to those of a ground truth data set that consists of geo-tagged images. This allows us to compare the predicted image to the actual image that was taken when the camera was actually moving. Our quantitative analysis provides an insight into the degree to which it is possible to predict image data beyond the image boundaries. We support our findings with a user study conducted using the Amazon Mechanical Turk.
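The transform-then-retrieve step can be sketched for the zoom-out motion: shrink the query into the centre of a same-sized canvas, record which pixels are valid (observed), and match only that region against the database. Downsampling by 2x2 averaging and sum-of-squared-differences matching are simplifying assumptions standing in for the system's actual transform and retrieval machinery.

```python
# Sketch: transformed image retrieval for a simulated zoom-out.
# Images are plain 2D lists of grayscale values.

def zoom_out(img):
    """Halve the query (2x2 mean) and embed it in a canvas of the same
    size; return the canvas and a mask marking the valid region."""
    h, w = len(img), len(img[0])
    small = [[(img[2 * r][2 * c] + img[2 * r][2 * c + 1]
               + img[2 * r + 1][2 * c] + img[2 * r + 1][2 * c + 1]) / 4.0
              for c in range(w // 2)] for r in range(h // 2)]
    canvas = [[0.0] * w for _ in range(h)]
    mask = [[False] * w for _ in range(h)]
    r0, c0 = h // 4, w // 4  # centre the shrunken query
    for r in range(h // 2):
        for c in range(w // 2):
            canvas[r0 + r][c0 + c] = small[r][c]
            mask[r0 + r][c0 + c] = True
    return canvas, mask

def retrieve(canvas, mask, database):
    """Return the name of the database image with the lowest sum of
    squared differences over the valid region only."""
    def ssd(img):
        return sum((img[r][c] - canvas[r][c]) ** 2
                   for r in range(len(mask)) for c in range(len(mask[0]))
                   if mask[r][c])
    return min(database, key=lambda name_img: ssd(name_img[1]))[0]
```

The unobserved border of the retrieved image then serves as the prediction of what lies beyond the query's field of view.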