
Section: New Results

Semi-supervised learning and structuring of visual models

Automatic image annotation

Participants : Matthieu Guillaumin, Thomas Mensink, Cordelia Schmid, Jakob Verbeek.

Image auto-annotation is an important open problem in computer vision. For this task we developed TagProp, a discriminatively trained nearest-neighbor model [15] , [25] . In TagProp, tags of test images are predicted with a weighted nearest-neighbor model that exploits labeled training images, where neighbor weights are based on neighbor rank or distance. TagProp allows the integration of metric learning by directly maximizing the log-likelihood of the tag predictions in the training set. In this manner, we can optimally combine a collection of image similarity metrics that cover different aspects of image content, such as local shape descriptors or global color histograms. We also introduced a word-specific sigmoidal modulation of the weighted neighbor tag predictions to boost the recall of rare words.
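The prediction step can be sketched as follows. This is a minimal illustration of rank-based weighted nearest-neighbor tag prediction, not the trained model itself: in TagProp the weights (or the combination of distance metrics) and the per-word sigmoid parameters are learned by maximizing the training log-likelihood, whereas here they are fixed placeholders.

```python
import numpy as np

def tagprop_predict(dists, train_tags, n_neighbors=5):
    """Predict tag probabilities for one test image with a rank-based
    weighted nearest-neighbor model (simplified TagProp sketch).

    dists      : (N,) distances from the test image to N training images
    train_tags : (N, T) binary tag matrix of the training images
    """
    order = np.argsort(dists)[:n_neighbors]
    # Rank-based weights: closer neighbors count more. In the real model
    # these weights (or a learned metric combination) are optimized on
    # the training set; here they simply decay with rank.
    weights = 1.0 / np.arange(1, n_neighbors + 1)
    weights /= weights.sum()
    return weights @ train_tags[order]          # (T,) tag probabilities

def boost_rare(p, a=2.0, b=-0.5):
    """Word-specific sigmoid modulation to boost recall of rare tags;
    a and b would be learned per word (the values here are placeholders)."""
    return 1.0 / (1.0 + np.exp(-(a * p + b)))
```

Applying `boost_rare` to the raw weighted-neighbor scores lets infrequent tags reach a usable confidence even when few neighbors carry them.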

We investigated the performance of different variants of our model and compared them to existing work, presenting experimental results for three challenging data sets. On all three, TagProp significantly improves over the previous state of the art. In a follow-up paper [23] we make an extensive comparison to results obtained with support vector machine classifiers, and consider image categorization problems where images are accompanied by user-generated keywords. An on-line demonstration of our system is available at . Figure 3 shows several example images with the automatically generated annotations.

Figure 3. Two example images from the three data sets that we used in our experiments. For each image we show the manual annotation (left) and the automatically predicted one (right), where we give the confidence value for each predicted word, and underline it when it appears in the manual annotation.

Semi-supervised image categorization

Participants : Matthieu Guillaumin, Cordelia Schmid, Jakob Verbeek.

In on-going work, we study the problem of image categorization using semi-supervised techniques. The goal is to decide whether or not an image belongs to a certain category. In the standard supervised setting, a binary classifier is learned from manually labeled images. Using more labeled examples typically improves performance, but obtaining the image labels is a time-consuming process. We are therefore interested in how other sources of information can aid the learning process given a fixed amount of labeled images. In particular, we consider a scenario where keywords are associated with the training images, e.g. as found on photo-sharing websites. The goal is to learn a classifier for images alone, but we use the keywords associated with labeled and unlabeled images to improve the classifier through semi-supervised learning. We first learn a strong classifier using both the image content and keywords, and use it to predict the labels of unlabeled images. We then learn a second classifier that takes only visual features as input, and train it from the labeled images and the output of the first classifier for the unlabeled ones. In our experiments we consider 58 categories from the PASCAL VOC'07 and MIR Flickr sets. For most categories our semi-supervised approach performs better than an approach that uses only labeled images. We also consider a scenario without any manual labeling, where classifiers are learned directly from the image tags; in this case too, the semi-supervised approach improves classification performance. Figure 4 shows example images with their keywords and category labels as used in our experiments.
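The two-stage scheme can be sketched as below. The nearest-centroid classifier is only a hypothetical stand-in so the sketch is self-contained; the actual work would use a stronger learner, and the feature layout (separate visual and keyword matrices) is an assumption for illustration.

```python
import numpy as np

class CentroidClassifier:
    """Minimal binary classifier (nearest class centroid) used here
    purely as a placeholder for a stronger learner such as an SVM."""
    def fit(self, X, y):
        self.c0 = X[y == 0].mean(axis=0)
        self.c1 = X[y == 1].mean(axis=0)
        return self
    def predict(self, X):
        d0 = np.linalg.norm(X - self.c0, axis=1)
        d1 = np.linalg.norm(X - self.c1, axis=1)
        return (d1 < d0).astype(int)

def two_stage_semisupervised(Xv_lab, Xt_lab, y_lab, Xv_unlab, Xt_unlab):
    """Semi-supervised scheme described above: a strong classifier on
    visual + keyword features pseudo-labels the unlabeled images, then a
    visual-only classifier is trained on true and predicted labels.
    Xv_* are visual features, Xt_* keyword features; only the second,
    visual-only classifier is applied at test time."""
    clf1 = CentroidClassifier().fit(np.hstack([Xv_lab, Xt_lab]), y_lab)
    pseudo = clf1.predict(np.hstack([Xv_unlab, Xt_unlab]))
    clf2 = CentroidClassifier().fit(
        np.vstack([Xv_lab, Xv_unlab]), np.concatenate([y_lab, pseudo]))
    return clf2
```

Because the second classifier never sees keyword features, it can score test images for which no tags are available.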

Figure 4. Example images used in our experiments, together with the user tags and category labels.

Ranking user-annotated images for multiple query terms

Participants : Moray Allan, Jakob Verbeek.

In [11] we considered how image search on photo-sharing sites like Flickr can be improved by taking into account which users provided which images. Search queries are answered by using a mixture of kernel density estimators to rank the visual content of images from the Flickr website whose noisy tag annotations match the given query terms. Experiments show that requiring agreement between images from different users allows a better model of the visual class to be learnt, and that precision can be increased by rejecting images from `untrustworthy' users. We further focus on search queries for multiple terms, and demonstrate enhanced performance by learning a single combined model for the overall query, treating images which only satisfy a subset of the search terms as negative training examples. Figure 5 shows some Flickr images returned by a textual search for `boat', and the highest-ranking results for this query according to our model.
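The user-agreement idea can be illustrated with a toy kernel-density ranking. This is a simplified sketch under assumed Gaussian kernels and a single shared bandwidth: each image is scored by the density contributed by images from *other* users, so visual content confirmed across users ranks high while one-off uploads do not.

```python
import numpy as np

def user_aware_kde_ranking(feats, users, bandwidth=1.0):
    """Rank tag-matched images by a kernel density estimate that only
    counts contributions from images uploaded by different users.

    feats : (N, D) visual features of the images matching the query tags
    users : (N,) user id of each image
    returns indices of the images, best-ranked first
    """
    diff = feats[:, None, :] - feats[None, :, :]
    k = np.exp(-0.5 * np.sum(diff ** 2, axis=-1) / bandwidth ** 2)
    other_user = users[:, None] != users[None, :]   # mask same-user pairs
    scores = (k * other_user).sum(axis=1)
    return np.argsort(-scores)
```

An image whose appearance is echoed only by the same user's other uploads gets no support, which is a crude form of the `untrustworthy'-user rejection described above.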

Figure 5. Images matching the query `boat' (cross marks irrelevant ones), and those top ranked by our model.

Improving web image search results using query-relative classifiers

Participants : Moray Allan, Frédéric Jurie, Josip Krapac, Jakob Verbeek.

In this on-going work we propose an image re-ranking method, based on textual and visual features, that does not require learning a separate model for every new search query. Previous image re-ranking methods that take visual features into account require separate training for every new query, and are therefore unsuitable for real-world web search applications. Our approach instead learns a single generic classifier based on `query-relative' features. These features combine textual information about the occurrence of the query terms and of other words found to be related to the query, with visual information derived from a histogram-based image representation. We can train the model once, using whatever annotated data is available, and then use it to make predictions for previously unseen test classes.
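The key point is that the features are defined relative to the query, so one classifier transfers across queries. The sketch below shows a hypothetical feature layout (title/tag occurrence indicators followed by the visual histogram); the field names and the exact textual features are illustrative assumptions, not the paper's definition.

```python
def query_relative_features(query_terms, related_words, meta, visual_hist):
    """Build a query-relative feature vector (sketch).

    The textual part records, for the query terms and for words found to
    be related to the query, whether they occur in the image's title and
    tags; the visual histogram is appended unchanged. Because feature i
    always means "i-th query-relative word in field f", a single generic
    classifier can score results for any query.
    meta : dict with 'title' and 'tags' word lists (hypothetical layout)
    """
    text = []
    for w in list(query_terms) + list(related_words):
        text.append(float(w in meta['title']))
        text.append(float(w in meta['tags']))
    return text + list(visual_hist)
```

A classifier trained on such vectors for some annotated queries can then re-rank results for a query it has never seen, since the feature semantics do not depend on the query string itself.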

The second contribution of this work is a new public data set of images returned by a web search engine for 353 search queries, along with their associated meta-data and ground-truth annotations for all images. We hope that this data set will facilitate further research on improving image search.

As an example, Figure 6 shows the top-ranked images given by a web search engine for the query `Eiffel tower' (top), and the top-ranked images after re-ranking by our proposed classification method (bottom).

Figure 6. Example images for the query `Eiffel tower', the top-ranked images given by a web search engine for the query (top), and the top-ranked images after re-ranking by our proposed classification method (bottom).

Automatic learning of interactions between humans and objects

Participants : Alessandro Prest, Cordelia Schmid, Vittorio Ferrari [ ETH Zürich ] .

In this on-going work we introduce a novel human-centric approach for learning human actions, modeled as interactions with objects. Interactions are often the main characteristic of an action, see Figure 7. The action 'playing trumpet', for example, can be described as a human holding a trumpet in a certain position: the characteristic features are the object trumpet and its spatial relation to the human.

Our approach first detects humans with a part-based human detector able to cope with various degrees of visibility: we map different part detections to a common reference frame and assign them a single score. We then use a learning algorithm to determine the action object and its spatial relation to the human. Starting from a set of images depicting an action, our method produces a probabilistic model of the human-object interaction, i.e. it automatically determines the relevant object and the spatial relation between the human and the object. Images for the actions 'playing a trumpet', 'riding a bike', and 'wearing a hat' were obtained via Google Images search and text search on the IAPR TC-12 data set. Results show that humans, action objects, as well as spatial human-object interactions can be determined automatically, see Figure 7. Note that our approach can handle images not containing the action, as may occur with keyword-based text search.
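The learned spatial relation can be illustrated as follows. This is a minimal sketch, assuming axis-aligned (x, y, w, h) boxes and a simple Gaussian over object positions expressed in the human-centred reference frame; the actual probabilistic model in this work is richer.

```python
import numpy as np

def human_object_relation(human_boxes, object_boxes):
    """Map object detections into the human-centred reference frame and
    fit a Gaussian to the relative positions (sketch of a learned
    human-object spatial relation). Boxes are (x, y, w, h) arrays.
    """
    hb = np.asarray(human_boxes, dtype=float)
    ob = np.asarray(object_boxes, dtype=float)
    h_ctr = hb[:, :2] + hb[:, 2:] / 2.0          # human box centers
    o_ctr = ob[:, :2] + ob[:, 2:] / 2.0          # object box centers
    # Normalize the displacement by the human box size, so the relation
    # is invariant to the scale of the person in the image.
    rel = (o_ctr - h_ctr) / hb[:, 2:]
    return rel.mean(axis=0), np.cov(rel, rowvar=False)
```

For 'wearing hat', for instance, the mean displacement would sit above the human center (negative y in image coordinates), matching the right plot of Figure 7.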

Figure 7. Left: Example results showing the automatically detected human (green) and object (pink) for the actions 'playing trumpet', 'riding bike' and 'wearing hat'. Right: Human-object spatial relations learned by our method. The reference frame is based on the human detection. The trumpet (left plot) is in front of the human, the bike (center plot) below the person, and the hat (right plot) on top of the person.

