Section: New Results

Human activity capture and classification

Automatic Annotation of Human Actions in Video (O. Duchenne, I. Laptev, J. Sivic, F. Bach and J. Ponce)

Our work in [29] addresses the problem of automatic temporal annotation of realistic human actions in video using minimal manual supervision. To this end we consider two associated problems: (a) weakly-supervised learning of action models from readily available annotations, and (b) temporal localization of human actions in test videos. To avoid the prohibitive cost of manual annotation for training, we use movie scripts as a means of weak supervision. Scripts, however, provide only implicit, sometimes noisy, and imprecise information about the type and location of actions in video (cf. Figure 7 (a)). We address this problem with a kernel-based discriminative clustering algorithm that locates actions in the weakly-labeled training data (cf. Figure 7 (b)). Using the obtained action samples, we train temporal action detectors and apply them to locate actions in the raw video data. Our experiments demonstrate that the proposed method for weakly-supervised learning of action models leads to a significant improvement in action detection. We present detection results for three action classes in four feature-length movies with challenging and realistic video data.

Figure 7. (a): Video clips with OpenDoor actions provided by automatic script-based annotation. Selected frames illustrate both the variability of action samples within a class and the imprecise localization of actions in video clips. (b): In feature space, positive samples are constrained to be located on temporal feature tracks corresponding to consecutive temporal windows in video clips. Background (non-action) samples provide further constraints on the clustering.

Actions in Context (I. Laptev, joint work with M. Marszałek and C. Schmid, INRIA-Grenoble)

We exploit the context of natural dynamic scenes for human action recognition in video. Human actions are frequently constrained by the purpose and the physical properties of scenes and demonstrate high correlation with particular scene classes. For example, eating often happens in a kitchen while running is more common outdoors (cf. Figure 8). The contribution of [40] is three-fold: (a) we automatically discover relevant scene classes and their correlation with human actions, (b) we show how to learn selected scene classes from video without manual supervision, and (c) we develop a joint framework for action and scene recognition and demonstrate improved recognition of both in natural video. We use movie scripts as a means of automatic supervision for training. For selected action classes we identify correlated scene classes in text and then retrieve video samples of actions and scenes for training using script-to-video alignment. Our visual models for scenes and actions are formulated within the bag-of-features framework and are combined in a joint scene-action SVM-based classifier. We report experimental results and validate the method on a new large dataset with twelve action classes and ten scene classes acquired from 69 movies.

Figure 8. Video samples from our dataset with high co-occurrences of actions and scenes and automatically assigned annotations.
(a) eating, kitchen (b) eating, cafe (c) running, road (d) running, street

Evaluation of local spatio-temporal features for action recognition (I. Laptev, joint work with M. M. Ullah at INRIA-Rennes and H. Wang, A. Kläser and C. Schmid at INRIA-Grenoble)

Local space-time features have recently become a popular video representation for action recognition. Several methods for feature localization and description have been proposed in the literature and promising recognition results were demonstrated for a number of action classes. The comparison of existing methods, however, is often limited given the different experimental settings used. The purpose of this work [44] is to evaluate and compare previously proposed space-time features in a common experimental setup. In particular, we consider four different feature detectors and six local feature descriptors and use a standard bag-of-features SVM approach for action recognition. We investigate the performance of these methods on a total of 25 action classes distributed over three datasets with varying difficulty. Among interesting conclusions, we demonstrate that regular sampling of space-time features consistently outperforms all tested space-time interest point detectors for human actions in realistic settings. We also demonstrate a consistent ranking for the majority of methods over different datasets and discuss their advantages and limitations.
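The bag-of-features representation mentioned above can be illustrated with a minimal sketch. It covers only the quantization step: each local descriptor is assigned to its nearest visual word and the video is summarized by a normalized histogram of word counts. The function names, the toy two-dimensional descriptors and the three-word vocabulary are assumptions made for illustration; they are not the detectors, descriptors or vocabulary used in [44], and the SVM stage is omitted.

```python
def nearest_word(descriptor, vocabulary):
    """Index of the vocabulary entry closest to descriptor (squared Euclidean)."""
    def sq_dist(u, v):
        return sum((a - b) ** 2 for a, b in zip(u, v))
    return min(range(len(vocabulary)), key=lambda k: sq_dist(descriptor, vocabulary[k]))


def bag_of_features(descriptors, vocabulary):
    """L1-normalized histogram of visual-word occurrences for one video."""
    counts = [0] * len(vocabulary)
    for d in descriptors:
        counts[nearest_word(d, vocabulary)] += 1
    total = sum(counts)
    if total == 0:
        # No features detected: return an all-zero histogram.
        return [0.0] * len(vocabulary)
    return [c / total for c in counts]


# Hypothetical toy data: a 3-word vocabulary and 4 local descriptors.
vocabulary = [(0.0, 0.0), (1.0, 1.0), (5.0, 5.0)]
descriptors = [(0.1, 0.0), (0.9, 1.2), (1.1, 0.8), (4.0, 6.0)]
histogram = bag_of_features(descriptors, vocabulary)
print(histogram)
```

In a full pipeline, histograms of this kind would be computed for every training and test video and passed to an SVM classifier; in practice the vocabulary is typically obtained by clustering a sample of training descriptors.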

Multi-view Synchronization of Human Actions and Dynamic Scenes (I. Laptev, joint work with E. Dexter and P. Pérez at INRIA-Rennes)

This work deals with the temporal synchronization of image sequences. Two instances of this problem are considered: (a) synchronization of human actions and (b) synchronization of dynamic scenes with view changes. To address both tasks and to reliably handle large view variations, in [27] we use self-similarity matrices, which remain stable across views. We propose time-adaptive descriptors that capture the structure of these matrices while being invariant to the impact of time warps between views. Synchronizing two sequences is then performed by aligning their temporal descriptors using the Dynamic Time Warping algorithm. We present quantitative comparison results between time-fixed and time-adaptive descriptors for image sequences with different frame rates. We also illustrate the performance of the approach on several challenging videos with large view variations, drastic independent camera motions and within-class variability of human actions.
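The alignment step can be sketched with a textbook Dynamic Time Warping routine. This is a minimal illustration, not the implementation of [27]: the function name dtw_align, the absolute-difference distance and the scalar per-frame values are assumptions that stand in for the temporal descriptors computed from self-similarity matrices.

```python
from math import inf


def dtw_align(seq_a, seq_b, dist=lambda x, y: abs(x - y)):
    """Align two sequences with classic DTW; return (total_cost, path).

    path is a list of (index_in_a, index_in_b) pairs, monotonic in both indices.
    """
    n, m = len(seq_a), len(seq_b)
    # cost[i][j]: minimal cumulative cost of aligning seq_a[:i] with seq_b[:j].
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = dist(seq_a[i - 1], seq_b[j - 1])
            cost[i][j] = d + min(cost[i - 1][j - 1], cost[i - 1][j], cost[i][j - 1])

    # Backtrack from the end, always moving to the cheapest predecessor.
    path = []
    i, j = n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        _, i, j = min(
            (cost[i - 1][j - 1], i - 1, j - 1),
            (cost[i - 1][j], i - 1, j),
            (cost[i][j - 1], i, j - 1),
        )
    path.reverse()
    return cost[n][m], path


# Hypothetical example: the same signal sampled at two different frame rates.
slow = [0, 1, 2, 3]
fast = [0, 0, 1, 1, 2, 2, 3, 3]
total_cost, path = dtw_align(slow, fast)
print(total_cost, path)
```

The returned path maps each frame of one sequence to its counterpart in the other, which is the synchronization itself. With vector-valued descriptors, only the dist argument needs to change.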

Quantitative analysis of videos for social sciences (N. Cherniavsky, I. Laptev, J. Ponce, J. Sivic, A. Zisserman)

The display of human actions in mass media and its implications for our society are intensively studied in sociology, marketing and health care. For example, researchers have looked at the relationship between the incidence of characters who smoke in movies and adolescent smoking; the occurrence of drinking acts in movies and the consumption of alcohol; and the impact over time of the evolution of women's activities depicted in TV shows. Video analysis for these purposes currently requires hours of tedious manual labeling, rendering large-scale experiments infeasible. Automating the detection and classification of human traits and actions in video will potentially increase the quantity and diversity of experimental data. We are working with the Institut National de l'Audiovisuel (INA), which has provided archive news footage for testing purposes, to automatically label people according to static and dynamic attributes, such as age, gender, race, clothing, hairstyle, and expression. It can be difficult to find enough good training data for such a specific milieu. We are exploring the use of transfer learning techniques to train a classifier from readily available still images from the web.

Figure 9. Sample frames from INA news footage videos with automatically detected faces and facial features overlaid. Text shows examples of considered attributes.

Learning person-specific classifiers from video (J. Sivic and A. Zisserman, joint work with M. Everingham, University of Leeds)

We investigate the problem of automatically labelling faces of characters in TV or movie material with their names, using only weak supervision from automatically-aligned subtitle and script text. Our previous work (Everingham et al.) demonstrated promising results on the task, but the coverage of the method (proportion of video labelled) and generalization was limited by a restriction to frontal faces and nearest neighbour classification.

In this work we build on that method, greatly extending the coverage by detecting and recognizing characters in profile views. In addition, we make the following contributions: (i) seamless tracking, integration and recognition of profile and frontal detections, and (ii) a character-specific multiple kernel classifier which is able to learn the features best able to discriminate between the characters.

We report results on seven episodes of the TV series “Buffy the Vampire Slayer”, demonstrating significantly increased coverage and performance with respect to previous methods on this material.

