
Section: New Results

Action recognition in video

Evaluation of local spatio-temporal features for action recognition

Participants : Alexander Kläser, Ivan Laptev [ INRIA Rocquencourt ] , Cordelia Schmid, Muhammad Ullah [ INRIA Rennes ] , Heng Wang.

Local space-time features have recently become a popular video representation for action recognition. Several methods for feature localization and description have been proposed in the literature, and promising recognition results have been demonstrated on various action datasets. Comparing these methods is difficult, however, because they were evaluated with different experimental settings and recognition methods. In our current work [24] , we carried out an extensive evaluation of local spatio-temporal features within a common evaluation framework based on bag-of-features video sequence classification. Experiments show that dense sampling consistently outperforms all tested interest point detectors in realistic video settings, although it also produces a very large number of features. The different interest point detectors perform similarly. Among the feature descriptors, the combination of gradient-based and optical-flow-based descriptors is a good choice for action recognition. Dense sampling combined with the HOG/HOF descriptor gives the best results on the most challenging dataset, Hollywood2; on the UCF dataset, the HOG3D descriptor combined with dense sampling performs best.
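To picture the common evaluation framework, the bag-of-features representation can be sketched as follows. This is a minimal illustration, not the actual implementation: the vocabulary and data are toy values, nearest-neighbour quantization is assumed, and in the full pipeline an SVM classifier would then be trained on such histograms.

```python
# Hedged sketch of bag-of-features video representation: local spatio-temporal
# descriptors are quantized to a visual vocabulary, and each video becomes an
# L1-normalized histogram of visual-word counts. Toy vocabulary and data.
import math

def nearest_word(descriptor, vocabulary):
    """Index of the closest codeword (Euclidean distance)."""
    best, best_d = 0, float("inf")
    for i, word in enumerate(vocabulary):
        d = math.dist(descriptor, word)
        if d < best_d:
            best, best_d = i, d
    return best

def bag_of_features(descriptors, vocabulary):
    """L1-normalized histogram of visual-word occurrences for one video."""
    hist = [0.0] * len(vocabulary)
    for desc in descriptors:
        hist[nearest_word(desc, vocabulary)] += 1.0
    total = sum(hist) or 1.0
    return [h / total for h in hist]

# Toy example: a 3-word vocabulary in a 2-D descriptor space.
vocab = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]
video = [(0.1, 0.1), (0.9, 0.1), (0.1, 0.9), (0.0, 0.2)]
print(bag_of_features(video, vocab))  # → [0.5, 0.25, 0.25]
```

In a real system the vocabulary is learned (typically by k-means over training descriptors) and the descriptors are high-dimensional HOG/HOF or HOG3D vectors rather than 2-D points.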

Human focused action localization in video

Participants : Alexander Kläser, Marcin Marszałek [ University of Oxford ] , Cordelia Schmid, Andrew Zisserman [ University of Oxford ] .

Early work on action recognition in video used sequences with static cameras, simple backgrounds and fully visible bodies. Approaches developed in this context were robust to variations in the actor and the action, but not to changes of viewpoint, scale or lighting, nor to partial occlusion and varying backgrounds. Recent work uses video material from movies, i.e., much less controlled and more challenging data.

Figure 11. Top 8 drinking detections for the movie Coffee and Cigarettes: ranks 1–5, 7 and 8 are true positives (TP); the detection at rank 6 is a false positive (FP) due to its imprecise localization in time.

To localize human actions in such movies, we develop a human-centric approach. Our goal is to localize the action temporally through the sequence and spatially in each frame. We first extract spatio-temporal human tracks and then detect actions within these tracks using a sliding-window classifier. Our human tracker copes with a wide range of postures, articulations, motions and camera viewpoints; it includes detection interpolation and a principled classification stage to suppress false positives. To localize actions within the extracted tracks, we introduce a spatio-temporal 3D histogram-of-gradients based descriptor adapted to the track. We show that tracks reduce search complexity and can be reused for multiple human actions, without loss in performance.
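The detection step over a track can be sketched as below. This is a hedged illustration: `score_fn` stands in for the track-adapted 3D gradient-histogram descriptor plus classifier of the actual method, and the greedy temporal non-maximum suppression is an assumed post-processing choice.

```python
# Sketch of sliding-window action detection along one human track. score_fn is
# a placeholder for the real descriptor + classifier; frame indices are toy.

def sliding_window_detections(track_length, window, stride, score_fn, threshold):
    """Score every temporal window on the track; keep those above threshold."""
    detections = []
    for start in range(0, track_length - window + 1, stride):
        s = score_fn(start, start + window)
        if s >= threshold:
            detections.append((start, start + window, s))
    # Greedy temporal non-maximum suppression: keep the best-scoring window,
    # drop any later candidate that overlaps a kept one.
    detections.sort(key=lambda d: -d[2])
    kept = []
    for d in detections:
        if all(d[1] <= k[0] or d[0] >= k[1] for k in kept):
            kept.append(d)
    return kept

# Toy track of 100 frames where only the window starting at frame 40 fires.
hits = sliding_window_detections(
    100, 20, 10, lambda a, b: 1.0 if a == 40 else 0.2, threshold=0.5)
print(hits)  # → [(40, 60, 1.0)]
```

Restricting the search to extracted tracks is what keeps the window count small compared with scanning every spatial position of every frame.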

Experimental results are presented for the actions of drinking and smoking on the Coffee and Cigarettes dataset, and for phoning and standing-up on the Hollywood2 dataset. We compare with previous methods on this material and demonstrate a significant improvement over the state of the art. Figure 11 shows the top eight drinking detections in approximately 24 minutes of the Coffee and Cigarettes movie.

Mining visual actions from movies

Participants : Adrien Gaidon, Marcin Marszałek [ University of Oxford ] , Cordelia Schmid.

In this work [14] we present an approach for mining visual actions from real-world videos. Given a large number of movies, we want to automatically extract short video sequences corresponding to visual human actions. We can then visually discover which actions are performed and also collect training data for action recognition. We first retrieve action sequences corresponding to specific verbs extracted from the transcripts aligned with the videos. Not all of the samples visually characterize the action and, therefore, we rank these videos by visual consistency. Negative samples are obtained by randomly sampling the rest of the videos. We propose a novel ranking algorithm using an iterative re-training scheme for Support Vector Regression machines (SVR) referred to as 'iter-SVR'. Experimental results explore actions in 144 episodes (more than 100 hours) of the TV series “Buffy the Vampire Slayer” and show that our iter-SVR approach outperforms other commonly used approaches. Examples of retrieved actions are shown in Figure 12 .

Figure 12. Key frames of the top 5 'walk' and 'kiss' samples and of the first false positive (FP) for each action (rank 30 for 'walk', rank 37 for 'kiss'), obtained with our iter-SVR method.
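The iterative re-training idea behind iter-SVR can be sketched as follows. This is a heavily hedged toy: the least-squares `fit_line`/`predict_line` pair is a stand-in for an actual Support Vector Regression machine, and the clamp-to-[0, 1] target-update rule is an illustrative assumption, not the published algorithm.

```python
# Sketch of an iterative re-training ranking loop in the spirit of iter-SVR:
# a regressor is repeatedly trained on noisy positive candidates (initial
# target 1) and random negatives (target 0), and the candidates' regression
# targets are refreshed from the current model so visually consistent samples
# drift toward higher rank. train/predict are pluggable stand-ins for an SVR.

def fit_line(data):
    """Toy regressor: 1-D least-squares fit y = w*x + b."""
    xs = [x for x, _ in data]
    ys = [y for _, y in data]
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    var = sum((x - mx) ** 2 for x in xs)
    w = sum((x - mx) * (y - my) for x, y in data) / var if var else 0.0
    return w, my - w * mx

def predict_line(model, x):
    w, b = model
    return w * x + b

def iterative_rerank(candidates, negatives, train, predict, iterations=5):
    """Re-train, re-score, and update candidate targets for a few rounds."""
    targets = {id(x): 1.0 for x in candidates}  # initial soft labels
    model = None
    for _ in range(iterations):
        data = ([(x, targets[id(x)]) for x in candidates]
                + [(x, 0.0) for x in negatives])
        model = train(data)
        for x in candidates:  # re-estimate targets, clamped to [0, 1]
            targets[id(x)] = min(1.0, max(0.0, predict(model, x)))
    return sorted(candidates, key=lambda x: -predict(model, x))

# Toy 1-D features: high values are visually consistent with the action.
ranked = iterative_rerank([0.9, 0.8, 0.2], [0.1, 0.0], fit_line, predict_line)
print(ranked)  # candidates with consistently high scores rank first
```

The point of the iteration is that samples retrieved from transcripts are only weak positives; letting the model revise their targets down-weights the visually inconsistent ones before the final ranking.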

Learning human actions and their context

Participants : Ivan Laptev [ INRIA Rocquencourt ] , Marcin Marszałek [ University of Oxford ] , Cordelia Schmid.

We exploit the context of natural dynamic scenes for human action recognition in video [21] . Human actions are frequently constrained by the purpose and the physical properties of scenes and demonstrate high correlation with particular scene classes. For example, eating often happens in a kitchen while running is more common outdoors, cf. Figure 13 . Our contribution is three-fold: (a) we automatically discover relevant scene classes and their correlation with human actions, (b) we show how to learn selected scene classes from video without manual supervision and (c) we develop a joint framework for action and scene recognition and demonstrate improved recognition of both in natural video.

Our approach uses movie scripts as a means of automatic supervision for training. For selected action classes we identify correlated scene classes in these scripts and then retrieve video samples of actions and scenes for training using script-to-video alignment. Our visual models for scenes and actions are formulated within the bag-of-features framework and are combined in a joint scene-action classifier. We validate the method on a new large dataset with twelve action classes and ten scene classes acquired from 69 movies.
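One way to picture the joint scene-action scoring is as a linear late fusion of the action classifier with scene evidence weighted by the mined action-scene correlations. The function, weights, and fusion rule below are hypothetical illustrations; the actual joint classifier is learned, not hand-set.

```python
# Illustrative fusion of an action classifier score with correlated scene
# classifier scores. Correlation weights and class names are toy assumptions.

def joint_score(action, action_score, scene_scores, correlation, alpha=0.5):
    """Blend the action score with correlation-weighted scene evidence."""
    context = sum(correlation.get((action, scene), 0.0) * s
                  for scene, s in scene_scores.items())
    return (1.0 - alpha) * action_score + alpha * context

# Toy example: 'eating' correlates with kitchen and restaurant scenes.
corr = {("eating", "kitchen"): 0.8, ("eating", "restaurant"): 0.6,
        ("running", "road"): 0.9}
scenes = {"kitchen": 0.7, "restaurant": 0.1, "road": 0.0}
print(joint_score("eating", 0.4, scenes, corr))  # ≈ 0.51
```

A weak action score (0.4 here) is boosted when the scene classifiers fire on scene classes correlated with that action, which is exactly the effect the contextual model is meant to capture.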

Experimental results demonstrate the gain in performance for action classification when using contextual scene information.

Figure 13. Video samples from our dataset with automatically assigned labels for the action (eating in these examples), and the context.
(a) eating, kitchen (b) eating, cafe (c) eating, restaurant

