Section: New Results
Dynamic event modeling, learning and recognition
View-independent action recognition from temporal self-similarities
Participants : Émilie Dexter, Ivan Laptev, Patrick Pérez.
In this work, we address the problem of recognizing human actions under view changes. We explore self-similarities of action sequences over time and observe the striking stability of such measures across views. Building upon this key observation, we develop an action descriptor that captures the structure of temporal similarities and dissimilarities within an action sequence. Although this temporal self-similarity descriptor is not strictly view-invariant, we provide intuition and experimental validation demonstrating its high stability under view changes. Self-similarity descriptors are also shown to be stable under performance variations within a class of actions when individual speed fluctuations are ignored. If required, such fluctuations between two different instances of the same action class can be explicitly recovered with dynamic time warping, as we demonstrate, to achieve cross-view action synchronization. More central to the present work, the temporal ordering of local self-similarity descriptors can simply be ignored within a bag-of-features type of approach. Sufficient action discrimination is still retained this way to build a view-independent action recognition system. Interestingly, self-similarities computed from different image features possess similar properties and can be used in a complementary fashion. Our method is simple and requires neither structure recovery nor multi-view correspondence estimation. Instead, it relies on weak geometric properties and combines them with machine learning for efficient cross-view action recognition. The method is validated on three public datasets. It achieves similar or superior performance compared to related methods and performs well even in extreme conditions, such as recognizing actions from top views while using side views only for training.
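The starting point of this approach can be sketched in a few lines: a temporal self-similarity matrix holds the pairwise distances between per-frame feature vectors (e.g. stacked joint positions). This is a minimal illustrative sketch of that matrix, not the published log-polar local descriptor built on top of it; the rotation experiment below merely illustrates why such measures are stable across views.

```python
import numpy as np

def self_similarity_matrix(seq):
    """Temporal self-similarity matrix (SSM) of a sequence.

    seq: (T, d) array with one feature vector per frame, e.g. the
    stacked 2D positions of tracked body joints. Entry (i, j) is the
    Euclidean distance between the features of frames i and j.
    """
    seq = np.asarray(seq, dtype=float)
    diff = seq[:, None, :] - seq[None, :, :]   # (T, T, d) pairwise differences
    return np.linalg.norm(diff, axis=-1)       # (T, T) symmetric, zero diagonal
```

Because a common in-plane rotation of all joints is an orthogonal transform of each stacked frame vector, the SSM is exactly unchanged by it, which gives some intuition for the observed cross-view stability.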
Automatic Annotation of Human Actions in Video
Participant : Ivan Laptev.
[In collaboration with O. Duchenne, J. Sivic, F. Bach and J. Ponce, Willow project-team]
This work addresses the problem of automatic temporal annotation of realistic human actions in video using minimal manual supervision. To this end we consider two associated problems: (a) weakly supervised learning of action models from readily available annotations, and (b) temporal localization of human actions in test videos. To avoid the prohibitive cost of manual annotation for training, we use movie scripts as a means of weak supervision. Scripts, however, provide only implicit, sometimes noisy, and imprecise information about the type and location of actions in video (cf. Figure 1(a)). We address this problem with a kernel-based discriminative clustering algorithm that locates actions in the weakly labeled training data (cf. Figure 1(b)). Using the obtained action samples, we train temporal action detectors and apply them to locate actions in the raw video data. Our experiments demonstrate that the proposed method for weakly supervised learning of action models leads to significant improvements in action detection. We present detection results for three action classes in four feature-length movies with challenging and realistic video data.
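The temporal localization step can be illustrated with a generic sliding-window scheme: per-frame classifier scores are pooled over fixed-length windows, and greedy temporal non-maximum suppression keeps the strongest non-overlapping detections. Window length, mean pooling and the threshold are illustrative choices for this sketch, not the detectors trained in the paper.

```python
import numpy as np

def detect_actions(frame_scores, win_len, threshold):
    """Return non-overlapping temporal windows (start, end, score)
    whose mean per-frame score exceeds a threshold, strongest first,
    via greedy temporal non-maximum suppression."""
    scores = np.asarray(frame_scores, dtype=float)
    n = len(scores) - win_len + 1
    if n <= 0:
        return []
    win = np.array([scores[i:i + win_len].mean() for i in range(n)])
    taken = np.zeros(len(scores), dtype=bool)  # frames already claimed
    detections = []
    for i in np.argsort(win)[::-1]:            # best windows first
        if win[i] < threshold:
            break
        if taken[i:i + win_len].any():         # overlaps a stronger detection
            continue
        taken[i:i + win_len] = True
        detections.append((int(i), int(i) + win_len, float(win[i])))
    return sorted(detections)
```

In a real system the per-frame scores would come from the classifier trained on the discriminatively clustered script-aligned samples.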
Actions in Context
Participant : Ivan Laptev.
[In collaboration with M. Marszałek and C. Schmid, Lear project-team]
We exploit the context of natural dynamic scenes for human action recognition in video. Human actions are frequently constrained by the purpose and the physical properties of scenes and demonstrate high correlation with particular scene classes. For example, eating often happens in a kitchen while running is more common outdoors (cf. Figure 2). The contribution of this work is three-fold: (a) we automatically discover relevant scene classes and their correlation with human actions, (b) we show how to learn selected scene classes from video without manual supervision, and (c) we develop a joint framework for action and scene recognition and demonstrate improved recognition of both in natural video. We use movie scripts as a means of automatic supervision for training. For selected action classes we identify correlated scene classes in text and then retrieve video samples of actions and scenes for training using script-to-video alignment. Our visual models for scenes and actions are formulated within the bag-of-features framework and are combined in a joint scene-action SVM-based classifier. We report experimental results and validate the method on a new large dataset with twelve action classes and ten scene classes acquired from 69 movies.
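A common way to combine two bag-of-features channels in a kernel SVM is to sum per-channel kernels, which is sketched below with an exponentiated chi-square kernel, a standard choice for comparing histograms. The channel weight `w` and the `gamma` parameter are illustrative assumptions, not values from the paper; the combined Gram matrix would be fed to any kernel SVM.

```python
import numpy as np

def chi2_kernel(X, Y, gamma=1.0):
    """Exponentiated chi-square kernel between rows of X and Y,
    a standard similarity for bag-of-features histograms."""
    X, Y = np.asarray(X, float), np.asarray(Y, float)
    num = (X[:, None, :] - Y[None, :, :]) ** 2
    den = X[:, None, :] + Y[None, :, :] + 1e-10   # avoid division by zero
    d = 0.5 * (num / den).sum(axis=-1)            # chi-square distance
    return np.exp(-gamma * d)

def joint_kernel(act_X, act_Y, scn_X, scn_Y, w=0.5):
    """Weighted sum of the action-channel and scene-channel kernels;
    w balances the two channels (illustrative value)."""
    return w * chi2_kernel(act_X, act_Y) + (1.0 - w) * chi2_kernel(scn_X, scn_Y)
```

A weighted sum of valid kernels is itself a valid kernel, so the combined classifier stays a standard SVM problem.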
Evaluation of local spatio-temporal features for action recognition
Participants : Muneeb Ullah, Ivan Laptev.
[In collaboration with H. Wang, A. Kläser and C. Schmid, Lear project-team]
Local space-time features have recently become a popular video representation for action recognition. Several methods for feature localization and description have been proposed in the literature, and promising recognition results have been demonstrated for a number of action classes. The comparison of existing methods, however, is often limited given the different experimental settings used. The purpose of this work is to evaluate and compare previously proposed space-time features in a common experimental setup. In particular, we consider four different feature detectors and six local feature descriptors and use a standard bag-of-features SVM approach for action recognition. We investigate the performance of these methods on a total of 25 action classes distributed over three datasets of varying difficulty. Among other interesting conclusions, we demonstrate that regular sampling of space-time features consistently outperforms all tested space-time interest point detectors for human actions in realistic settings. We also demonstrate a consistent ranking for the majority of methods over different datasets and discuss their advantages and limitations.
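The shared bag-of-features pipeline and the dense-sampling baseline can be sketched as follows: local descriptors are assigned to their nearest visual word and accumulated into a normalised histogram, while dense sampling simply places patch centres on a regular space-time grid. A minimal sketch assuming the vocabulary has already been computed (e.g. by k-means on training descriptors); stride and grid layout are illustrative.

```python
import numpy as np

def bof_histogram(descriptors, vocabulary):
    """Assign each local space-time descriptor to its nearest visual
    word and return an L1-normalised word histogram (the input to the
    bag-of-features SVM)."""
    D = np.asarray(descriptors, float)   # (n, d) local descriptors
    V = np.asarray(vocabulary, float)    # (k, d) visual words
    d2 = ((D[:, None, :] - V[None, :, :]) ** 2).sum(axis=-1)
    words = d2.argmin(axis=1)            # nearest-word index per descriptor
    hist = np.bincount(words, minlength=len(V)).astype(float)
    return hist / max(hist.sum(), 1.0)

def dense_grid(n_frames, height, width, stride):
    """Regularly sampled space-time patch centres (t, y, x) -- the
    dense-sampling baseline compared against interest-point detectors."""
    ts = np.arange(0, n_frames, stride)
    ys = np.arange(0, height, stride)
    xs = np.arange(0, width, stride)
    return np.array([(t, y, x) for t in ts for y in ys for x in xs])
```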
Video content recognition using trajectories
Participants : Alexandre Hervieu, Patrick Bouthemy, Jean-Pierre Le Cadre (Aspi project-team).
Content-based exploitation of video documents is of continuously increasing interest in numerous applications, e.g., retrieving video sequences in huge TV archives, creating summaries of sports TV programs, or detecting specific actions or activities in video surveillance. Considering 2D trajectories computed from image sequences is attractive since they capture elaborate space-time information on the viewed actions. Methods for tracking moving objects in an image sequence are now available to obtain sufficiently reliable 2D trajectories in various situations. Our approach takes into account both the trajectory shape (geometrical information related to the type of motion) and the speed change of the moving object along its trajectory (dynamics-related information). Owing to the trajectory features we have specified (local differential features combining curvature and motion magnitude), the designed method is invariant to translation, rotation and scaling while accounting for both shape and dynamics-related information on the trajectories. A novel hidden Markov model (HMM) framework is proposed which is able, in particular, to handle small sets of observations. Parameter setting is properly addressed. A similarity measure between HMMs is defined and exploited to tackle three dynamic video content understanding tasks: supervised recognition, clustering and detection of unexpected events. We have conducted experiments on several significant sets of real videos, including sports videos. Then, hierarchical semi-Markov chains are introduced to process trajectories of several interacting moving objects. The temporal interactions between trajectories are taken into account and exploited to characterize relevant phases of the activities in the processed videos. Our method has been favorably evaluated on sets of trajectories extracted from squash and handball videos.
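Local differential trajectory features of this kind can be sketched as per-point motion magnitude and curvature estimated by finite differences. This is an illustrative version: curvature is already invariant to translation and rotation, but the exact combination and normalisation yielding the published scale invariance may differ.

```python
import numpy as np

def trajectory_features(traj):
    """Per-point motion magnitude (speed) and curvature of a 2D
    trajectory, using finite-difference derivatives."""
    p = np.asarray(traj, float)              # (T, 2) image positions
    v = np.gradient(p, axis=0)               # velocity estimate
    a = np.gradient(v, axis=0)               # acceleration estimate
    speed = np.linalg.norm(v, axis=1)        # motion magnitude
    cross = v[:, 0] * a[:, 1] - v[:, 1] * a[:, 0]
    curvature = np.abs(cross) / np.maximum(speed ** 3, 1e-10)
    return speed, curvature
```

On a straight segment the curvature is zero, and on a circular arc of radius r it approaches 1/r, which makes the geometric meaning of the shape term easy to check.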
Such interaction-based models have also been extended to 3D gesture and action recognition, clustering, and the temporal segmentation of actions. The results show that taking interactions into account is of great interest for these applications.