Section: Scientific Foundations
Robust view-invariant Computer Vision
Summary
A long-term grand challenge in computer vision has been to develop a descriptor for image information that can be reliably used for a wide variety of computer vision tasks. Such a descriptor must capture the information in an image in a manner that is robust to changes in the relative position of the camera as well as the position, pattern and spectrum of illumination.
Members of PRIMA have a long history of innovation in this area, with important results in the area of multi-resolution pyramids, scale invariant image description, appearance based object recognition and receptive field histograms published over the last 20 years. The group has most recently developed a new approach that extends scale invariant feature points for the description of elongated objects using scale invariant ridges. PRIMA is currently working with ST Microelectronics to embed its multi-resolution receptive field algorithms into low-cost mobile imaging devices for video communications and mobile computing applications.
Detailed Description
The visual appearance of a neighbourhood can be described by a local Taylor series [42]. The coefficients of this series constitute a feature vector that compactly represents the neighbourhood appearance for indexing and matching. The set of possible local image neighbourhoods that project to the same feature vector is referred to as the "Local Jet". A key problem in computing the local jet is determining the scale at which to evaluate the image derivatives.
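As a minimal sketch of this idea, the following Python code evaluates a second-order local jet at a single pixel by convolving the image with sampled Gaussian derivative kernels at one fixed scale. The function names, the 5-sigma kernel radius and the list-of-rows image layout are assumptions made for illustration; they are not taken from [42].

```python
import math

def gaussian_kernels(sigma, radius):
    """Sampled 1-D Gaussian and its first and second derivatives."""
    offsets = list(range(-radius, radius + 1))
    norm = 1.0 / (math.sqrt(2.0 * math.pi) * sigma)
    g = [norm * math.exp(-(u * u) / (2.0 * sigma * sigma)) for u in offsets]
    g1 = [-u / sigma ** 2 * w for u, w in zip(offsets, g)]
    g2 = [(u * u / sigma ** 2 - 1.0) / sigma ** 2 * w for u, w in zip(offsets, g)]
    return g, g1, g2

def local_jet(image, x, y, sigma):
    """Second-order local jet at pixel (x, y); image is a list of rows.

    The pixel must lie at least ceil(5 * sigma) samples from every border.
    """
    radius = int(math.ceil(5.0 * sigma))
    g, g1, g2 = gaussian_kernels(sigma, radius)

    def response(kernel_x, kernel_y):
        # Separable 2-D convolution, evaluated at one point only.
        total = 0.0
        for j in range(2 * radius + 1):
            row = image[y - (j - radius)]
            for i in range(2 * radius + 1):
                total += row[x - (i - radius)] * kernel_x[i] * kernel_y[j]
        return total

    return {
        "L": response(g, g),
        "Lx": response(g1, g), "Ly": response(g, g1),
        "Lxx": response(g2, g), "Lxy": response(g1, g1), "Lyy": response(g, g2),
    }

# Usage on a synthetic bilinear image I(x, y) = 2 + 0.5x - 0.25y + 0.1xy.
image = [[2 + 0.5 * x - 0.25 * y + 0.1 * x * y for x in range(21)]
         for y in range(21)]
jet = local_jet(image, 10, 10, sigma=1.5)
```

For this synthetic image the analytic derivatives at (10, 10) are Lx = 1.5, Ly = 0.75 and Lxy = 0.1, and the sampled kernels reproduce them up to a small truncation error. The six values in `jet` play the role of the feature vector described above.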
Lindeberg [43] has described scale invariant features based on profiles of Gaussian derivatives across scales. In particular, the profile of the Laplacian, evaluated over a range of scales at an image point, provides a local description that is "equi-variant" to changes in scale. Equi-variance means that the feature vector translates exactly with scale and can thus be used to track, index, match and recognize structures in the presence of changes in scale.
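The equi-variance property can be illustrated with an idealized case. The sketch below assumes a unit-height Gaussian blob as the image structure, for which the scale-normalized Laplacian at the blob centre has a closed form, and samples it on a logarithmic scale axis. The blob model, the half-octave sampling and all function names are assumptions chosen for illustration, not the implementation used in the cited work.

```python
import math

def normalized_laplacian_at_blob_centre(sigma, blob_sigma):
    """Magnitude of sigma^2 * Laplacian at the centre of a Gaussian blob.

    Closed form for a unit-height blob of standard deviation blob_sigma
    observed through a Gaussian of standard deviation sigma.
    """
    return 2.0 * sigma ** 2 * blob_sigma ** 2 / (sigma ** 2 + blob_sigma ** 2) ** 2

def scale_profile(blob_sigma, sigma0=1.0, steps_per_octave=2, n_scales=9):
    """Laplacian profile sampled at sigma0 * 2^(k / steps_per_octave)."""
    scales = [sigma0 * 2.0 ** (k / steps_per_octave) for k in range(n_scales)]
    profile = [normalized_laplacian_at_blob_centre(s, blob_sigma) for s in scales]
    return scales, profile

def intrinsic_scale(scales, profile):
    """Index and value of the scale at which the profile is maximal."""
    k = max(range(len(profile)), key=profile.__getitem__)
    return k, scales[k]

# Doubling the size of the structure shifts the profile by one octave,
# which is two samples on this half-octave scale axis.
scales, profile_small = scale_profile(blob_sigma=2.0)
_, profile_large = scale_profile(blob_sigma=4.0)
peak_small = intrinsic_scale(scales, profile_small)
peak_large = intrinsic_scale(scales, profile_large)
```

In this idealized setting the profile depends only on the ratio of the filter scale to the blob size, so the whole profile translates along the logarithmic scale axis when the blob is resized, and the position of its maximum gives the intrinsic scale.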
A receptive field is a local function defined over a region of an image [53]. We employ a set of receptive fields based on derivatives of Gaussian functions as a basis for describing local appearance. These functions resemble the receptive fields observed in the visual cortex of mammals. These receptive fields are applied to color images in which we have separated the chrominance and luminance components. Such functions are easily normalized to an intrinsic scale using the maximum of the Laplacian [43], and normalized in orientation using the direction of the first derivatives [53].
Local maxima in x, y and scale of the product of a Laplacian operator with the image provide "natural interest points" [44]. Such natural interest points are salient points that may be robustly detected and used for matching. A problem with this approach is that the computational cost of determining the intrinsic scale at each image position can make real-time implementation infeasible.
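A minimal sketch of such a detector is given below. It assumes that the scale-normalized Laplacian magnitude has already been computed and stored as a stack of equally sized 2-D arrays indexed as `stack[k][y][x]`; this layout, the threshold parameter and the function name are illustrative assumptions. A point is kept when its response exceeds all 26 neighbours in position and scale.

```python
def scale_space_maxima(stack, threshold=0.0):
    """Return (x, y, k) triples that are strict local maxima of the stack.

    stack[k][y][x] holds the Laplacian magnitude at scale index k.
    Border samples in x, y and k are skipped because they lack neighbours.
    """
    peaks = []
    for k in range(1, len(stack) - 1):
        for y in range(1, len(stack[k]) - 1):
            for x in range(1, len(stack[k][y]) - 1):
                value = stack[k][y][x]
                if value <= threshold:
                    continue
                neighbours = [
                    stack[k + dk][y + dy][x + dx]
                    for dk in (-1, 0, 1)
                    for dy in (-1, 0, 1)
                    for dx in (-1, 0, 1)
                    if (dk, dy, dx) != (0, 0, 0)
                ]
                if value > max(neighbours):
                    peaks.append((x, y, k))
    return peaks

# Usage: a 5 x 5 x 5 stack of small values with one dominant response.
stack = [[[0.1 for _ in range(5)] for _ in range(5)] for _ in range(5)]
stack[2][3][1] = 1.0
peaks = scale_space_maxima(stack, threshold=0.5)
```

The exhaustive search over every position and scale in this sketch also makes the cost concern concrete: the work grows with the number of pixels times the number of scales examined.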
A vector of scale and orientation normalized Gaussian derivatives provides a characteristic vector for matching and indexing. The oriented Gaussian derivatives can easily be synthesized using the "steerability property" [33] of Gaussian derivatives. The problem is to determine the appropriate orientation. In earlier work, PRIMA members Colin de Verdiere [29], Schiele [53] and Hall [37] proposed normalising the local jet independently at each pixel to the direction of the first derivatives calculated at the intrinsic scale. This has provided promising results for many view invariant image recognition tasks, as described in the next section.
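The steerability property can be sketched as follows. Because the filters are linear, steering the filter responses is equivalent to steering the filters themselves, so the code below works directly on the first and second derivative responses at one pixel. The function names and the choice of returned components are assumptions made for illustration.

```python
import math

def steer_first_order(lx, ly, theta):
    """First derivative response in direction theta."""
    return math.cos(theta) * lx + math.sin(theta) * ly

def steer_second_order(lxx, lxy, lyy, theta):
    """Second derivative response in direction theta."""
    c, s = math.cos(theta), math.sin(theta)
    return c * c * lxx + 2.0 * c * s * lxy + s * s * lyy

def normalise_orientation(lx, ly, lxx, lxy, lyy):
    """Steer the jet to the direction of the first derivatives.

    Returns the local orientation and the steered responses: first
    derivative along and across that direction, then second derivative
    along and across it.
    """
    theta = math.atan2(ly, lx)
    across_angle = theta + math.pi / 2.0
    features = (
        steer_first_order(lx, ly, theta),
        steer_first_order(lx, ly, across_angle),
        steer_second_order(lxx, lxy, lyy, theta),
        steer_second_order(lxx, lxy, lyy, across_angle),
    )
    return theta, features

# Usage: a gradient of (3, 4) steers to a single response of magnitude 5.
theta, features = normalise_orientation(3.0, 4.0, 1.0, 0.5, -2.0)
```

After steering, the first derivative across the local orientation vanishes and the remaining components no longer depend on the rotation of the image in the plane.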
Color is a powerful discriminator for object recognition. Color images are commonly acquired in the Cartesian color space, RGB. The RGB color space has certain advantages for image acquisition, but is not the most appropriate space for recognizing objects or describing their shape. An alternative is to compute a Cartesian representation for chrominance, using differences of R, G and B. Such differences yield color opponent receptive fields resembling those found in biological visual systems.
Our work in this area uses a family of steerable color opponent filters developed by Daniela Hall [37]. These filters transform an (R,G,B) color vector into a Cartesian representation for luminance and chrominance (L,C1,C2). The components C1 and C2 encode the chromatic information in a Cartesian representation, while L is the luminance direction. Chromatic Gaussian receptive fields are computed by applying the Gaussian derivatives independently to each of the three components (L, C1, C2). Permutations of RGB lead to different opponent color spaces. The choice of the most appropriate space depends on the chromatic composition of the scene. An example of a second order steerable chromatic basis is the set of color opponent filters shown in figure 2.
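The following sketch shows one way such a transform could look. The coefficients are an assumption chosen only to illustrate the structure (a mean for luminance and two channel differences for chrominance); the exact coefficients of the filters in [37] are not given in this text. The `order` parameter is a hypothetical way of expressing the permutations of RGB mentioned above.

```python
def rgb_to_opponent(r, g, b, order=("r", "g", "b")):
    """Map one (R, G, B) pixel to (L, C1, C2).

    Assumed coefficients, for illustration only:
      L  = (a + b + c) / 3       luminance
      C1 = (a - b) / 2           first opponent channel
      C2 = (a + b) / 2 - c       second opponent channel
    where (a, b, c) is the RGB triple permuted according to `order`.
    """
    channels = {"r": r, "g": g, "b": b}
    first, second, third = (channels[name] for name in order)
    luminance = (first + second + third) / 3.0
    c1 = (first - second) / 2.0
    c2 = (first + second) / 2.0 - third
    return luminance, c1, c2

# Usage: a grey pixel carries no chrominance; a red pixel does.
grey = rgb_to_opponent(0.5, 0.5, 0.5)
red = rgb_to_opponent(1.0, 0.0, 0.0)
red_permuted = rgb_to_opponent(1.0, 0.0, 0.0, order=("g", "b", "r"))
```

With any coefficients of this form, an achromatic pixel maps to C1 = C2 = 0, so the chrominance components respond only to color differences. The Gaussian derivatives would then be applied to each of the three resulting component images separately.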
Key results in this area include:
- Fast, video rate calculation of scale and orientation for image description with normalized chromatic receptive fields [32].
- Real time indexing and recognition using a novel indexing tree to represent multi-dimensional receptive field histograms [47].
- Affine invariant detection and tracking using natural interest lines [55].
- Direct computation of time to collision over the entire visual field using rate of change of intrinsic scale [46].
We have achieved video rate calculation of intrinsic (characteristic) scale from interpolation within a Binomial Pyramid computed using an O(N) algorithm [32] . This software provides a practical method for obtaining invariant image features for detection, tracking and recognition at video rates. This method has been used in the real time BrandDetect system, for detecting publicity panels in broadcast video of sports events, as described below.
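To indicate why a binomial pyramid can be computed in O(N), the sketch below builds a pyramid on a 1-D signal by repeated smoothing with the binomial kernel [1, 4, 6, 4, 1]/16 followed by resampling by two. The kernel choice, the border handling and the factor-of-two resampling are assumptions for illustration and are not claimed to match the algorithm of [32]; the interpolation of intrinsic scale between levels is not shown.

```python
BINOMIAL_KERNEL = (1.0 / 16, 4.0 / 16, 6.0 / 16, 4.0 / 16, 1.0 / 16)

def binomial_smooth(signal):
    """Smooth with the 5-tap binomial kernel (variance 1), replicating borders."""
    n = len(signal)
    smoothed = []
    for i in range(n):
        total = 0.0
        for tap, weight in enumerate(BINOMIAL_KERNEL):
            j = min(max(i + tap - 2, 0), n - 1)  # clamp index at the borders
            total += weight * signal[j]
        smoothed.append(total)
    return smoothed

def build_binomial_pyramid(signal, levels):
    """Return a list of levels, each smoothed and resampled by two."""
    pyramid = [list(signal)]
    for _ in range(levels - 1):
        pyramid.append(binomial_smooth(pyramid[-1])[::2])
    return pyramid

# Usage: the level sizes form a geometric series, so the total number of
# samples (and hence the work at a fixed cost per sample) stays below 2N.
signal = [3.0] * 64
pyramid = build_binomial_pyramid(signal, levels=4)
total_samples = sum(len(level) for level in pyramid)
```

Each level costs a fixed number of operations per sample, and the level sizes halve each time, which is what bounds the total cost by a constant multiple of the input size in this sketch.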
Daniela Hall and Nicolas Gourier have developed machine learning techniques to statistically learn robust visual features for face tracking [36], [35].
Augustin Lux and Hai Tranh have developed a method which provides a direct measurement of affine invariant local features based on extending natural interest points to "natural interest ridges" [57], [56]. The orientation of natural interest ridges provides a local orientation in the region of an image structure. Early results indicate an important gain in discrimination rates compared to SIFT and other histogram based detection approaches. An example of the dominant interest ridges used for tracking of people in the entrance hall of INRIA Rhone Alpes is shown in figure 3.
Amaury Negre has adapted the scale invariant ridge detection for use in detection and tracking of obstacles for autonomous vehicle navigation. In this work, the characteristic size of objects, provided by the scale of dominant ridges, is directly used to calculate time to contact. Using this method, he has demonstrated direct computation of time to contact using rate of change of intrinsic scale [46]. This approach is currently being adapted for use in visual navigation in joint work with project EMOTION. Amaury Negre defended his doctoral dissertation on these methods in March 2009, and has joined the PRIMA team as a CNRS Research Engineer.
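The relation between intrinsic scale and time to contact can be sketched under two assumptions that are not stated in this text: pinhole projection, so that image scale is inversely proportional to distance, and a constant closing speed between frames. Under these assumptions the image scale s satisfies s / (ds/dt) = Z / V, the time to contact. The function name and the two-frame finite difference below are illustrative choices, not the method of [46].

```python
import math

def time_to_contact(scale_prev, scale_now, dt):
    """Estimate time to contact from intrinsic scale in two frames.

    Assumes pinhole projection and constant closing speed. With
    scale = k / distance, the expression scale_prev * dt / (scale_now -
    scale_prev) equals distance_now / speed for those two frames.
    Returns math.inf when the structure is not growing in the image.
    """
    growth = scale_now - scale_prev
    if growth <= 0.0:
        return math.inf
    return scale_prev * dt / growth

# Usage: an object 10.5 m away approaching at 5 m/s, observed 0.1 s apart.
k = 100.0                       # focal length times object size (arbitrary)
scale_prev = k / 10.5           # distance at the previous frame
scale_now = k / 10.0            # distance at the current frame
ttc = time_to_contact(scale_prev, scale_now, dt=0.1)
```

In this example the remaining distance is 10 m at 5 m/s, so the estimate corresponds to 2 s. Note that neither the object size nor the focal length is needed, only the ratio of scales, which is what makes the measurement direct.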
Doctoral student Jean-Pascal Mercier has recently begun work extending the ridge description methods to 3D and 4D spatio-temporal volumes in order to detect structures for recognizing human actions.