Section: Scientific Foundations
Motion estimation and motion segmentation with MRF models
In recent years, we have addressed several important issues related to visual motion analysis, focusing in particular on the type of motion information to be estimated and on the way contextual information is expressed and exploited. Assumptions (i.e., data models) must be formulated to relate the observed image intensities to motion, and other constraints (i.e., motion models) must be added to solve problems like motion segmentation, optical flow computation, or motion recognition. The motion models are supposed to capture known, expected or learned properties of the motion field; this implies introducing spatial coherence or, more generally, contextual information. The latter can be formalized in a probabilistic way with local conditional densities, as in Markov models. It can also rely on predefined spatial supports (e.g., blocks or pre-segmented regions). The classic mathematical expressions associated with visual motion information are of two types. Some are continuous variables that represent velocity vectors or parametric motion models. The others are discrete variables or symbolic labels that code motion detection (binary labels), motion segmentation (numbers of the motion regions or layers) or motion recognition output (motion class labels). We have also recently introduced new models, called mixed-state models and mixed-state auto-models, whose variables belong to a domain formed by the union of discrete and continuous values. We briefly describe here how such models can be specified and exploited in two central motion analysis issues: motion segmentation and motion estimation.
The brightness constancy assumption along the trajectory of a moving point $p(t)$ in the image plane, with $p(t) = (x(t), y(t))$, can be expressed as $dI(x(t), y(t), t)/dt = 0$, with $I$ denoting the image intensity function. By applying the chain rule, we get the well-known motion constraint equation:

$$ r(p, t) = w(p) \cdot \nabla I(p, t) + I_t(p, t) = 0, \qquad (1) $$

where $w(p) = (dx/dt, dy/dt)$ is the velocity vector at $p$, $\nabla I$ denotes the spatial gradient of the intensity, with $\nabla I = (I_x, I_y)$, and $I_t$ its partial temporal derivative. The above equation can be straightforwardly extended to the case where a parametric motion model is considered, and we can write:

$$ r_\theta(p, t) = w_\theta(p) \cdot \nabla I(p, t) + I_t(p, t) = 0, \qquad (2) $$

where $\theta$ denotes the vector of motion model parameters.
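As an illustration of the motion constraint, the following minimal Python sketch evaluates the residual r(p) on a synthetic pattern that translates at constant velocity. The test pattern, the finite-difference step `h` and all function names are assumptions made for this example only. The six-parameter form (a1, ..., a6) is one common convention for a 2D affine motion model and is not taken from the text.

```python
import math

# Hypothetical smooth test pattern translating with constant velocity (u, v).
def intensity(x, y, t, u=0.5, v=-0.25):
    return math.sin(0.3 * (x - u * t)) + math.cos(0.2 * (y - v * t))

def residual(x, y, t, w, h=1e-4):
    """r(p) = w . grad I + I_t, with derivatives taken by central differences."""
    ix = (intensity(x + h, y, t) - intensity(x - h, y, t)) / (2 * h)
    iy = (intensity(x, y + h, t) - intensity(x, y - h, t)) / (2 * h)
    it = (intensity(x, y, t + h) - intensity(x, y, t - h)) / (2 * h)
    return w[0] * ix + w[1] * iy + it

def affine_velocity(theta, x, y):
    """Velocity given by a 2D affine model (assumed parameter ordering)."""
    a1, a2, a3, a4, a5, a6 = theta
    return (a1 + a2 * x + a3 * y, a4 + a5 * x + a6 * y)

# The residual vanishes (up to discretization error) for the true velocity
# and not for a wrong one.
r_true = residual(3.0, 4.0, 1.0, (0.5, -0.25))
r_wrong = residual(3.0, 4.0, 1.0, (0.0, 0.0))
```

A pure translation is the special case of the affine model in which only a1 and a4 are non-zero, so `affine_velocity` can be passed directly to `residual` to evaluate the parametric form of the constraint.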
One important step forward in solving the motion segmentation problem was to formulate it as a statistical contextual labeling problem or, in other words, as a discrete Bayesian inference problem. Segmenting the moving objects is then equivalent to assigning the proper (symbolic) label (i.e., the region number) to each pixel in the image. The advantages are mainly two-fold. First, determining the support of each region is implicit and easy to handle: it merely results from extracting the connected components of pixels with the same label. Second, spatial coherence can be straightforwardly (and locally) expressed by exploiting MRF models. Here, by motion segmentation, we mean the competitive partitioning of the image into motion-based homogeneous regions. Formally, we have to determine the hidden discrete motion variables (i.e., region numbers) $l(i)$, where $i$ denotes a site (usually a pixel of the image grid; it could also be an elementary block). Let $l = \{l(i), i \in S\}$. Each label $l(i)$ takes its value in the set $\Lambda = \{1, \ldots, N_{reg}\}$, where $N_{reg}$ is also unknown. Moreover, the motion of each region is represented by a motion model (usually a 2D affine motion model with parameters $\theta$, which have to be jointly estimated; we have also explored non-parametric motion modeling [47]). Let $\Theta = \{\theta_k, k = 1, \ldots, N_{reg}\}$. The data model of relation (2) is used. The a priori on the motion label field (i.e., spatial coherence) is expressed by specifying an MRF model (the simplest choice is to favor configurations in which the two sites of a two-site clique carry the same label, so as to yield compact regions with regular boundaries). Adopting the Bayesian MAP criterion is then equivalent to minimizing an energy function $E$ whose expression can be written in the following general form:

$$ E(l, \Theta) = \sum_{i \in S} \rho_1\big[r_{\theta_{l(i)}}(i)\big] + \sum_{\langle i, j \rangle} \rho_2\big[l(i), l(j)\big], \qquad (3) $$

where $\langle i, j \rangle$ designates a two-site clique. We first considered [45] the quadratic function $\rho_1(x) = x^2$ for the data-driven term in (3). The minimization of the energy function $E$ was carried out on $l$ and $\Theta$ in an iterative alternate way, and the number of regions $N_{reg}$ was determined by introducing an extraneous label and using an appropriate statistical test. We later chose a robust estimator for $\rho_1$ [56], [57]. It allowed us to avoid the alternate minimization procedure and to determine or update the number of regions through an outlier process in every region.
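The labeling energy described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the procedure of [45], [56] or [57]: the prior is a Potts potential with a hypothetical weight `beta`, the data term is the quadratic function, the per-model residuals are supplied as precomputed arrays, labels are numbered from 0, and the labels are updated by iterated conditional modes (ICM), a simple deterministic scheme chosen here only for brevity.

```python
def potts(a, b, beta=1.0):
    """Two-site clique potential: penalize neighboring sites with different labels."""
    return 0.0 if a == b else beta

def neighbors(i, j, rows, cols):
    """4-neighborhood of site (i, j) on a rows x cols grid."""
    for di, dj in ((-1, 0), (1, 0), (0, -1), (0, 1)):
        ni, nj = i + di, j + dj
        if 0 <= ni < rows and 0 <= nj < cols:
            yield ni, nj

def energy(labels, residuals, beta=1.0):
    """Quadratic data term plus Potts prior over all two-site cliques.
    residuals[k][i][j] is the residual of motion model k at site (i, j)."""
    rows, cols = len(labels), len(labels[0])
    e = 0.0
    for i in range(rows):
        for j in range(cols):
            e += residuals[labels[i][j]][i][j] ** 2
            # Count each vertical and horizontal clique exactly once.
            if i + 1 < rows:
                e += potts(labels[i][j], labels[i + 1][j], beta)
            if j + 1 < cols:
                e += potts(labels[i][j], labels[i][j + 1], beta)
    return e

def icm_sweep(labels, residuals, beta=1.0):
    """One in-place ICM sweep: each site takes the label minimizing its local energy."""
    rows, cols = len(labels), len(labels[0])
    for i in range(rows):
        for j in range(cols):
            def local(k):
                prior = sum(potts(k, labels[ni][nj], beta)
                            for ni, nj in neighbors(i, j, rows, cols))
                return residuals[k][i][j] ** 2 + prior
            labels[i][j] = min(range(len(residuals)), key=local)
    return labels
```

Each site update can only lower or keep the global energy, so repeated sweeps converge to a local minimum. In a full alternate scheme, each sweep would be followed by a re-estimation of the motion parameters of every region, which in turn changes the residuals.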
Specifying (simple) MRF models at the pixel level (i.e., sites are pixels and a 4- or 8-neighbor system is considered) is efficient, but such models are limited in their ability to express more sophisticated properties of region geometry or to handle extended spatial interactions. Multigrid MRF models [50] are a means of partly addressing the second concern (and also of speeding up the minimization process while usually supplying better results). An alternative is to first segment the image into spatial regions (based on gray level, color or texture) and to specify an MRF model on the resulting graph of adjacent regions [48]. The motion region labels are then assigned to the nodes of the graph (which are the sites considered in that case). This allowed us to exploit more elaborate and less local a priori information on the geometry of the regions and their motion. However, the spatial segmentation stage is often time-consuming, and whether it yields an effective improvement in the final motion segmentation accuracy remains questionable.
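To make the region-level alternative concrete, the following Python sketch builds the graph of adjacent regions from a spatial segmentation map. It is a minimal illustration assuming that the segmentation is given as a 2D list of integer region ids and that adjacency is defined by 4-connectivity; the function name and data layout are hypothetical and are not taken from [48].

```python
def region_adjacency(seg):
    """Return {region_id: set of adjacent region ids} for a 2D map of region ids."""
    rows, cols = len(seg), len(seg[0])
    graph = {}
    for i in range(rows):
        for j in range(cols):
            a = seg[i][j]
            graph.setdefault(a, set())
            # Looking only down and right visits every 4-connected pair once.
            for ni, nj in ((i + 1, j), (i, j + 1)):
                if ni < rows and nj < cols:
                    b = seg[ni][nj]
                    if a != b:
                        graph[a].add(b)
                        graph.setdefault(b, set()).add(a)
    return graph

# Three spatial regions that are pairwise adjacent.
seg = [[0, 0, 1],
       [0, 2, 1],
       [2, 2, 1]]
graph = region_adjacency(seg)
```

The nodes of `graph` play the role of the sites, and its edges play the role of the two-site cliques on which a label prior can be defined.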
By definition, the velocity field formed by continuous vector variables is a complete representation of the motion information. Computing optical flow based on the data model of equation (1) requires adding a motion model that enforces the expected spatial properties of the motion field, that is, resorting to a regularization method. Such properties of spatial coherence (more specifically, piecewise continuity of the motion field) can be expressed on local spatial neighborhoods. The first methods to estimate discontinuous optical flow fields were based on MRF models associated with Bayesian inference (i.e., minimization of a discretized energy function). A general formulation of the global (discretized) energy function to be minimized to estimate the velocity field can be given by:

$$ E(w, \zeta) = \sum_{p \in S} \rho_1\big[r(p)\big] + \sum_{\langle p, q \rangle} \rho_2\big[\|w(p) - w(q)\|, \zeta(p'_{pq})\big] + \sum_{A \in \chi} \rho_3(\zeta_A), \qquad (4) $$

where $S$ designates the set of pixel sites, $r(p)$ is defined in (1), $\langle p, q \rangle$ designates a pair of neighboring pixel sites, $S' = \{p'\}$ is the set of discontinuity sites located midway between the pixel sites ($p'_{pq}$ being the one between $p$ and $q$), and $\chi$ is the set of cliques associated with the neighborhood system chosen on $S'$. We first used quadratic functions, and the motion discontinuities were handled by introducing a binary line process $\zeta$ [49]. Then, robust estimators were popularized, leading to the introduction of so-called auxiliary variables $\zeta$ now taking their values in $[0, 1]$ [55]. Multigrid MRF models are moreover involved, and multiresolution incremental schemes are exploited to compute optical flow in case of large displacements. Dense optical flow and parametric motion models can also be jointly considered and estimated, which makes it possible to supply a segmented velocity field [54]. Depending on the approach followed, the third term of the energy $E(w, \zeta)$ can be optional.
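The link between a robust estimator and an auxiliary variable in [0, 1] can be checked numerically. The Python sketch below assumes the Geman-McClure function with unit scale, which is only one possible robust estimator and not necessarily the one used in [55]; the function names are hypothetical. It relies on the identity that the robust cost equals the minimum over z of a weighted quadratic cost plus a penalty on z.

```python
def rho(x):
    """Geman-McClure robust function (scale fixed to 1 for simplicity)."""
    return x * x / (1.0 + x * x)

def psi(z):
    """Penalty on the auxiliary variable z, for z in (0, 1]."""
    return (z ** 0.5 - 1.0) ** 2

def best_weight(x):
    """Closed-form minimizer of z * x**2 + psi(z) with respect to z."""
    return 1.0 / (1.0 + x * x) ** 2

def augmented(x, z):
    """Half-quadratic cost: quadratic in x once z is frozen."""
    return z * x * x + psi(z)

# For a small difference the weight stays close to 1 (smoothing is applied);
# for a large one it drops toward 0, which acts like an active discontinuity.
w_small = best_weight(0.1)
w_large = best_weight(10.0)
```

With z frozen, the augmented cost is quadratic in the flow differences, so the minimization can alternate between a weighted least-squares step on the velocities and a closed-form update of the weights.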