Section: New Results
Realistic Face Reconstruction and 3D Face Tracking
Our 3D face tracking approach follows an analysis-by-synthesis scheme: a textured polygonal 3D model is used to detect the face position and expression in each frame of the sequence by minimizing the mean-square error between the generated synthetic image of the face and the real one. Since tracking performance (precision and stability of detection) relies heavily on the precision of the 3D model used, before tracking starts we adapt our deformable generic 3D model to the person in the video. The adaptation uses one or more images of the person taken from different views, and the matching is performed in two steps: the first step adapts certain predefined characteristic points, in parallel with the per-view camera calibration; the second adapts the contours in all the views. At the initialization step of tracking, the person-specific 3D model is positioned interactively on the first image of the sequence in order to acquire texture; after that the tracking process is fully automatic.
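The point-based adaptation step can be sketched as follows. This is an illustrative reconstruction rather than our actual implementation: the predefined characteristic points of the generic model are moved to their detected positions, and the displacements are propagated to the remaining mesh vertices with Gaussian radial basis functions (the kernel choice, the width sigma and all function names are assumptions for illustration).

```python
import numpy as np

def rbf_adapt(generic_vertices, src_points, dst_points, sigma=0.3):
    """Deform generic mesh vertices so that the predefined characteristic
    points (src_points) move to their detected positions (dst_points);
    displacements are propagated with Gaussian radial basis functions."""
    d = dst_points - src_points                       # (k, 3) target displacements
    # Pairwise distances between characteristic points
    dist = np.linalg.norm(src_points[:, None] - src_points[None, :], axis=-1)
    phi = np.exp(-(dist / sigma) ** 2)                # RBF kernel matrix (k, k)
    w = np.linalg.solve(phi + 1e-9 * np.eye(len(src_points)), d)  # RBF weights
    # Evaluate the interpolated displacement field at every mesh vertex
    dv = np.linalg.norm(generic_vertices[:, None] - src_points[None, :], axis=-1)
    return generic_vertices + np.exp(-(dv / sigma) ** 2) @ w
```

By construction the characteristic points land (up to the tiny regularization term) exactly on their targets, while vertices far from every characteristic point are left almost unchanged.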
For each frame of the video sequence our algorithm searches, with an iterative minimization tool, for an optimal set of system parameters p = (t, r, e), where t and r are the global head translation and rotation vectors respectively and e is the vector of parameters that control facial expression. The error function is the per-pixel difference between the current video frame image and the textured projection of the 3D model updated with the new values of the parameter vector p. The per-pixel difference is computed only for those pixels that are covered by the textured model projection and is measured by the Euclidean norm on the RGB components of the image. In order to avoid getting trapped in a local minimum we use the simulated annealing minimization algorithm. Because of the large total number of parameters to optimize, we split them into several independent groups and perform minimization over each group separately. These groups are:
rigid tracking (translation and rotation), which is performed in the first instance;
expression tracking - mouth part (lips, jaw);
expression tracking - eyes part (eyelids, eyeballs, eye squint);
expression tracking - eyebrows part;
expression tracking - nose, tongue (optional). For the expression-related groups the error is computed only from the sub-image containing the part of the face in question. The global tracking scheme is shown in Figure 1.
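The masked error computation and the per-group annealing described above can be sketched as follows. This is a minimal illustration, not our production code: the cooling schedule, step sizes and function names are assumptions, and the energy function stands in for the full render-and-compare loop.

```python
import numpy as np

rng = np.random.default_rng(0)

def pixel_error(frame, synth, mask):
    """Euclidean norm on the RGB components, computed only over the
    pixels covered by the textured model projection (mask)."""
    diff = (frame - synth)[mask]
    return np.sqrt((diff ** 2).sum())

def anneal_group(params, group, energy, steps=200, t0=1.0, scale=0.5):
    """Simulated annealing over one parameter group (e.g. rigid pose,
    mouth, eyes, eyebrows), keeping all other parameters fixed."""
    p = params.copy()
    e = energy(p)
    for k in range(steps):
        t = t0 * (1 - k / steps) + 1e-6           # linear cooling schedule
        q = p.copy()
        q[group] += rng.normal(0, scale * t, size=len(group))
        eq = energy(q)
        if eq < e or rng.random() < np.exp((e - eq) / t):
            p, e = q, eq                           # Metropolis acceptance rule
    return p, e
```

Running `anneal_group` first on the rigid pose indices and then on each expression group reproduces the split-minimization strategy: each call explores a small subspace, which keeps the search tractable and less prone to local minimum traps.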
In order to be able to track expressions we've developed three different animation systems.
The first animation system is based upon the facial muscle structure, muscles being represented as Bezier curves attached to the mesh vertices. The shape of each curve is controlled by four parameters: two control points and two tangent vectors. By changing these parameters one deforms the curve and thus adjusts the positions of the underlying vertices; this deformation is propagated over the whole mesh using radial basis functions. The animation itself is performed through higher-level parameters, the so-called "expression units". Each unit controls the deformation of one or several curves at a time and corresponds to a specific facial expression subpart, such as lower lip up/down, mouth corners up/down, etc. Figure 2 shows an example of the activation of several expression units (represented by sliders) to form a smile.
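A curve defined by two control points and two tangent vectors is a cubic in Hermite form, which converts directly to Bezier form; the sketch below shows that evaluation plus a toy "expression unit" that offsets curve parameters by a slider activation. The data layout and function names are illustrative assumptions, not our actual interfaces.

```python
import numpy as np

def hermite_bezier(p0, p1, m0, m1, t):
    """Cubic muscle curve defined by two control points (p0, p1) and two
    tangent vectors (m0, m1), evaluated at parameter t in [0, 1].
    The Hermite parameters are converted to Bezier control points."""
    b = np.array([p0, p0 + m0 / 3.0, p1 - m1 / 3.0, p1])
    u = 1.0 - t
    return (u**3 * b[0] + 3 * u**2 * t * b[1]
            + 3 * u * t**2 * b[2] + t**3 * b[3])

def apply_expression_unit(curves, deltas, activation):
    """One expression unit deforms one or several curves at a time:
    each curve's stacked parameters (p0, p1, m0, m1) are offset by the
    unit's deltas, scaled by the slider activation."""
    return {name: c + activation * deltas.get(name, 0.0)
            for name, c in curves.items()}
```

Setting an activation of zero leaves every curve in its neutral shape, and combining several units (e.g. mouth corners up plus lower lip down) composes additively, which is what the sliders of Figure 2 expose.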
The second animation system implemented within our tracking framework is based upon the facial animation parameters specified by the MPEG-4 standard. In MPEG-4, facial animation is defined by two types of parameters: a set of Facial Definition Parameters (FDPs) that reflects the geometry of the 3D model (see Figure 3), and 68 Facial Animation Parameters (FAPs) that specify the animation part and are closely related to muscle actions. At the same time, MPEG-4 only defines the action associated with each FAP (see examples in Figure 3), leaving its interpretation in terms of actual deformation of the 3D mesh to the users.
As a basis of our implementation of the MPEG-4 facial animation system we've adopted the one used for animating the virtual conversational agent "Greta", created by the team of Prof. Catherine Pelachaud. This gives us two advantages: firstly, it allows us to refine our tracking algorithm by validating it on synthesized Greta-animated video sequences; secondly, it allows us to retarget any facial expression from a rich hierarchical semantic library. In order to comply with Greta's MPEG-4 implementation we've constrained the action of each FAP by splitting the surface of our generic model into 68 semantic zones, each FAP affecting only a specified subset of them. The region of influence of each FAP within this subset is in turn defined as an ellipsoid centered at the control point (FDP) corresponding to this FAP. In Greta, the sizes of these ellipsoids were fixed values, which prevented any transfer of Greta's animation mechanism to facial models of other dimensions or geometry, because the visual effect of animating a FAP greatly depends on the dimensions of its region of influence relative to the 3D model. Since our goal was to build an animation system that would work for any kind of face, we've created geometrical dependencies for all the ellipsoid radii, so that the system adapts automatically to any face geometry given as input, thus making it applicable to any face produced on the basis of our generic face model. At the same time our system provides the same effects on the Greta model reconstructed from our generic 3D face as it does on the real Greta.
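The ellipsoidal region of influence with geometry-dependent radii can be sketched as follows. The linear fall-off, the particular face measurements used and the scale coefficients are illustrative assumptions; only the overall mechanism (ellipsoid centered at the FDP, radii derived from the model's own dimensions) comes from the description above.

```python
import numpy as np

def fap_deform(vertices, fdp, direction, amplitude, radii):
    """Apply one FAP: displace the vertices lying inside an ellipsoid
    of influence centered at the FDP control point, with a weight that
    falls off towards the ellipsoid boundary."""
    # Normalized ellipsoid distance: 0 at the FDP, 1 on the boundary
    d = np.sqrt((((vertices - fdp) / radii) ** 2).sum(axis=1))
    w = np.clip(1.0 - d, 0.0, None)        # linear fall-off, zero outside
    return vertices + amplitude * w[:, None] * direction

def radii_from_geometry(mouth_width, eye_separation, scale=(0.5, 0.4, 0.3)):
    """Ellipsoid radii derived from the face's own measurements rather
    than fixed constants (coefficients here are purely illustrative), so
    the same FAP has a comparable visual effect on faces of any size."""
    base = 0.5 * (mouth_width + eye_separation)
    return base * np.asarray(scale)
```

Because the radii scale with the face measurements, retargeting to a larger or smaller head rescales every region of influence automatically, which is the property the fixed-size ellipsoids of the original Greta implementation lacked.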
Although the MPEG-4 parameterization is potentially capable of reproducing a rich palette of realistic expressions and consequently allows expression tracking with more precision than our muscle-based method, it has several significant drawbacks that make it difficult to use FAPs directly as tracking parameters. Firstly, it gives too many parameters to minimize - especially in the mouth region, where animation is controlled by a total of 28 FAPs - which would lead to huge computing times and make the minimization more sensitive to local minimum traps. Secondly, it does not provide any global constraints on the FDP displacements: all of them cause only local deformations and are completely independent, which means that a minimization tool will waste a lot of time searching for the optimum among irrelevant facial expressions. Finally, the deformation function, being generic, does not take into account particular properties of the mesh geometry and is thus capable of producing unrealistic expressions. On the other hand, the muscle-based method gives better control of the mesh surface, providing smoother deformations, but is less precise and very slow because of the RBF interpolation. We've decided to combine both methods in a hierarchical way, keeping the highly parameterized MPEG-4 facial animation mechanism as the basis of our animation system but restraining the FDP displacements by our muscle curves. If necessary, further refinement can then be done using the FAPs directly. The scheme of the combined tracking algorithm is presented in Figure 4.
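The hierarchical combination can be sketched as a two-level search: a coarse pass over a few high-level unit activations, each of which drives many FAPs through a fixed mapping, followed by a small per-FAP refinement around the coarse solution. The random-search optimizer, the linear unit-to-FAP mapping and all names are simplifying assumptions; the real system uses the muscle curves and simulated annealing described earlier.

```python
import numpy as np

def hierarchical_fit(energy, n_units, n_faps, unit_to_fap,
                     steps=100, step=0.05):
    """Two-level fit: first optimize a few high-level unit activations
    (each mapped to many FAPs through the matrix `unit_to_fap`), then
    refine individual FAPs locally around that solution."""
    rng = np.random.default_rng(1)
    units = np.zeros(n_units)
    # Level 1: coarse search in the low-dimensional unit space
    for _ in range(steps):
        cand = units + rng.normal(0, step, n_units)
        if energy(unit_to_fap @ cand) < energy(unit_to_fap @ units):
            units = cand
    faps = unit_to_fap @ units
    # Level 2: small per-FAP refinement, constrained near the level-1 result
    for _ in range(steps):
        i = rng.integers(n_faps)
        cand = faps.copy()
        cand[i] += rng.normal(0, step / 5)
        if energy(cand) < energy(faps):
            faps = cand
    return faps
```

The point of the hierarchy is visible in the dimensions: the coarse pass explores only `n_units` degrees of freedom (e.g. a handful of expression units instead of 28 mouth FAPs), so most of the search happens in a space where every candidate is already a plausible expression.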
While very precise and stable, our algorithm is however far from real-time, requiring a computational time of around 1.8 minutes per frame due to the multiple minimization cycles per image, each containing 300-400 iterations. This time can however be reduced to 1 minute by offloading part of the computation to the GPU. Indeed, if we look closely at the operations performed at each iteration (see Figure 5), the most computationally expensive one is reading back the image with the textured projection from GPU memory for comparison with the frame image. The idea is therefore to perform this comparison directly on the GPU using a GPGPU reduction algorithm and to transfer only the error value. Masking of the negligible pixels is done through the alpha component, and all the GPU computation is implemented using pixel shaders.
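The reduction pass can be illustrated in NumPy. On the GPU each pass is a pixel shader that sums 2x2 blocks into a half-size render target; the sketch below simulates that on square power-of-two textures (function names and the alpha encoding are illustrative assumptions).

```python
import numpy as np

def shader_reduce(error_tex):
    """Simulate the GPGPU reduction: each pass sums 2x2 blocks into a
    half-size target, so an NxN error texture collapses to one value in
    log2(N) passes and only a single float is read back from GPU memory."""
    t = error_tex.astype(np.float64)
    while t.size > 1:
        t = (t[0::2, 0::2] + t[1::2, 0::2]
             + t[0::2, 1::2] + t[1::2, 1::2])
    return float(t[0, 0])

def masked_error_texture(frame, synth, alpha):
    """Per-pixel squared RGB difference; pixels outside the model
    projection are masked out through the alpha component."""
    return alpha * ((frame - synth) ** 2).sum(axis=-1)
```

Transferring one float per iteration instead of a full frame removes the read-back bottleneck, which is where the reported speed-up from 1.8 minutes to about 1 minute per frame comes from.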
Finally, we've implemented several interface improvements, among other things a playback option for tracking results. We've also completed our generic face model by adding eyes, teeth and tongue.