## Section: New Results

### Structured Output Prediction and Learning for Deep Monocular 3D Human Pose Estimation

Participants: Stefan Kinauer, Riza Alp Guler, Siddhartha Chandra, Iasonas Kokkinos

In this work we address the problem of estimating 3D human pose from a single RGB image by blending a feed-forward Convolutional Neural Network (CNN) with a graphical model that couples the 3D positions of parts. The CNN populates a volumetric output space that represents the possible positions of 3D human joints, and also regresses the estimated displacements between pairs of parts. These constitute the 'unary' and 'pairwise' terms of the energy of a graphical model that resides in a 3D label space and delivers an optimal 3D pose configuration at its output. The CNN is trained on the Human3.6M 3D human pose dataset; the graphical model is trained jointly with the CNN in an end-to-end manner, allowing us to exploit both the discriminative power of CNNs and the top-down information pertaining to human pose. We introduce (a) memory-efficient methods for obtaining accurate voxel estimates for parts by blending quantization with regression, (b) efficient structured prediction algorithms for 3D pose estimation using branch-and-bound, and (c) a framework for qualitative and quantitative comparison of competing graphical models. We evaluate our work on the Human3.6M dataset, demonstrating that exploiting the structure of the human pose in 3D yields systematic gains.
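The idea behind point (a), blending quantization with regression, can be sketched as follows: a coarse voxel classification picks the highest-scoring cell, and a per-voxel regressed displacement refines the estimate to sub-voxel accuracy, keeping the grid (and thus memory) small. This is a minimal illustration with random stand-in arrays (`heat`, `offsets` are hypothetical names, not the actual network outputs), not the paper's architecture:

```python
import numpy as np

rng = np.random.default_rng(1)
G = 8                                            # hypothetical coarse voxel grid side
heat = rng.random((G, G, G))                     # stand-in voxel scores for one joint
offsets = rng.uniform(-0.5, 0.5, (G, G, G, 3))   # stand-in regressed sub-voxel offsets

# Quantized estimate: integer coordinates of the highest-scoring voxel.
v = np.unravel_index(heat.argmax(), heat.shape)
coarse = np.array(v, dtype=float)

# Refined estimate: add the regressed continuous displacement at that voxel,
# recovering sub-voxel accuracy without a finer (more memory-hungry) grid.
refined = coarse + offsets[v]
```

The memory saving comes from the grid resolution: a grid that is twice as fine per axis costs 8x the voxels, whereas the offset head adds only three channels per voxel.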
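To make the graphical-model side concrete, the sketch below minimizes an energy of exactly this form (unary voxel scores per joint plus pairwise deformation costs on 3D displacements) over a tree-structured skeleton. For brevity it uses exact min-sum dynamic programming on a tiny grid rather than the branch-and-bound algorithm of the paper, and all quantities (`unary`, `mean_off`, the toy skeleton) are random stand-ins, not learned values:

```python
import numpy as np

rng = np.random.default_rng(0)
G = 4                        # toy grid side: V = G**3 candidate 3D positions per joint
V = G ** 3
coords = np.stack(np.meshgrid(*(np.arange(G),) * 3, indexing="ij"), -1).reshape(V, 3)

# Toy kinematic tree: joint j's parent is parent[j]; joint 0 is the root.
parent = [-1, 0, 1, 1]
J = len(parent)

unary = rng.normal(size=(J, V))              # stand-in CNN unary scores (lower = better)
mean_off = rng.integers(-1, 2, size=(J, 3))  # stand-in regressed parent->child offsets

def pairwise(j, child_pos, parent_pos):
    """Quadratic deformation cost on the 3D displacement between a part pair."""
    d = child_pos - parent_pos - mean_off[j]
    return 0.5 * (d * d).sum(-1)

# Min-sum dynamic programming, leaves to root (exact on a tree).
msg = unary.copy()
for j in range(J - 1, 0, -1):                # children processed before their parents
    # cost[p, c]: pairwise cost of child label c given parent label p
    cost = pairwise(j, coords[None, :, :], coords[:, None, :])
    msg[parent[j]] += (cost + msg[j][None, :]).min(axis=1)

best_root = int(msg[0].argmin())             # optimal energy is msg[0][best_root]

# Backtrack the optimal 3D configuration top-down.
pose = np.zeros(J, dtype=int)
pose[0] = best_root
for j in range(1, J):
    p = pose[parent[j]]
    pose[j] = int((pairwise(j, coords, coords[p]) + msg[j]).argmin())
```

On a tree this dynamic program is exact but costs O(V^2) per edge; the branch-and-bound approach referenced in (b) is what makes inference over large volumetric label spaces practical.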