Abstract

We present two novel solutions for multi-view 3D human pose estimation based on new learnable triangulation methods that combine 3D information from multiple 2D views.

The first (baseline) solution is a basic differentiable algebraic triangulation with an addition of confidence weights estimated from the input images. The second, more complex, solution is based on volumetric aggregation of 2D feature maps from the 2D backbone followed by refinement via 3D convolutions that produce final 3D joint heatmaps.

Crucially, both of the approaches are end-to-end differentiable, which allows us to directly optimize the target metric. We demonstrate transferability of the solutions across datasets and considerably improve the multi-view state of the art on the Human3.6M dataset.

Results

Note: Here and below we report only a summary of our results. Please refer to our paper [cite] for more details.

Human3.6M


MPJPE relative to pelvis:

Method                            MPJPE (averaged across all actions), mm
Multi-View Martinez [4]           57.0
Pavlakos et al. [8]               56.9
Tome et al. [4]                   52.8
Kadkhodamohammadi & Padoy [5]     49.1
RANSAC (our implementation)       27.4
Ours, algebraic                   22.6
Ours, volumetric                  20.8


MPJPE absolute (scenes with invalid ground-truth annotations filtered out):

Method                            MPJPE (averaged across all actions), mm
RANSAC (our implementation)       22.8
Ours, algebraic                   19.2
Ours, volumetric                  17.7


MPJPE relative to pelvis (single-view methods):

Method                            MPJPE (averaged across all actions), mm
Martinez et al. [7]               62.9
Sun et al. [6]                    49.6
Ours, volumetric single view      49.9
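
The tables above report two variants of the same metric. As a minimal sketch of how they differ, the NumPy snippet below computes MPJPE in absolute coordinates and relative to the pelvis. The function names and the choice of joint index 0 as the pelvis are assumptions made for illustration; they are not taken from the evaluation code behind the tables.

```python
import numpy as np

def mpjpe(pred, gt):
    """Mean per-joint position error: mean Euclidean distance over joints.

    pred, gt: arrays of shape (J, 3), in the same units (e.g. mm).
    """
    return float(np.mean(np.linalg.norm(pred - gt, axis=-1)))

def mpjpe_relative(pred, gt, root=0):
    """MPJPE after subtracting the root joint from both poses.

    root=0 as the pelvis index is an assumption of this sketch.
    """
    return mpjpe(pred - pred[root:root + 1], gt - gt[root:root + 1])

# Toy skeleton with three joints; the prediction is the ground truth
# shifted rigidly by a vector of length 50 (sqrt(30^2 + 40^2)).
gt = np.array([[0.0, 0.0, 0.0], [0.0, 100.0, 0.0], [0.0, 200.0, 0.0]])
pred = gt + np.array([30.0, 0.0, 40.0])

abs_err = mpjpe(pred, gt)           # sensitive to the global shift
rel_err = mpjpe_relative(pred, gt)  # the global shift cancels out
```

In this toy case the absolute error equals the length of the shift, while the pelvis-relative error vanishes, which is why a global offset in the annotations is invisible to the relative metric.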

CMU Panoptic


MPJPE relative to pelvis [4 cameras]:

Method                            MPJPE, mm
RANSAC (our implementation)       39.5
Ours, algebraic                   21.3
Ours, volumetric                  13.7


CMU Panoptic results
Fig 1. Visualization of different approaches [2 cameras]. This illustration demonstrates the robustness of the volumetric triangulation approach.


CMU Panoptic camera plot
Fig 2. MPJPE* vs. number of cameras. Note that here we measure MPJPE*, where the noisy annotations from CMU Panoptic are treated as ground truth.

Transfer from CMU Panoptic to Human3.6M

We demonstrate that the learnt model is able to transfer between different coloring and camera setups without any finetuning (see video demonstration).


Transfer results
Fig 3. Demonstration of successful transfer of the solution trained on the CMU Panoptic dataset to Human3.6M scenes. Note that the keypoint skeleton models of Human3.6M and CMU Panoptic are different.

Overview

Our approaches assume synchronized video streams from $C$ cameras with known projection matrices $P_c$, capturing the performance of a single person in the scene. We aim to estimate the global 3D positions $y_j$ of a fixed set of human joints with indices $j \in (1..J)$.

Note: Here we present only a short overview of our methods. Please refer to our paper [cite] for more details.

Algebraic

Our first approach is based on algebraic triangulation with learned confidences.

Algebraic model

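  1. The 2D backbone produces the joints’ heatmaps $H_{c,j}$ and camera-joint confidences $w_{c,j}$.

  2. The 2D positions of the joints are inferred from the 2D joint heatmaps by applying soft-argmax (with inverse temperature parameter $\alpha$):

    $$x_{c,j} = \sum_{r} r \cdot \frac{\exp(\alpha H_{c,j}(r))}{\sum_{r'} \exp(\alpha H_{c,j}(r'))}$$

    where $r$ runs over the pixel positions of the heatmap.

  3. The 2D positions together with the confidences are passed to the algebraic triangulation module, which solves the triangulation problem in the form of a system of weighted linear equations:

    $$(w_j \circ A_j)\,\tilde{y}_j = 0$$

    where $w_j$ is the vector of confidences for joint $j$, $A_j$ is the matrix combined of the 2D joint coordinates and camera parameters (see details in [1]), and $\tilde{y}_j$ is the target 3D position of joint $j$ in homogeneous coordinates.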
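
To make step 3 concrete, here is a minimal NumPy sketch of confidence-weighted algebraic triangulation for a single joint. It builds two linear equations per camera in the standard direct-linear-transform form, scales each camera's rows by its confidence, and takes the right singular vector with the smallest singular value as the homogeneous solution. The function names, the toy cameras, and the weight values are assumptions for illustration; the actual module may differ in details.

```python
import numpy as np

def triangulate_weighted(proj_matrices, points_2d, confidences):
    """Weighted linear triangulation of one joint.

    proj_matrices: (C, 3, 4) camera projection matrices
    points_2d:     (C, 2) detected 2D positions (u, v), one per camera
    confidences:   (C,) non-negative per-camera weights
    returns:       (3,) estimated 3D position
    """
    # Two rows per camera: u * P[2] - P[0] and v * P[2] - P[1].
    A = points_2d[:, :, None] * proj_matrices[:, 2:3, :] - proj_matrices[:, :2, :]
    A = A * confidences[:, None, None]   # scale each camera's rows by its weight
    A = A.reshape(-1, 4)                 # (2C, 4)
    # The least-squares solution of A y = 0 with ||y|| = 1 is the last
    # right singular vector of A.
    _, _, vh = np.linalg.svd(A)
    y_h = vh[-1]
    return y_h[:3] / y_h[3]              # homogeneous -> Euclidean

def project(P, y):
    """Project a 3D point with a 3x4 projection matrix."""
    x = P @ np.append(y, 1.0)
    return x[:2] / x[2]

# Three toy cameras that differ only by a translation along x.
P = np.stack([np.hstack([np.eye(3), np.array([[tx], [0.0], [0.0]])])
              for tx in (0.0, -1.0, 1.0)])
y_true = np.array([0.2, -0.1, 5.0])
pts = np.stack([project(Pc, y_true) for Pc in P])

y_est = triangulate_weighted(P, pts, np.ones(3))

# Corrupt the third view, then compare uniform weights with a low
# confidence for the corrupted camera.
pts_bad = pts.copy()
pts_bad[2] += 0.5
err_uniform = np.linalg.norm(triangulate_weighted(P, pts_bad, np.ones(3)) - y_true)
err_weighted = np.linalg.norm(
    triangulate_weighted(P, pts_bad, np.array([1.0, 1.0, 0.01])) - y_true)
```

With clean detections the estimate matches the true point. With one corrupted view, down-weighting that camera reduces the error, which is the role the confidences are meant to play.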

All blocks allow backpropagation of the gradients, so the model can be trained end-to-end.
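
Soft-argmax is what keeps the 2D position inference differentiable, since a hard argmax has no useful gradient. A minimal NumPy sketch for a single heatmap follows; the function name and the default value of the inverse temperature are assumptions for illustration, and a trainable version would be written in an autodiff framework rather than NumPy.

```python
import numpy as np

def soft_argmax_2d(heatmap, alpha=100.0):
    """Softmax-weighted mean pixel position of a 2D heatmap.

    heatmap: (H, W) array of scores
    alpha:   inverse temperature; larger values approach a hard argmax
    returns: (2,) position as (x, y) in pixel units
    """
    h, w = heatmap.shape
    logits = alpha * heatmap
    logits = logits - logits.max()       # subtract max for numerical stability
    probs = np.exp(logits)
    probs /= probs.sum()                 # spatial softmax
    ys, xs = np.mgrid[0:h, 0:w]          # pixel coordinate grids
    return np.array([(probs * xs).sum(), (probs * ys).sum()])

# Heatmap with a single peak at x=4, y=2.
hm = np.zeros((5, 7))
hm[2, 4] = 1.0

sharp = soft_argmax_2d(hm, alpha=100.0)  # close to the peak location
flat = soft_argmax_2d(hm, alpha=0.0)     # uniform weights: centre of the map
```

A large inverse temperature concentrates the weights on the peak, while a value of zero ignores the heatmap entirely and returns the centre of the map.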

Volumetric

Our second approach is based on volumetric triangulation.

Volumetric model

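  1. The 2D backbone produces intermediate feature maps $M_{c,k}$ (note that, unlike in the first model, the feature maps don’t have to be interpretable).

  2. The feature maps are then unprojected into a volume with a per-view aggregation (see animation below):

    $$V^{\text{view}}_{c,k} = M_{c,k}\{P_c V^{\text{coords}}\}$$

    where $V^{\text{coords}}$ are the absolute coordinates of each voxel and $P_c$ is the projection matrix of camera $c$. The operation $\{\cdot\}$ denotes bilinear sampling.

  3. The volume is passed to a 3D convolutional neural network that outputs the interpretable 3D heatmaps $V^{\text{output}}_j$.

  4. The output 3D positions of the joints are inferred from the 3D joint heatmaps by computing soft-argmax:

    $$y_j = \sum_{r} r \cdot \frac{\exp(V^{\text{output}}_j(r))}{\sum_{r'} \exp(V^{\text{output}}_j(r'))}$$

    where $r$ runs over the voxel positions of the volume.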
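
As a rough illustration of step 4, the NumPy sketch below computes a 3D soft-argmax: a softmax over all voxels followed by the weighted mean of the voxel coordinates. The function name, the grid size, and the synthetic heatmap are assumptions for illustration only.

```python
import numpy as np

def soft_argmax_3d(heatmap, voxel_coords):
    """Softmax-weighted mean of voxel coordinates.

    heatmap:      (X, Y, Z) scores for one joint
    voxel_coords: (X, Y, Z, 3) absolute coordinates of each voxel
    returns:      (3,) estimated joint position
    """
    logits = heatmap - heatmap.max()     # subtract max for numerical stability
    probs = np.exp(logits)
    probs /= probs.sum()                 # softmax over the whole volume
    return (probs[..., None] * voxel_coords).sum(axis=(0, 1, 2))

# Toy 5x5x5 grid spanning [-1, 1] along each axis.
axis = np.linspace(-1.0, 1.0, 5)
voxel_coords = np.stack(np.meshgrid(axis, axis, axis, indexing="ij"), axis=-1)

# Synthetic heatmap that peaks at a known grid position.
target = np.array([0.5, 0.0, -0.5])
heatmap = -50.0 * np.sum((voxel_coords - target) ** 2, axis=-1)

joint = soft_argmax_3d(heatmap, voxel_coords)
```

Because the result is a weighted mean of continuous coordinates, the estimate is not restricted to voxel centres, and every operation admits gradients.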

Unlike the algebraic method, the volumetric model includes a 3D convolutional neural network, which is able to model a human pose prior. The volumetric model is also fully differentiable and can be trained end-to-end.
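
The unprojection step can be sketched in NumPy as follows: each voxel centre is projected into every camera, the feature map is sampled bilinearly at the projected pixel, and the per-view volumes are then aggregated. This sketch aggregates with a plain sum, which is only one simple choice and an assumption here. The function names and toy cameras are likewise illustrative, and the code assumes all voxels lie in front of the cameras.

```python
import numpy as np

def bilinear_sample(fmap, uv):
    """Sample fmap (H, W, K) at pixel coords uv (N, 2) given as (x, y).

    Pixels outside the map contribute zero.
    """
    h, w, k = fmap.shape
    x, y = uv[:, 0], uv[:, 1]
    x0 = np.floor(x).astype(int)
    y0 = np.floor(y).astype(int)
    out = np.zeros((uv.shape[0], k))
    for dx in (0, 1):
        for dy in (0, 1):
            xi, yi = x0 + dx, y0 + dy
            wgt = (1 - np.abs(x - xi)) * (1 - np.abs(y - yi))
            valid = (xi >= 0) & (xi < w) & (yi >= 0) & (yi < h)
            out[valid] += wgt[valid, None] * fmap[yi[valid], xi[valid]]
    return out

def unproject(feature_maps, proj_matrices, voxel_coords):
    """feature_maps (C, H, W, K), proj_matrices (C, 3, 4),
    voxel_coords (N, 3) -> per-view volumes of shape (C, N, K)."""
    homog = np.hstack([voxel_coords, np.ones((len(voxel_coords), 1))])
    views = []
    for fmap, P in zip(feature_maps, proj_matrices):
        proj = homog @ P.T                   # (N, 3) homogeneous pixels
        uv = proj[:, :2] / proj[:, 2:3]      # perspective division
        views.append(bilinear_sample(fmap, uv))
    return np.stack(views)

# Two toy cameras sharing intrinsics K; the second is shifted along x.
K = np.array([[4.0, 0.0, 2.0], [0.0, 4.0, 2.0], [0.0, 0.0, 1.0]])
P1 = K @ np.hstack([np.eye(3), np.zeros((3, 1))])
P2 = K @ np.hstack([np.eye(3), np.array([[1.0], [0.0], [0.0]])])

# One-channel 5x5 feature map whose value equals the x pixel coordinate.
ramp = np.tile(np.arange(5.0), (5, 1))[..., None]
feature_maps = np.stack([ramp, ramp])

voxels = np.array([[0.0, 0.0, 4.0], [0.5, 0.0, 4.0]])
views = unproject(feature_maps, np.stack([P1, P2]), voxels)
volume = views.sum(axis=0)                   # sum aggregation (assumed)
```

Because the feature map is a ramp in x, each sampled value equals the projected x coordinate, which makes the result easy to check by hand: the second voxel projects to x = 2.5 in the first camera and x = 3.5 in the second.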


Here’s an animation showing how unprojection works for 2 cameras:

Unprojection animation

Human3.6M erroneous annotations

The Human3.6M dataset contains some erroneous 3D pose annotations, which affect several actions of subject S9.

Interestingly, the error is nullified when the pelvis is subtracted (as done for monocular methods). However, to keep the results for the multi-view setup interpretable, we must exclude these scenes from the evaluation.

Here is an example of erroneous 3D pose annotations (S9, “Greeting”):

BibTeX

(soon)

References