Textured Neural Avatars
CVPR 2019 (oral)

We propose a new model for neural rendering of humans. The model is trained for a single person and can produce renderings of this person from novel viewpoints (top) or in new body poses (bottom) unseen during training.

We present a system for learning full-body neural avatars, i.e. deep networks that produce full-body renderings of a person for varying body pose and camera position. Our system takes the middle path between the classical graphics pipeline and recent deep learning approaches that generate images of humans using image-to-image translation. In particular, our system estimates an explicit two-dimensional texture map of the model surface. At the same time, it abstains from explicit shape modeling in 3D. Instead, at test time, the system uses a fully-convolutional network to directly map a set of body feature points defined w.r.t. the camera to the 2D texture coordinates of individual pixels in the image frame. We show that such a system is capable of generating realistic renderings while being trained on videos annotated with 3D poses and foreground masks. We also demonstrate that maintaining an explicit texture representation helps our system achieve better generalization compared to systems that use direct image-to-image translation.

Main idea

We present a neural avatar system that performs full-body rendering and combines an idea from classical computer graphics, namely the decoupling of geometry and texture, with deep convolutional neural networks. In particular, similarly to the classic pipeline, our system explicitly estimates the 2D textures of body parts. Keeping this component within the neural pipeline boosts generalization across camera positions and body movements. The role of the convolutional network in our approach is confined to predicting a warping from the texture to the output image.

In general, we are interested in synthesizing RGB images of a certain person given his/her pose. The input pose is defined as a stack of "bone" rasterizations (one bone per channel). A fully-convolutional network processes this input to produce a stack of body part assignment maps and a stack of body part coordinate maps. These stacks are then used to sample the body texture maps at the locations prescribed by the coordinate stack, with the weights prescribed by the assignment stack, producing the RGB image. In addition, the last map in the body part assignment stack corresponds to the background probability. During learning, the mask and the RGB image are compared with the ground truth, and the resulting losses are backpropagated through the sampling operation into both the fully-convolutional network and the textures, updating them jointly. The detailed scheme of our method is illustrated below:
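The final sampling step described above can be sketched in PyTorch. This is an illustrative reimplementation, not the authors' code: the function name, tensor shapes, and texture resolution are assumptions, and the real system adds losses and a learned background. The key property shown is that bilinear sampling (`grid_sample`) is differentiable in both the texture maps and the predicted coordinates, so gradients can flow back into the network and the textures, as the text describes.

```python
import torch
import torch.nn.functional as F

def render_from_texture(part_assign, part_coords, textures):
    """Weighted texture sampling (illustrative sketch).

    part_assign: (B, K+1, H, W) soft part weights; channel K is background.
    part_coords: (B, K, 2, H, W) per-part texture coordinates in [-1, 1].
    textures:    (B, K, 3, T, T) one RGB texture map per body part.
    Returns the rendered RGB image (B, 3, H, W) and the background
    probability map (B, 1, H, W).
    """
    B, K_plus_1, H, W = part_assign.shape
    K = K_plus_1 - 1
    rgb = torch.zeros(B, 3, H, W, dtype=textures.dtype)
    for k in range(K):
        # grid_sample expects the sampling grid laid out as (B, H, W, 2)
        grid = part_coords[:, k].permute(0, 2, 3, 1)
        sampled = F.grid_sample(textures[:, k], grid, align_corners=True)
        # Blend each part's sampled colors with its soft assignment weight
        rgb = rgb + part_assign[:, k : k + 1] * sampled
    background = part_assign[:, K : K + 1]
    return rgb, background
```

At training time, a reconstruction loss on `rgb` and a mask loss on `background` would be backpropagated through this operation, which is what lets the texture maps themselves receive gradient updates.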

Comparison with other methods

To estimate the efficacy of the proposed system, we compare our model against two baselines. The first is the video-to-video system (V2V). The second is a more direct ablation (Direct): a network with the same architecture that predicts the RGB color and mask directly, rather than via body part assignments/coordinates. The Direct system is trained using the same losses and according to the same protocol as ours.


Results on external monocular sequences. In this scenario we transfer a trained model to a new person by fitting it to a single video. Rows 1-2: comparison of avatars produced by our algorithm trained on sequences from Alldieck et al. (left) and avatars produced by Alldieck et al. (right). Row 3: a textured neural avatar built from a YouTube video.