Fast Bi-layer Neural Synthesis of
One-Shot Realistic Head Avatars

Our new architecture creates photorealistic neural avatars in a one-shot mode and achieves a considerable speed-up over previous approaches. Rendering takes just 42 milliseconds on an Adreno 640 GPU (Snapdragon 855) in FP16 mode.
Abstract

We propose a neural rendering-based system that creates head avatars from a single photograph. Our approach models a person's appearance by decomposing it into two layers. The first layer is a pose-dependent coarse image that is synthesized by a small neural network. The second layer is defined by a pose-independent texture image that contains high-frequency details. The texture image is generated offline, warped and added to the coarse image to ensure a high effective resolution of synthesized head views. We compare our system to analogous state-of-the-art systems in terms of visual quality and speed. The experiments show significant inference speedup over previous neural head avatar models for a given visual quality. We also report on a real-time smartphone-based implementation of our system.

Main idea

The main idea of our approach is to split a single heavy generator network, which is otherwise run for every frame at test time, into two parts: one network that is run only during initialization (i.e., once per identity), and a much lighter network, which we call an inference generator, that is run once per frame. In our proposed implementation, the following networks are trained in an end-to-end fashion:

• The embedder network $E \big( \mathbf{x}^i(s), \mathbf{y}^i(s) \big)$ encodes a concatenation of a source image and a landmark image into a stack of embeddings $\{ \hat{\mathbf{e}}_k^i(s) \}$, which are used to initialize the adaptive parameters inside the generators.

• The texture generator network $G_\text{tex} \big( \{ \hat{\mathbf{e}}_k^i(s) \} \big)$ initializes its adaptive parameters from the embeddings and decodes an inpainted high-frequency component of the source image, which we call a texture $\hat{\mathbf{X}}^i(s)$. This texture is initialized once per identity and is intended to be pose-independent.

• The inference generator network $G \big( \mathbf{y}^i(t), \{ \hat{\mathbf{e}}_k^i(s) \} \big)$ maps target poses into a predicted image $\hat{\mathbf{x}}^i(t)$. The network accepts vector keypoints as an input and outputs a low-frequency layer of the output image $\hat{\mathbf{x}}_\text{LF}^i(t)$, which encodes basic facial features, skin color, and lighting, as well as $\hat{\boldsymbol{\omega}}^i(t)$ -- a mapping between the coordinate spaces of the texture and the output image. The high-frequency layer of the output image is then obtained by warping the predicted texture, $\hat{\mathbf{x}}_\text{HF}^i(t) = \hat{\boldsymbol{\omega}}^i(t) \circ \hat{\mathbf{X}}^i(s)$, and is added to the low-frequency component to produce the final image: \begin{equation*}\label{eq:image} \hat{\mathbf{x}}^i(t) = \hat{\mathbf{x}}_\text{LF}^i(t) + \hat{\mathbf{x}}_\text{HF}^i(t) \, . \end{equation*}

During training, we first encode a source frame into the embeddings, then initialize the adaptive parameters of both the inference and texture generators, and predict a high-frequency texture. These operations are done only once per avatar. Target keypoints are then used to predict the low-frequency component of the output image and a warping field which, applied to the texture, provides the high-frequency component. The two components are added together to produce the output. We therefore decompose the output image into two layers: the low-frequency layer is produced directly by the small inference generator, while the high-frequency layer is produced by warping a static texture.

Results

The results below were all obtained with a model running at 42 ms per frame on a Snapdragon 855.

A visualization of each individual output produced by our model.

Self-driving results.

Citation
@InProceedings{Zakharov20,
  author    = {Zakharov, Egor and Ivakhnenko, Aleksei and Shysheya, Aliaksandra and Lempitsky, Victor},
  title     = {Fast Bi-layer Neural Synthesis of One-Shot Realistic Head Avatars},
  booktitle = {European Conference on Computer Vision (ECCV)},
  month     = {August},
  year      = {2020}
}