Fast Bi-layer Neural Synthesis of
One-Shot Realistic Head Avatars

Our new architecture creates photorealistic neural avatars in one-shot mode and achieves a considerable speed-up over previous approaches. Rendering takes just 42 milliseconds per frame on the Adreno 640 GPU (Snapdragon 855) in FP16 mode.
Video presentation

We propose a neural rendering-based system that creates head avatars from a single photograph. Our approach models a person's appearance by decomposing it into two layers. The first layer is a pose-dependent coarse image that is synthesized by a small neural network. The second layer is defined by a pose-independent texture image that contains high-frequency details. The texture image is generated offline, warped and added to the coarse image to ensure a high effective resolution of synthesized head views. We compare our system to analogous state-of-the-art systems in terms of visual quality and speed. The experiments show significant inference speedup over previous neural head avatar models for a given visual quality. We also report on a real-time smartphone-based implementation of our system.

Main idea

The main idea of our approach is to split the single heavy generator network, which previous methods run for every frame at test time, into two networks: one that runs only during initialization (i.e., once per identity), and a much lighter network, which we call an inference generator, that runs once per frame. In our proposed implementation, all networks are trained together in an end-to-end fashion.

During training, we first encode a source frame into embeddings, then initialize the adaptive parameters of both the inference and texture generators, and predict a high-frequency texture. These operations are done only once per avatar. Target keypoints are then used to predict a low-frequency component of the output image and a warping field which, applied to the texture, provides the high-frequency component. The two components are then added together to produce the output. We therefore decompose the output image into two layers: a low-frequency layer produced directly by the small inference generator, and a high-frequency layer produced by warping a static texture.
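The composition step above (low-frequency image plus warped high-frequency texture) can be sketched in a few lines. This is a minimal NumPy illustration with made-up names (`compose_avatar`, `warp_bilinear`), not the paper's actual code; in the real system the coarse image, texture, and warping field are all produced by the trained networks.

```python
import numpy as np

def warp_bilinear(texture, warp_field):
    """Sample texture (H, W, 3) at the float pixel coordinates in
    warp_field (H, W, 2) = (x, y), with bilinear interpolation and
    border clamping."""
    H, W, _ = texture.shape
    xs = np.clip(warp_field[..., 0], 0, W - 1)
    ys = np.clip(warp_field[..., 1], 0, H - 1)
    x0 = np.floor(xs).astype(int)
    y0 = np.floor(ys).astype(int)
    x1 = np.minimum(x0 + 1, W - 1)
    y1 = np.minimum(y0 + 1, H - 1)
    wx = (xs - x0)[..., None]
    wy = (ys - y0)[..., None]
    top = texture[y0, x0] * (1 - wx) + texture[y0, x1] * wx
    bot = texture[y1, x0] * (1 - wx) + texture[y1, x1] * wx
    return top * (1 - wy) + bot * wy

def compose_avatar(coarse, texture, warp_field):
    # low-frequency layer (predicted per frame) + warped static
    # high-frequency texture (predicted once per avatar)
    return coarse + warp_bilinear(texture, warp_field)

# Sanity check with an identity warp: the warped texture is unchanged,
# so with a zero coarse image the output equals the texture.
H = W = 8
rng = np.random.default_rng(0)
texture = rng.random((H, W, 3))
coarse = np.zeros((H, W, 3))
xs, ys = np.meshgrid(np.arange(W, dtype=float), np.arange(H, dtype=float))
identity = np.stack([xs, ys], axis=-1)
out = compose_avatar(coarse, texture, identity)
assert np.allclose(out, texture)
```

Because the texture is static, only the small inference generator and the cheap warp-and-add step run per frame, which is what makes the 42 ms mobile rendering time feasible.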


The results below are all achieved with a model running at 42 ms per frame on a Snapdragon 855.

A visualization of each individual output produced by our model.

Self-driving results.

@inproceedings{zakharov2020fast,
  author={Zakharov, Egor and Ivakhnenko, Aleksei and Shysheya, Aliaksandra and Lempitsky, Victor},
  title={Fast Bi-layer Neural Synthesis of One-Shot Realistic Head Avatars},
  booktitle={European Conference on Computer Vision (ECCV)},
  month={August},
  year={2020}}