Few-Shot Adversarial Learning of
Realistic Neural Talking Head Models
 
 
 
Egor Zakharov1,2
Aliaksandra Shysheya1,2
Egor Burkov1,2
Victor Lempitsky1,2
 
 
 
Samsung AI Center Moscow1
Skolkovo Institute of Science and Technology2
 
 
 
 
 
 
 
 
 
 
Figure 1: Our results in eight-shot mode on people not seen during meta-learning. The leftmost photograph shows one of the eight source images, all taken from the same video sequence, that were shown to the system. The remaining photographs show images generated by our model for face landmarks taken from a different video of the same person.
 
 
 
 
BibTeX
@online{Zakharov19,
    title={Few-Shot Adversarial Learning of Realistic Neural Talking Head Models},
    author={Egor Zakharov and Aliaksandra Shysheya and Egor Burkov and Victor Lempitsky},
    year={2019},
    eprint={1905.08233},
    eprinttype={arXiv}
}
 
 
Abstract
 
 
Generation of photo-realistic human head images is a challenging task, which several recent works have addressed via training on datasets that include a large number of images of the same person. However, many practical scenarios require such talking head models to be learned in a few-shot, or even one-shot, setting, when only a single image of a person is available. In this work, we present a system with such capability. It performs lengthy meta-learning on a large dataset of videos, and is afterwards able to frame few- and one-shot learning tasks for previously unseen people as adversarial training problems with high-capacity generators and discriminators. Crucially, the system initializes the parameters of both the generator and the discriminator in a person-specific way, so that training can be based on just a few images and done quickly, despite the need to tune tens of millions of parameters and the fact that the person was never seen during the meta-learning stage. We show that such an approach is able to learn highly realistic and personalized talking head models.
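
To make the few-shot stage concrete, below is a minimal sketch in PyTorch of adversarially fine-tuning a meta-learned generator and discriminator on a handful of images of a new person. This is not the authors' code: the tiny stand-in modules, the L1 content term (standing in for the paper's perceptual loss), and all names such as TinyGenerator and finetune_few_shot are hypothetical illustrations of the idea.

import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyGenerator(nn.Module):
    # Stand-in generator: maps a 3-channel landmark image to an RGB frame.
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(3, 32, 3, padding=1), nn.ReLU(),
            nn.Conv2d(32, 3, 3, padding=1), nn.Tanh())

    def forward(self, landmarks):
        return self.net(landmarks)

class TinyDiscriminator(nn.Module):
    # Stand-in conditional discriminator: scores (frame, landmarks) pairs.
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(6, 32, 4, stride=2, padding=1), nn.LeakyReLU(0.2),
            nn.Conv2d(32, 1, 4, stride=2, padding=1))

    def forward(self, frame, landmarks):
        return self.net(torch.cat([frame, landmarks], dim=1)).mean(dim=(1, 2, 3))

def finetune_few_shot(G, D, frames, landmarks, steps=40, lr=5e-5):
    # Adversarially fine-tune meta-learned G and D on a few (frame, landmark)
    # pairs of a new person, mirroring the few-shot stage described above.
    opt_g = torch.optim.Adam(G.parameters(), lr=lr)
    opt_d = torch.optim.Adam(D.parameters(), lr=lr)
    for _ in range(steps):
        # Discriminator step: hinge loss on real vs. generated frames.
        fake = G(landmarks)
        d_loss = (F.relu(1.0 - D(frames, landmarks)).mean()
                  + F.relu(1.0 + D(fake.detach(), landmarks)).mean())
        opt_d.zero_grad()
        d_loss.backward()
        opt_d.step()
        # Generator step: adversarial term plus an L1 content term standing
        # in for the paper's perceptual loss.
        fake = G(landmarks)
        g_loss = -D(fake, landmarks).mean() + F.l1_loss(fake, frames)
        opt_g.zero_grad()
        g_loss.backward()
        opt_g.step()
    return G

# Usage: eight-shot adaptation on dummy 64x64 tensors.
frames = torch.rand(8, 3, 64, 64) * 2 - 1      # "real" frames in [-1, 1]
landmarks = torch.rand(8, 3, 64, 64)           # rasterized landmark images
G = finetune_few_shot(TinyGenerator(), TinyDiscriminator(), frames, landmarks)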
 
 
 
 
 
 
Figure 2: Our meta-learning architecture involves an embedding network that maps a set of head images (with estimated face landmarks) to the embedding vector space, where these vectors are averaged to produce a single embedding vector. The embedding vector is then projected by a linear layer into a subset of parameters, which are assigned to the generator network. The generator network maps the input face landmarks into output frames through a set of convolutional layers. During meta-learning, we use sets of frames from the same video: several frames are used to compute the embedding, and a stand-alone frame is used as the ground truth for training. Our training objective includes perceptual and adversarial losses (the latter implemented via a conditional projection discriminator).
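
The forward pass in Figure 2 can be summarized in a short sketch. The following PyTorch code, again with hypothetical stand-in modules and dimensions rather than the authors' implementation, shows the data flow: K images are embedded and averaged, a linear layer projects the mean embedding into person-specific normalization (AdaIN-style) parameters of the generator, and the generator renders a frame from target landmarks.

import torch
import torch.nn as nn
import torch.nn.functional as F

EMB_DIM, CH = 64, 32   # embedding size and generator width (illustrative values)

class TinyEmbedder(nn.Module):
    # Maps one (image, landmarks) pair to an embedding vector.
    def __init__(self):
        super().__init__()
        self.conv = nn.Conv2d(6, EMB_DIM, 3, padding=1)

    def forward(self, image, landmarks):
        h = self.conv(torch.cat([image, landmarks], dim=1))
        return h.mean(dim=(2, 3))             # global average pool -> (N, EMB_DIM)

class TinyAdaINGenerator(nn.Module):
    # Renders a frame from landmarks; its scale/bias parameters are not learned
    # directly but predicted from the person embedding by the `project` layer.
    def __init__(self):
        super().__init__()
        self.enc = nn.Conv2d(3, CH, 3, padding=1)
        self.dec = nn.Conv2d(CH, 3, 3, padding=1)
        self.project = nn.Linear(EMB_DIM, 2 * CH)   # embedding -> generator params

    def forward(self, landmarks, embedding):
        scale, bias = self.project(embedding).chunk(2, dim=1)
        h = F.instance_norm(self.enc(landmarks))    # normalize features
        h = h * scale[:, :, None, None] + bias[:, :, None, None]
        return torch.tanh(self.dec(torch.relu(h)))

# K frames of one person (same video) with their rasterized landmark images.
K = 8
images = torch.rand(K, 3, 64, 64)
landmarks = torch.rand(K, 3, 64, 64)

embedder, generator = TinyEmbedder(), TinyAdaINGenerator()
e_hat = embedder(images, landmarks).mean(dim=0, keepdim=True)   # average over K
target_landmarks = torch.rand(1, 3, 64, 64)   # landmarks of a held-out frame
frame = generator(target_landmarks, e_hat)    # generated frame, (1, 3, 64, 64)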
 
 
Video
 
 
 
 
We present few-shot learning results using one, eight, and 32 frames on the VoxCeleb2 test set. We also show the effect that fine-tuning has on the final results, and highlight key features of our model's architecture and its capabilities. Finally, we show results of applying our model in the wild: we create talking heads from our own selfies, and bring paintings and photographs to life via puppeteering video sequences.