Neural Head Reenactment with Latent Pose Descriptors
Samsung AI Center, Moscow
Skolkovo Institute of Science and Technology, Moscow
We propose a head reenactment system driven by latent pose descriptors (unlike other systems that use e.g. keypoints).
It can predict the foreground segmentation alongside the RGB image.
Pose descriptors are person-agnostic and can be useful for third-party tasks (e.g. emotion recognition).
Pose-identity disentanglement happens "automatically", without special losses.
Trained on VoxCeleb2 (~100K videos of ~6K celebrities).
Meta-learning stage: randomly pick 9 frames from a video (8 as identity sources, 1 as the pose source); learn to reconstruct the pose source.
An off-the-shelf segmentation network is used to obtain "ground truth" background masks.
Fine-tuning stage: take ≥1 images of the target person and fine-tune the generator on them.
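The episode sampling in the meta-learning stage can be sketched as follows. This is a minimal illustration: the function and variable names are my own, not the paper's code, and frames are stand-in objects.

```python
import random

def sample_episode(video_frames, n_identity=8, seed=None):
    """Sample one meta-learning episode from a single video:
    n_identity frames feed the identity encoder, and one extra frame
    is both the pose-encoder input and the reconstruction target."""
    rng = random.Random(seed)
    picked = rng.sample(range(len(video_frames)), n_identity + 1)
    identity_frames = [video_frames[i] for i in picked[:n_identity]]
    pose_source = video_frames[picked[-1]]  # also the ground-truth target
    return identity_frames, pose_source

# Toy usage: frames are represented by their indices.
frames = list(range(50))
ids, pose = sample_episode(frames, seed=0)
assert len(ids) == 8 and pose not in ids
```

Reconstructing the pose source (rather than some other frame) is what lets plain reconstruction losses drive the disentanglement: identity must come from the 8 identity frames, pose from the single pose frame.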
Intuitively, nothing prevents our system from encoding person-specific information into the pose embedding.
In practice, this does not happen when three simple techniques are enabled:
The pose encoder's capacity is lower than that of the identity encoder (in our case, MobileNetV2 vs. ResNeXt-50).
Pose augmentations (transformations that preserve a person's identity in an image) are applied to the pose source.
A foreground mask is predicted, and reconstruction losses are computed with the background blacked out.
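The background-masked reconstruction loss from the third technique can be sketched as a masked L1 over pixels. This is a toy version operating on flat lists; the names are illustrative and not taken from the paper's code.

```python
def masked_l1(pred, target, fg_mask):
    """L1 reconstruction loss with the background blacked out:
    pixels where fg_mask == 0 contribute nothing, so the generator
    is never penalized for (and never learns to copy) the background.
    All arguments are flat lists of equal length; fg_mask holds 0/1."""
    assert len(pred) == len(target) == len(fg_mask)
    total = sum(abs(p - t) * m for p, t, m in zip(pred, target, fg_mask))
    n_fg = sum(fg_mask)
    return total / n_fg if n_fg else 0.0

# Background pixels (mask 0) are ignored entirely: only the second
# pixel differs among foreground pixels, giving |0.2 - 0.0| / 2 = 0.1.
loss = masked_l1([0.5, 0.2, 0.9], [0.5, 0.0, 0.1], [1, 1, 0])
```

Blacking out the background keeps the pose embedding from sneaking in scene-specific (and hence person-specific) cues through background reconstruction.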
Disabling the above techniques harms driver invariance but improves pose encoding ability (which can be useful in self-reenactment scenarios). Ablated variants:
No pose augm.
Heavier pose enc.
Heavier pose enc., no pose augm.