Coordinate-based Texture Inpainting for Pose-Guided Image Generation
CVPR 2019 paper

[Figure: two examples, each showing the source image, the color-inpainted texture, and the coordinate-inpainted texture.]

We present a new deep learning approach to pose-guided resynthesis of human photographs. At the heart of the new approach is the estimation of the complete body surface texture based on a single photograph. Since the input photograph always observes only a part of the surface, we suggest a new inpainting method that completes the texture of the human body. Rather than working directly with colors of texture elements, the inpainting network estimates an appropriate source location in the input image for each element of the body surface. This correspondence field between the input image and the texture is then further warped into the target image coordinate frame based on the desired pose, effectively establishing the correspondence between the source and the target view even when the pose change is drastic. The final convolutional network then uses the established correspondence and all other available information to synthesize the output image using a fully-convolutional architecture with deformable convolutions. We show the state-of-the-art result for pose-guided image synthesis. Additionally, we demonstrate the performance of our system for garment transfer and pose-guided face resynthesis.

Main idea

Our method performs human image reconstruction via texture inpainting. We use the DensePose approach to estimate UV renders, that is, mappings between pixel positions in image space (source or target) and texture space. Once the source image is mapped into texture space, the full texture is estimated and then warped onto the target space to be refined. The pipeline thus consists of two convolutional networks: the inpainter f, which reconstructs the full texture from a partially visible one, and the refiner g, which processes the texture warped onto the target space to generate the new view.
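The two warps around the inpainter can be illustrated with a small NumPy sketch. It is not the authors' implementation: it assumes a single UV chart with values in [0, 1] (DensePose itself predicts a body-part index as well), nearest-texel lookup, and the hypothetical function names `image_to_texture` and `texture_to_image`.

```python
import numpy as np

def image_to_texture(image, uv, mask, tex_size):
    """Scatter visible source pixels into a partial texture.

    image: (H, W, 3) array; uv: (H, W, 2) with u (column) and v (row) in [0, 1];
    mask: (H, W) bool, True where the body is visible.
    Returns the partial texture and a boolean map of observed texels.
    """
    texture = np.zeros((tex_size, tex_size, 3), dtype=image.dtype)
    visible = np.zeros((tex_size, tex_size), dtype=bool)
    # Nearest texel index for every visible body pixel (assumed lookup rule).
    idx = np.clip(np.rint(uv[mask] * (tex_size - 1)).astype(int), 0, tex_size - 1)
    texture[idx[:, 1], idx[:, 0]] = image[mask]
    visible[idx[:, 1], idx[:, 0]] = True
    return texture, visible

def texture_to_image(texture, uv, mask):
    """Gather texels for every body pixel of the target view."""
    tex_size = texture.shape[0]
    out = np.zeros(uv.shape[:2] + (3,), dtype=texture.dtype)
    idx = np.clip(np.rint(uv[mask] * (tex_size - 1)).astype(int), 0, tex_size - 1)
    out[mask] = texture[idx[:, 1], idx[:, 0]]
    return out
```

In the pipeline described above, the inpainter f would sit between these two calls, and the refiner g would take the output of `texture_to_image` computed with the target UV render.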

The main novelty of the approach lies in the texture estimation part. Rather than working directly with the colors of texture elements, the inpainting network works with the coordinates of the texture elements in the source view. Hence, it learns to exploit the natural symmetries of the human body to restore missing texture parts (e.g. using pixels located at the position of the left eye to hallucinate the right one), which makes the resulting textures sharper and more detailed than those produced by color-based inpainting.
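To make the difference from color inpainting concrete, here is a toy NumPy sketch of a coordinate texture. Each texel stores the (x, y) source pixel it was observed at, and colors are sampled from the source image only at the end. The `mirror_fill` function is a hand-written stand-in for the learned inpainter f and assumes, purely for illustration, a texture layout that is symmetric about its vertical axis; all function names are hypothetical.

```python
import numpy as np

def coordinate_texture(uv, mask, tex_size):
    """Store, for each observed texel, the (x, y) source pixel it came from."""
    coords = np.full((tex_size, tex_size, 2), -1.0)  # -1 marks "not observed"
    ys, xs = np.nonzero(mask)
    idx = np.clip(np.rint(uv[ys, xs] * (tex_size - 1)).astype(int), 0, tex_size - 1)
    coords[idx[:, 1], idx[:, 0]] = np.stack([xs, ys], axis=-1)
    return coords

def mirror_fill(coords):
    """Toy stand-in for the inpainter: copy coordinates from mirrored texels."""
    filled = coords.copy()
    missing = filled[..., 0] < 0
    mirrored = coords[:, ::-1]  # assumed left-right symmetric layout
    filled[missing] = mirrored[missing]
    return filled

def sample_colors(image, coords):
    """Look up source colors at the nearest stored coordinates."""
    h, w = image.shape[:2]
    valid = coords[..., 0] >= 0
    x = np.clip(np.rint(coords[..., 0]).astype(int), 0, w - 1)
    y = np.clip(np.rint(coords[..., 1]).astype(int), 0, h - 1)
    texture = image[y, x]      # advanced indexing returns a new array
    texture[~valid] = 0        # texels that are still unobserved stay empty
    return texture
```

Because the filled texels point back at real source pixels, the sampled colors are copied rather than regressed, which is one way to read the sharpness argument made above.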

Comparison with other methods

To evaluate our method, we used the DeepFashion dataset with the same data splits as other state-of-the-art works (1,2): 140,110 training pairs and 8,670 test pairs, where each pair consists of two images of the same person in different poses. Below you can see a qualitative comparison of our method with the other works.

[Figure: qualitative comparison with other methods; first panel: Source Image.]

Garment Transfer

By slightly modifying our pipeline, we can use it as a garment transfer (virtual try-on) system. Here, along with the texture warped onto the target space, we pass the masked ground truth of the target image (revealing only the face and hair) into the refiner g. The modified network quickly learned to copy the revealed areas into the predicted image while seamlessly merging them with the rest of the body.
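A minimal sketch of how such a refiner input could be assembled is shown below. Channel-wise concatenation and the name `refiner_input` are assumptions made for illustration; the page does not state how the two inputs are combined.

```python
import numpy as np

def refiner_input(warped_texture, target_image, reveal_mask):
    """Stack the warped texture with the target image masked to face and hair.

    warped_texture, target_image: (H, W, 3) arrays;
    reveal_mask: (H, W) bool, True on the face and hair pixels.
    Returns an (H, W, 6) array (assumed channel-wise concatenation).
    """
    # Zero out everything except the revealed face and hair region.
    masked_target = target_image * reveal_mask[..., None]
    return np.concatenate([warped_texture, masked_target], axis=-1)
```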