TL;DR: We present a new point-based approach for real-time photo-realistic rendering of complex scenes. Given RGB(D) images and a point cloud reconstruction of a scene, our neural network generates novel views of the scene. The point-based approach achieves compelling results on scenes with thin object parts, such as foliage, that are challenging for mesh-based approaches.
We present a new point-based approach for modeling the appearance of real scenes. The approach uses a raw point cloud as the geometric representation of a scene and augments each point with a learnable neural descriptor that encodes local geometry and appearance. A deep rendering network is learned in parallel with the descriptors, so that new views of the scene can be obtained by passing rasterizations of the point cloud from new viewpoints through this network. The input rasterizations use the learned descriptors as point pseudo-colors. We show that the proposed approach can be used to model complex scenes and obtain their photorealistic views while avoiding explicit surface estimation and meshing. In particular, compelling results are obtained for scenes scanned with hand-held commodity RGB-D sensors as well as with standard RGB cameras, even in the presence of objects that are challenging for standard mesh-based modeling.
Main idea
Given a set of RGB(D) images, we first reconstruct a point cloud of the scene using classic Structure from Motion (SfM) and Multi-View Stereo (MVS) algorithms.
We associate a learnable N-dimensional descriptor (analogous to a 3-dimensional RGB color) with each point in the point cloud. Using the camera poses retrieved by SfM, we project the descriptors onto the image planes and feed these projections to a ConvNet, which is trained to render the scene from the corresponding view. We train the ConvNet jointly with the descriptors to minimize the discrepancy between the predicted rendering and the actual image captured by a real camera.
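To make the projection-and-rendering step concrete, here is a minimal PyTorch sketch of the idea: learnable per-point descriptors are scattered into a pseudo-color image with a crude z-buffer and decoded by a small ConvNet, with gradients flowing into both. All names (`RenderNet`, `rasterize`, the pinhole camera, the L1 loss, the tensor sizes) are simplified placeholders of our own, not the authors' released implementation.

```python
import torch
import torch.nn as nn

D, H, W = 8, 256, 256                             # descriptor size and render resolution

class RenderNet(nn.Module):
    """Tiny stand-in for the U-Net-style rendering network."""
    def __init__(self, in_ch=D):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(in_ch, 32, 3, padding=1), nn.ReLU(),
            nn.Conv2d(32, 3, 3, padding=1), nn.Sigmoid(),
        )
    def forward(self, x):
        return self.net(x)

def rasterize(points, desc, K, R, t, h=H, w=W):
    """Project points through a pinhole camera and keep one descriptor per pixel."""
    cam = points @ R.T + t                        # world -> camera coordinates
    front = cam[:, 2] > 1e-3                      # drop points behind the camera
    cam, d = cam[front], desc[front]
    uv = cam @ K.T
    u = (uv[:, 0] / uv[:, 2]).round().long()      # pixel coordinates
    v = (uv[:, 1] / uv[:, 2]).round().long()
    keep = (u >= 0) & (u < w) & (v >= 0) & (v < h)
    u, v, z, d = u[keep], v[keep], cam[keep, 2], d[keep]
    image = desc.new_zeros(desc.shape[1], h, w)
    order = torch.argsort(z, descending=True)     # crude z-buffer: near points written last
    image[:, v[order], u[order]] = d[order].T     # scatter descriptors as pseudo-colors
    return image.unsqueeze(0)                     # shape (1, D, h, w)

points = torch.rand(10_000, 3) * 2 - 1            # dummy point cloud
descriptors = nn.Parameter(0.01 * torch.randn(10_000, D))
net = RenderNet()
opt = torch.optim.Adam(list(net.parameters()) + [descriptors], lr=1e-3)

K = torch.tensor([[300., 0., W / 2], [0., 300., H / 2], [0., 0., 1.]])
R, t = torch.eye(3), torch.tensor([0., 0., 3.])   # a training camera pose from SfM
photo = torch.rand(1, 3, H, W)                    # the real image captured from that pose

opt.zero_grad()
pred = net(rasterize(points, descriptors, K, R, t))
loss = nn.functional.l1_loss(pred, photo)         # the actual method uses a perceptual loss
loss.backward()                                   # gradients reach both the net and the descriptors
opt.step()
```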
We pretrain our ConvNet on multiple scenes to make it universal. For a novel scene we repeat the training procedure, except that we freeze (or finetune) the pretrained ConvNet and fit only the point descriptors. Once both the descriptors and the network are trained, we can render the scene from an arbitrary viewpoint.
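A hedged sketch of this adaptation step, continuing the toy code above (same `net`, `rasterize`, `D`, `H`, `W`): the pretrained network is frozen and only a fresh set of descriptors for the new scene is optimized. The data tensors are dummies, not the paper's scenes.

```python
new_points = torch.rand(20_000, 3) * 2 - 1            # point cloud of the novel scene
new_descriptors = nn.Parameter(0.01 * torch.randn(20_000, D))
new_views = [(K, R, t, torch.rand(1, 3, H, W))]       # dummy calibrated (pose, photo) pairs

for p in net.parameters():
    p.requires_grad_(False)                           # freeze the pretrained rendering network

opt = torch.optim.Adam([new_descriptors], lr=1e-3)
for K_i, R_i, t_i, photo_i in new_views:
    opt.zero_grad()
    pred = net(rasterize(new_points, new_descriptors, K_i, R_i, t_i))
    loss = nn.functional.l1_loss(pred, photo_i)
    loss.backward()                                   # gradients flow only into the descriptors
    opt.step()
```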
The second version of the paper makes the following contributions:
replaced point cloud splatting with progressive rendering (see the sketch after this list),
collected a high-quality dataset of mannequin people,
added comparisons with new baselines.
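One reading of the progressive-rendering change is that the point cloud is rasterized at several resolutions and the resulting pyramid is consumed by the levels of the rendering U-Net. The sketch below only illustrates how such a multi-scale rasterization could look, reusing the toy `rasterize`, `points`, `descriptors`, `K`, `R`, `t` from the 'Main idea' sketch; the number of scales and intrinsics scaling are our own assumptions, not details taken from the paper.

```python
def rasterize_pyramid(points, desc, K, R, t, num_scales=4):
    """Rasterize the same point cloud at progressively coarser resolutions."""
    pyramid = []
    for s in range(num_scales):
        scale = 0.5 ** s
        K_s = K.clone()
        K_s[:2] *= scale                      # shrink focal lengths and principal point
        pyramid.append(rasterize(points, desc, K_s, R, t,
                                 h=int(H * scale), w=int(W * scale)))
    return pyramid                            # tensors of shape (1, D, H/2^s, W/2^s)

pyramid = rasterize_pyramid(points, descriptors, K, R, t)
```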
Video
Results
We compare the methods on ScanNet (two scenes, following pretraining on 100 other scenes), on People (two people, following pretraining on 102 scenes of 38 other people), as well as on the 'Owl' and 'Plant' scenes (following pretraining on People). Overall, both the quantitative and the qualitative comparisons reveal the advantage of using learnable neural descriptors for neural rendering: Deferred Neural Rendering (DNR) and Neural Point-Based Graphics, which use such descriptors, mostly outperform the other methods.
The relative performance of the two methods that use learnable neural descriptors (ours and DNR) varies across metrics and scenes. Generally, our method performs better on scenes where meshing is problematic, e.g. due to thin objects such as foliage. Conversely, DNR has the advantage whenever a good mesh can be reconstructed.
Our rendering network is lightweight, with 1.96M parameters, and takes 62 ms on a GeForce RTX 2080 Ti to render a FullHD image.
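For reference, a per-frame timing of this kind could be measured as in the sketch below, which counts parameters and averages forward passes on a FullHD-sized input. It reuses the toy `net` and `D` from the sketches above with random data, so the numbers it prints are illustrative, not the paper's figures.

```python
import time

device = 'cuda' if torch.cuda.is_available() else 'cpu'
model = net.to(device).eval()
print('parameters:', sum(p.numel() for p in model.parameters()))

x = torch.rand(1, D, 1080, 1920, device=device)   # a FullHD pseudo-color rasterization
with torch.no_grad():
    for _ in range(10):                           # warm-up passes
        model(x)
    if device == 'cuda':
        torch.cuda.synchronize()
    start = time.perf_counter()
    for _ in range(100):
        model(x)
    if device == 'cuda':
        torch.cuda.synchronize()
print(f'mean render time: {(time.perf_counter() - start) * 10:.1f} ms')
```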
Scene editing
We show a qualitative example of compositing two separately captured scenes. To create it, we took the 'Person 2' and 'Plant' scenes and fitted descriptors for them while keeping the rendering network, pretrained on People, frozen. We then aligned the two point clouds (with their learned descriptors) using a manually chosen rigid transform and rendered the composite. As we illustrate below, the same procedure can be repeated for other objects.
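A minimal sketch of the compositing step: apply a manually chosen rigid transform to one point cloud, concatenate the points and descriptors of both scenes, and render the union with the shared frozen network. All names and values below (the transform, the point counts, the 8-dimensional descriptors) are illustrative placeholders, not the released assets.

```python
import math
import torch

def rigid_transform(points, R, t):
    """Apply a rigid transform x -> R @ x + t to an (N, 3) point cloud."""
    return points @ R.T + t

theta = math.radians(30.0)                               # made-up placement of the plant
R_plant = torch.tensor([[math.cos(theta), -math.sin(theta), 0.],
                        [math.sin(theta),  math.cos(theta), 0.],
                        [0., 0., 1.]])
t_plant = torch.tensor([0.5, 0.0, 0.2])

person_points, person_desc = torch.rand(10_000, 3), torch.randn(10_000, 8)
plant_points, plant_desc = torch.rand(5_000, 3), torch.randn(5_000, 8)

composite_points = torch.cat([person_points,
                              rigid_transform(plant_points, R_plant, t_plant)])
composite_desc = torch.cat([person_desc, plant_desc])
# composite_points / composite_desc can now be passed to rasterize() and the frozen
# rendering network from the sketches above to render the joint scene.
```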