Given a set of RGB(D) images, we first reconstruct a point cloud of the scene using classic Structure from Motion (SfM) and Multi-View Stereo (MVS) algorithms. We associate a learnable N-dimensional descriptor (analogous to a 3-dimensional RGB color) with each point in the point cloud. Using the camera poses retrieved from SfM, we project the descriptors onto the image planes and feed those projections to a ConvNet, which is trained to render the scene from the corresponding view. We train the ConvNet jointly with the descriptors to minimize the discrepancy between the predicted rendering and the actual image captured by a real camera.
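The projection step described above can be sketched as follows. This is a minimal illustration, not the project's actual rasterizer: it assumes a pinhole camera without skew, a world-to-camera pose given as rotation `R` and translation `t`, one pixel per point, and a z-buffer to resolve occlusions. None of these details are specified in the text above, and the function name `project_descriptors` is hypothetical.

```python
def project_descriptors(points, descriptors, K, R, t, height, width):
    """Rasterize per-point descriptors into an H x W x N image (illustrative only).

    points:      list of (x, y, z) world coordinates
    descriptors: list of N-dimensional lists, one per point
    K:           3x3 pinhole intrinsics (skew is ignored, an assumption)
    R, t:        assumed world-to-camera rotation (3x3) and translation (3,)
    """
    n = len(descriptors[0])
    image = [[[0.0] * n for _ in range(width)] for _ in range(height)]
    depth = [[float("inf")] * width for _ in range(height)]

    for p, d in zip(points, descriptors):
        # Transform the point from world to camera coordinates.
        cam = [sum(R[i][j] * p[j] for j in range(3)) + t[i] for i in range(3)]
        if cam[2] <= 0:
            continue  # the point lies behind the camera

        # Perspective projection to pixel coordinates.
        u = K[0][0] * cam[0] / cam[2] + K[0][2]
        v = K[1][1] * cam[1] / cam[2] + K[1][2]
        col, row = int(round(u)), int(round(v))

        # Z-buffer test: keep only the closest point for each pixel.
        if 0 <= row < height and 0 <= col < width and cam[2] < depth[row][col]:
            depth[row][col] = cam[2]
            image[row][col] = list(d)

    return image, depth
```

The resulting N-channel image plays the role of the "projection" that is fed to the ConvNet; pixels that no point lands on keep a zero descriptor in this sketch.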
We pretrain our ConvNet on multiple scenes to make it universal. For a novel scene, we repeat the training procedure, except that we freeze (or fine-tune) the pretrained ConvNet and fit the point descriptors. Once both the descriptors and the network are trained, we can render the scene from an arbitrary viewpoint.
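The "freeze the network, fit the descriptors" idea can be illustrated with a toy example. The sketch below swaps the ConvNet for a fixed linear map and uses plain gradient descent on a squared error; both are stand-ins chosen for brevity, not the actual model or loss, and the name `fit_descriptors` is hypothetical. The point is only that gradients update the descriptors while the renderer weights are left untouched.

```python
def fit_descriptors(descriptors, targets, weights, lr=0.1, steps=200):
    """Fit per-point descriptors against a frozen toy 'renderer'.

    The renderer here is a fixed linear map (dot product with `weights`),
    an assumption standing in for the pretrained ConvNet.
    """
    descriptors = [list(d) for d in descriptors]  # work on a copy
    for _ in range(steps):
        for d, target in zip(descriptors, targets):
            pred = sum(w * x for w, x in zip(weights, d))
            err = pred - target
            # Gradient of 0.5 * err**2 w.r.t. the descriptor is err * weights.
            # Only the descriptor is updated; `weights` stays frozen.
            for k, w in enumerate(weights):
                d[k] -= lr * err * w
    return descriptors
```

In the fine-tuning variant mentioned above, the renderer weights would also receive gradient updates, typically starting from their pretrained values.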
The second version of the paper makes the following contributions:
We compare the methods on ScanNet (two scenes, following pretraining on 100 other scenes), on People (two people, following pretraining on 102 scenes of 38 other people), and on the 'Owl' and 'Plant' scenes (following pretraining on People). Generally, both the quantitative and the qualitative comparisons reveal the advantage of using learnable neural descriptors for neural rendering. Indeed, Deferred Neural Rendering (DNR) and Neural Point-Based Graphics, which use such learnable descriptors, mostly outperform the other methods.
The relative performance of the two methods that use learnable neural descriptors (ours and DNR) varies across metrics and scenes. Generally, our method performs better on scenes where meshing is problematic, e.g. due to thin objects such as foliage. Conversely, DNR has an advantage whenever a good mesh can be reconstructed.
Our rendering network is lightweight, with 1.96M parameters, and takes 62 ms on a GeForce RTX 2080 Ti to render a Full HD image (about 16 frames per second).