NerfDiff: Single-image View Synthesis with
NeRF-guided Distillation from 3D-aware Diffusion

ICML 2023 (Supplementary Material)


¹Apple   ²UC San Diego   ³MPI   ⁴UPenn


Abstract

Novel view synthesis from a single image requires inferring occluded regions of objects and scenes while maintaining semantic and physical consistency with the input. Existing approaches condition neural radiance fields (NeRF) on local image features, projecting points onto the input image plane and aggregating 2D features to perform volume rendering. However, under severe occlusion this projection fails to resolve uncertainty, resulting in blurry renderings that lack detail. In this work, we propose NerfDiff, which addresses this issue by distilling the knowledge of a 3D-aware conditional diffusion model (CDM) into the NeRF by synthesizing and refining a set of virtual views at test time. We further propose a novel NeRF-guided distillation algorithm that simultaneously generates 3D-consistent virtual views from the CDM samples and fine-tunes the NeRF based on the improved virtual views. Our approach significantly outperforms existing NeRF-based and geometry-free approaches on challenging datasets, including ShapeNet, ABO, and Clevr3D.



Overview of Our Method


Pipeline

Our method comprises a training stage and a test-time fine-tuning stage. We first train the single-image NeRF together with a 2D CDM that is conditioned on the single-image NeRF renderings (left). At test time, we use the learned network parameters to predict an initial NeRF representation, which is then fine-tuned: the NeRF-guided denoised images from the frozen CDM supervise the NeRF in turn (right).
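
As a minimal sketch of this training stage, the joint objective could look like the following, where StubNeRF, StubCDM, and training_step are illustrative placeholders (the real model uses a full UNet, triplane volume rendering, and a timestep-conditioned diffusion objective):

    import torch
    import torch.nn as nn

    class StubNeRF(nn.Module):
        """Stands in for the single-image NeRF: image + pose -> rendered target view."""
        def __init__(self):
            super().__init__()
            self.net = nn.Conv2d(3, 3, 3, padding=1)

        def forward(self, image, pose):
            # The real model encodes the image to a triplane and volume-renders at `pose`.
            return self.net(image)

    class StubCDM(nn.Module):
        """Stands in for the 2D CDM: predicts noise given a noisy view and a rendering."""
        def __init__(self):
            super().__init__()
            self.net = nn.Conv2d(6, 3, 3, padding=1)

        def forward(self, noisy, cond):
            return self.net(torch.cat([noisy, cond], dim=1))

    def training_step(nerf, cdm, input_image, target_image, target_pose):
        # Photometric loss on the NeRF rendering of the target view.
        rendering = nerf(input_image, target_pose)
        nerf_loss = (rendering - target_image).pow(2).mean()
        # Denoising loss for the CDM, conditioned on the (detached) rendering.
        # A single fixed noise level is used here for brevity.
        noise = torch.randn_like(target_image)
        pred_noise = cdm(target_image + noise, rendering.detach())
        cdm_loss = (pred_noise - noise).pow(2).mean()
        return nerf_loss + cdm_loss

    # Example usage with random tensors in place of a real batch.
    loss = training_step(StubNeRF(), StubCDM(),
                         torch.randn(1, 3, 64, 64), torch.randn(1, 3, 64, 64), None)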

Architecture

Details of the training pipeline of the single-image NeRF in NerfDiff. A UNet first maps the input image to a camera-aligned, triplane-based NeRF representation. The triplane efficiently conditions volume rendering at a target view, producing an initial rendering; this rendering in turn conditions the diffusion process, so the CDM can denoise consistently at that target pose.
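
For intuition on the triplane conditioning, an EG3D-style feature lookup can be sketched as follows; sample_triplane and the mean aggregation are assumptions for illustration, not necessarily the paper's exact design:

    import torch
    import torch.nn.functional as F

    def sample_triplane(planes, points):
        """planes: (3, C, H, W) feature maps for the xy, xz, and yz planes.
        points: (N, 3) sample points in [-1, 1]^3 in camera-aligned coordinates.
        Returns (N, C) aggregated features to be decoded into density and color."""
        coords = (points[:, [0, 1]], points[:, [0, 2]], points[:, [1, 2]])
        feats = []
        for plane, uv in zip(planes, coords):
            # grid_sample expects a (1, H_out, W_out, 2) grid; use one row of N points.
            grid = uv.view(1, 1, -1, 2)
            f = F.grid_sample(plane.unsqueeze(0), grid, align_corners=False)  # (1, C, 1, N)
            feats.append(f.view(plane.shape[0], -1).t())                      # (N, C)
        return torch.stack(feats).mean(0)  # the paper may sum or concatenate instead

    # Example: 1024 points queried against 32-channel 64x64 planes.
    feats = sample_triplane(torch.randn(3, 32, 64, 64), torch.rand(1024, 3) * 2 - 1)

A small MLP would then decode these per-point features into density and color for volume rendering.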

NeRF-Guided Distillation

The core of the proposed method is "NeRF-Guided Distillation", which distills the knowledge of a 3D-aware CDM into the single-image NeRF from multiple virtual views to generate high-quality images. Meanwhile, the multi-view diffusion process is guided by the NeRF representation to preserve the 3D consistency of the diffusion. The algorithm is outlined below:
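
A minimal sketch of the loop, assuming a CDM object that exposes hypothetical denoise (clean-image estimate at noise level t, conditioned on a rendering) and renoise (diffuse an estimate back to the previous noise level) helpers; the guidance blend and all hyper-parameters below are illustrative, not the paper's exact update rules:

    import torch

    def nerf_guided_distillation(nerf, cdm, input_image, virtual_poses,
                                 n_rounds=4, n_steps=32, guidance=0.5, lr=1e-3):
        opt = torch.optim.Adam(nerf.parameters(), lr=lr)
        for _ in range(n_rounds):
            with torch.no_grad():
                # 1) Render every virtual view with the current NeRF.
                renderings = [nerf(input_image, p) for p in virtual_poses]
                # 2) NeRF-guided denoising: at each reverse step, pull the CDM's
                #    clean estimate toward the NeRF rendering for 3D consistency.
                targets = []
                for rendering in renderings:
                    x = torch.randn_like(rendering)
                    for t in reversed(range(n_steps)):
                        x0_hat = cdm.denoise(x, t, cond=rendering)   # hypothetical helper
                        x0_hat = guidance * rendering + (1 - guidance) * x0_hat
                        x = cdm.renoise(x0_hat, t)                   # hypothetical helper
                    targets.append(x0_hat)
            # 3) Fine-tune the NeRF on the improved virtual views.
            for pose, target in zip(virtual_poses, targets):
                loss = (nerf(input_image, pose) - target).pow(2).mean()
                opt.zero_grad()
                loss.backward()
                opt.step()
        return nerf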


Results on Various Datasets


Please click on the dataset names to see more video results.

ShapeNet Cars Dataset



ShapeNet Chairs Dataset



ABO Dataset



Clevr Dataset





Citation


    
    @inproceedings{gu2023nerfdiff,
      title={NerfDiff: Single-image View Synthesis with NeRF-guided Distillation from 3D-aware Diffusion}, 
      author={Jiatao Gu and Alex Trevithick and Kai-En Lin and Josh Susskind and Christian Theobalt and Lingjie Liu and Ravi Ramamoorthi},
      year={2023},
      booktitle={International Conference on Machine Learning}
    }