NerfDiff: Single-image View Synthesis with NeRF-guided Distillation from 3D-aware Diffusion
Supplementary Material


arXiv 2023

Apple · UC San Diego · MPI · UPenn


Novel view synthesis from a single image requires inferring occluded regions of objects and scenes while maintaining semantic and physical consistency with the input. Existing approaches condition neural radiance fields (NeRF) on local image features, projecting points onto the input image plane and aggregating 2D features to perform volume rendering. However, under severe occlusion, this projection fails to resolve uncertainty, resulting in blurry renderings that lack detail. In this work, we propose NerfDiff, which addresses this issue by distilling the knowledge of a 3D-aware conditional diffusion model (CDM) into NeRF through synthesizing and refining a set of virtual views at test time. We further propose a novel NeRF-guided distillation algorithm that simultaneously generates 3D-consistent virtual views from the CDM samples and fine-tunes the NeRF based on the improved virtual views. Our approach significantly outperforms existing NeRF-based and geometry-free approaches on challenging datasets including ShapeNet, ABO, and Clevr3D.

Overview of Our Method


Our method comprises a training stage and a test-time fine-tuning stage. We first jointly learn the single-image NeRF and a 2D CDM that is conditioned on the single-image NeRF renderings (left). At test time, we use the learned network parameters to predict an initial NeRF representation for fine-tuning. The NeRF-guided denoised images from the frozen CDM then supervise the NeRF in turn (right).


Details of the training pipeline of the single-image NeRF for NerfDiff. Using a UNet, we first map the input image to a camera-aligned, triplane-based NeRF representation. Volume rendering through this triplane from a target view efficiently produces an initial rendering, which then conditions the diffusion process so that the CDM denoises consistently at that target pose.
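To make the triplane conditioning concrete, here is a minimal sketch of how a feature vector for a 3D sample point can be gathered from three axis-aligned feature planes before being decoded into density and color. The nearest-neighbor lookup, the plane-to-coordinate assignment, and the averaging aggregation are illustrative assumptions, not the paper's exact implementation.

```python
import numpy as np

def sample_triplane(triplane, points):
    """Gather per-point features from three axis-aligned feature planes.

    triplane: array of shape (3, C, H, W) -- the XY, XZ, and YZ planes.
    points:   array of shape (N, 3), coordinates normalized to [-1, 1]^3.
    Returns:  array of shape (N, C), averaged across the three planes.
    """
    C, H, W = triplane.shape[1:]
    feats = np.zeros((len(points), C))
    # Which pair of point coordinates indexes each plane (an assumption).
    planes = [(0, 1), (0, 2), (1, 2)]
    for p_idx, (a, b) in enumerate(planes):
        # Map [-1, 1] coordinates to integer pixel indices (nearest neighbor).
        u = np.clip(((points[:, a] + 1) / 2 * (W - 1)).astype(int), 0, W - 1)
        v = np.clip(((points[:, b] + 1) / 2 * (H - 1)).astype(int), 0, H - 1)
        feats += triplane[p_idx][:, v, u].T  # (C, N) -> (N, C)
    return feats / 3.0  # aggregate by averaging (sum/concat also plausible)
```

In a real pipeline these features would be bilinearly interpolated and fed to a small MLP that predicts density and color for volume rendering; the sketch only shows the lookup structure.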

NeRF-Guided Distillation

The core algorithm of the proposed method is "NeRF-guided distillation", which distills the knowledge of a 3D-aware CDM into the single-image NeRF from multiple virtual views, yielding high-quality images. Meanwhile, the multi-view diffusion process is guided by the NeRF representation to preserve the 3D consistency of the generated views. The details of the algorithm are shown below:
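As an illustration of the alternating structure described above, the following toy sketch guides each reverse-diffusion step toward the current NeRF rendering and then updates the NeRF from the refined virtual view. The `render` and `denoise` callables, the fixed guidance weight, the re-noising schedule, and the scalar parameter update are all stand-ins, not the paper's actual networks or update rules.

```python
import numpy as np

def nerf_guided_distillation(nerf_params, render, denoise, poses,
                             num_steps=10, guidance=0.5, lr=0.1):
    """Toy NeRF-guided distillation loop; every component is a stand-in."""
    rng = np.random.default_rng(0)
    for pose in poses:
        # 1) Render the current NeRF estimate at a virtual camera pose.
        x_nerf = render(nerf_params, pose)
        # 2) Run the frozen CDM's reverse process, pulling each clean-image
        #    prediction toward the NeRF rendering to keep views 3D-consistent.
        x = rng.standard_normal(x_nerf.shape)  # start from pure noise
        for t in range(num_steps, 0, -1):
            x0_hat = denoise(x, t)                                # CDM's clean-image estimate
            x0_hat = (1 - guidance) * x0_hat + guidance * x_nerf  # NeRF guidance step
            # Re-noise to the next (lower) noise level.
            x = x0_hat + 0.1 * (t - 1) * rng.standard_normal(x.shape)
        # 3) Fine-tune the NeRF on the refined virtual view (toy scalar update).
        nerf_params = nerf_params + lr * (x0_hat.mean() - nerf_params)
    return nerf_params
```

The key design choice the sketch reflects is that guidance and fine-tuning alternate: the NeRF keeps the CDM's samples mutually consistent across poses, and the refined samples in turn sharpen the NeRF in occluded regions.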

Results on Various Datasets

Please click on the dataset names to see more video results.

ShapeNet Cars Dataset

ShapeNet Chairs Dataset

ABO Dataset

Clevr Dataset


Citation

@article{gu2023nerfdiff,
      title={NerfDiff: Single-image View Synthesis with NeRF-guided Distillation from 3D-aware Diffusion},
      author={Jiatao Gu and Alex Trevithick and Kai-En Lin and Josh Susskind and Christian Theobalt and Lingjie Liu and Ravi Ramamoorthi},
      year={2023}
}