🤖 AI Summary
This work addresses the inherent uncertainty in 3D scene reconstruction from limited observations—such as a single view, sparse pixels, or noisy images—by proposing a probabilistic framework that integrates Neural Radiance Fields (NeRF) with score-based diffusion models. The method represents the 3D scene as a stochastic latent variable, employs NeRF to model the likelihood of observations, and leverages a diffusion model to learn the prior distribution over the latent space. Crucially, it introduces, for the first time, a score-based diffusion mechanism to sample from the posterior distribution of the latent variables, enabling a unified treatment of uncertainty across diverse observation conditions. A two-stage training strategy jointly optimizes the reconstruction and prior networks, achieving high-fidelity 3D reconstructions under various settings—including single-view, multi-view, noisy images, sparse pixels, and depth inputs—while faithfully capturing task-specific uncertainties.
📝 Abstract
The remarkable achievements of both generative models of 2D images and neural field representations for 3D scenes present a compelling opportunity to integrate the strengths of both approaches. In this work, we propose a methodology that combines a NeRF-based representation of 3D scenes with probabilistic modeling and reasoning using diffusion models. We view 3D reconstruction as a perception problem with inherent uncertainty that can thereby benefit from probabilistic inference methods. The core idea is to represent the 3D scene as a stochastic latent variable for which we can learn a prior and use it to perform posterior inference given a set of observations. We formulate posterior sampling using the score-based inference method of diffusion models in conjunction with a likelihood term computed from a reconstruction model that includes volumetric rendering. We train the model using a two-stage process: first we train the reconstruction model while auto-decoding the latent representations for a dataset of 3D scenes, and then we train the prior over the latents using a diffusion model. By using the model to generate samples from the posterior we demonstrate that various 3D reconstruction tasks can be performed, differing by the type of observation used as inputs. We showcase reconstruction from single-view, multi-view, noisy images, sparse pixels, and sparse depth data. These observations vary in the amount of information they provide for the scene and we show that our method can model the varying levels of inherent uncertainty associated with each task. Our experiments illustrate that this approach yields a comprehensive method capable of accurately predicting 3D structure from diverse types of observations.