🤖 AI Summary
This work addresses the challenge of recovering complete 3D scene geometry from a single RGB image when occlusions hide parts of the scene, and proposes a generative reconstruction framework, VolFill. Departing from pixel-aligned regression, per-ray constraints, and unstructured point-cloud queries, the method uses a structured volumetric representation that supports direct surface extraction and occupancy queries at scale. A hybrid 3D variational autoencoder (VAE) compresses sparse truncated unsigned distance function grids into a compact latent space, and a latent Diffusion Transformer denoises that latent representation to recover the complete scene, conditioned on spatial priors from geometry foundation models. Notably, this is the first application of flow matching to single-view, amodal 3D reconstruction. Evaluated on the SCRREAM and NRGB-D datasets, the approach significantly outperforms current baselines, yielding more complete, accurate, and structurally coherent scene reconstructions.
📝 Abstract
Reconstructing the complete geometry of a scene from a single RGB image remains challenging, especially when inferring hidden structures for which visual evidence is incomplete. We introduce VolFill, a generative framework that predicts the 3D structure of the complete scene rather than relying on traditional pixel-aligned regression. Our method uses a hybrid 3D VAE to compress sparse truncated unsigned distance function grids into a compact latent space, paired with a latent Diffusion Transformer that denoises this representation to recover the complete scene. We condition the generation on features from geometry foundation models, leveraging their rich spatial priors for robust reasoning. Unlike existing methods limited by per-ray constraints or unstructured point-cloud queries, VolFill provides a structured representation that supports direct surface extraction and occupancy queries at scale. Extensive experiments on the SCRREAM and NRGB-D datasets demonstrate that our approach significantly outperforms current baselines, providing a robust foundation for holistic spatial understanding.
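
To make the representation concrete, the following minimal NumPy sketch builds a truncated unsigned distance function grid from surface points and runs an occupancy query on it. It is an illustration only, not the VolFill implementation: the function names `tudf_grid` and `occupancy`, the brute-force nearest-point search, and the half-voxel occupancy threshold are all assumptions made for this example.

```python
import numpy as np


def tudf_grid(points, grid_min, voxel_size, resolution, truncation):
    """Truncated unsigned distance from each voxel center to the nearest point.

    Hypothetical helper for illustration: brute-force search, cubic grid.
    """
    points = np.asarray(points, dtype=float)
    # Voxel-center coordinates along each axis.
    axes = [grid_min[i] + (np.arange(resolution) + 0.5) * voxel_size
            for i in range(3)]
    centers = np.stack(np.meshgrid(*axes, indexing="ij"), axis=-1)
    flat = centers.reshape(-1, 3)
    # Distance from every voxel center to every surface point, then the minimum.
    dists = np.linalg.norm(flat[:, None, :] - points[None, :, :], axis=-1)
    nearest = dists.min(axis=1)
    # Clamp distances far from the surface to the truncation value.
    clipped = np.minimum(nearest, truncation)
    return clipped.reshape(resolution, resolution, resolution)


def occupancy(tudf, threshold):
    """Mark voxels whose unsigned distance to the surface is within threshold."""
    return tudf <= threshold


# Toy scene: a flat surface sampled at the voxel centers of one grid layer.
resolution, voxel_size = 8, 0.125
coords = (np.arange(resolution) + 0.5) * voxel_size
xs, ys = np.meshgrid(coords, coords, indexing="ij")
plane = np.stack([xs.ravel(), ys.ravel(), np.full(xs.size, coords[3])], axis=-1)

grid = tudf_grid(plane, (0.0, 0.0, 0.0), voxel_size, resolution, truncation=0.25)
occ = occupancy(grid, threshold=0.5 * voxel_size)
```

Because the grid is a dense array, an occupancy query is a single comparison over the whole volume, and a surface could be extracted from the same array with a standard isosurface routine. That is the practical benefit of a structured representation over per-ray or point-cloud queries that the abstract refers to.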