🤖 AI Summary
Existing methods leveraging pre-trained diffusion models for 3D scene reconstruction suffer from two key limitations: (1) insufficient geometric supervision, leading to poor reconstruction quality in both observed and unobserved regions; and (2) multi-view generation inconsistency, causing shape-appearance ambiguity. This paper proposes a geometry-guided diffusion reconstruction framework. First, it estimates metric depth using a planar structural prior to establish a reliable geometric foundation for generation. Then, geometric constraints are embedded throughout the diffusion process to jointly optimize visibility masks, novel-view sampling, and multi-view consistency. The method integrates pre-trained diffusion models, Gaussian splatting rendering, video diffusion inpainting, and multi-view consistency regularization. Evaluated on Replica, ScanNet++, and DeepBlending, it significantly outperforms state-of-the-art approaches—particularly improving geometric completeness in unobserved regions and appearance fidelity. It supports both single-view and pose-free video inputs, and generalizes robustly across indoor and outdoor scenes.
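The summary above describes a staged pipeline: planar-prior metric depth, visibility masking, and geometry-constrained completion. A minimal illustrative sketch of that dataflow is below; every function and parameter here is a placeholder assumption, not the paper's actual code or API.

```python
# Illustrative sketch of the geometry-guided reconstruction flow described
# above. All names (estimate_metric_depth, visibility_mask, plane_scale,
# max_range) are hypothetical stand-ins, not the paper's implementation.

def estimate_metric_depth(raw_depth, plane_scale):
    """Rescale relative depth to metric scale using a planar prior (stub)."""
    return [d * plane_scale for d in raw_depth]

def visibility_mask(depth, max_range=5.0):
    """Mark pixels whose metric depth falls inside a trusted range (stub)."""
    return [0.0 < d <= max_range for d in depth]

def reconstruct(raw_depth, plane_scale=2.0):
    depth = estimate_metric_depth(raw_depth, plane_scale)
    mask = visibility_mask(depth)
    # In the full method, pixels outside the mask would be completed by
    # video-diffusion inpainting, constrained by the same metric depth so
    # that generated views stay multi-view consistent.
    observed = [d for d, m in zip(depth, mask) if m]
    return {"depth": depth, "mask": mask, "observed": observed}

result = reconstruct([0.5, 1.2, 4.0])
print(result["mask"])  # which pixels the geometry deems reliable
```

The point of the sketch is only the ordering: geometry (metric depth) is established first, and every later stage (masking, view selection, inpainting) consumes it rather than re-estimating scale.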
📝 Abstract
Despite recent advances in leveraging generative priors from pre-trained diffusion models for 3D scene reconstruction, existing methods still face two critical limitations. First, due to the lack of reliable geometric supervision, they struggle to produce high-quality reconstructions even in observed regions, let alone in unobserved areas. Second, they lack effective mechanisms to mitigate multi-view inconsistencies in the generated images, leading to severe shape-appearance ambiguities and degraded scene geometry. In this paper, we identify accurate geometry as the fundamental prerequisite for effectively exploiting generative models to enhance 3D scene reconstruction. We first propose to leverage the prevalence of planar structures to derive accurate metric-scale depth maps, providing reliable supervision in both observed and unobserved regions. Furthermore, we incorporate this geometry guidance throughout the generative pipeline to improve visibility mask estimation, guide novel view selection, and enhance multi-view consistency when inpainting with video diffusion models, resulting in accurate and consistent scene completion. Extensive experiments on Replica, ScanNet++, and DeepBlending show that our method consistently outperforms existing baselines in both geometry and appearance reconstruction, particularly in unobserved regions. Moreover, our method naturally supports single-view inputs and unposed videos, and generalizes well to both indoor and outdoor scenes, demonstrating practical real-world applicability. The project page is available at https://dali-jack.github.io/g4splat-web/.