🤖 AI Summary
Existing 360° scene reconstruction methods struggle with sparse, uncalibrated 2D images lacking camera poses. To address this, we propose the first end-to-end reconstruction framework requiring no pose priors. Our method introduces a depth-augmented diffusion prior to jointly guide novel view synthesis and depth estimation; employs a FiLM-based modulation mechanism to unify geometric and contextual feature representation; designs a Gaussian point cloud confidence metric to detect artifacts; and establishes a Gaussian-SLAM-style progressive multi-view fusion pipeline. Leveraging 3D Gaussian splatting and confidence-weighted fusion, our approach significantly outperforms prior pose-free methods on MipNeRF360 and DL3DV-10K, achieving reconstruction completeness and multi-view consistency on par with state-of-the-art pose-aware approaches.
📄 Abstract
In this work, we introduce a generative approach for pose-free (without camera parameters) reconstruction of 360° scenes from a sparse set of 2D images. Reconstruction from such incomplete, pose-free observations is usually regularized with depth estimation or 3D foundational priors. While recent advances have enabled sparse-view reconstruction of large, complex scenes (with a high degree of foreground and background detail) with known camera poses using view-conditioned generative priors, these methods cannot be directly adapted to the pose-free setting, where ground-truth poses are not available during evaluation. To address this, we propose an image-to-image generative model designed to inpaint missing details and remove artifacts in novel view renders and depth maps of a 3D scene. We introduce context and geometry conditioning using Feature-wise Linear Modulation (FiLM) layers as a lightweight alternative to cross-attention, and also propose a novel confidence measure for 3D Gaussian splat representations to allow for better detection of these artifacts. By progressively integrating these novel views in a Gaussian-SLAM-inspired process, we achieve a multi-view-consistent 3D representation. Evaluations on the MipNeRF360 and DL3DV-10K benchmark datasets demonstrate that our method surpasses existing pose-free techniques and performs competitively with state-of-the-art posed (precomputed camera parameters are given) reconstruction methods in complex 360° scenes.
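The FiLM conditioning mentioned above amounts to predicting a per-channel scale (gamma) and shift (beta) from a conditioning vector and applying them to a feature map, which is much cheaper than cross-attention. Below is a minimal NumPy sketch of this operation; the function name, shapes, and the single linear projection are illustrative assumptions, not the paper's actual architecture.

```python
import numpy as np

rng = np.random.default_rng(0)

def film_modulate(features, cond, w, b):
    """FiLM sketch: a linear map from the conditioning vector `cond`
    predicts per-channel scale (gamma) and shift (beta), which then
    modulate the feature map channel-wise."""
    # features: (B, C, H, W); cond: (B, D); w: (D, 2C); b: (2C,)
    gamma_beta = cond @ w + b                     # (B, 2C)
    C = features.shape[1]
    gamma, beta = gamma_beta[:, :C], gamma_beta[:, C:]
    # Broadcast gamma/beta over the spatial dimensions.
    return gamma[:, :, None, None] * features + beta[:, :, None, None]

B, C, H, W, D = 2, 4, 3, 3, 8
features = rng.standard_normal((B, C, H, W))
cond = rng.standard_normal((B, D))      # e.g. context/geometry embedding
w = rng.standard_normal((D, 2 * C))
b = np.zeros(2 * C)
out = film_modulate(features, cond, w, b)
print(out.shape)  # (2, 4, 3, 3)
```

Because gamma and beta are shared across spatial locations, the cost is one small matrix multiply per conditioning vector, independent of the feature map's resolution.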