🤖 AI Summary
Reconstructing unbounded outdoor scenes from sparse, outward-facing views remains challenging due to limited inter-view overlap, structural incompleteness, and ambiguous occluded regions.
Method: We propose a hierarchical implicit feature extrapolation paradigm that jointly extrapolates geometry and appearance across views in a single inference pass. Our approach decouples feature extrapolation from primitive decoding and integrates pretrained vision foundation models to inject robust global priors. The method employs a two-stage neural architecture combining cross-scene prior learning with virtual view synthesis to construct transferable implicit 3D representations.
Contribution/Results: Using only six sparse input views, our method reconstructs complete 360° scenes, achieving state-of-the-art performance on both synthetic and real-world benchmarks. It significantly improves occlusion recovery accuracy, enables real-time rendering, and outputs high-fidelity, structurally consistent implicit features.
📝 Abstract
Reconstructing unbounded outdoor scenes from sparse outward-facing views poses significant challenges due to minimal view overlap. Previous methods often lack cross-scene understanding, and their primitive-centric formulations overload local features to compensate for missing global context, resulting in blurriness in unseen parts of the scene. We propose sshELF, a fast, single-shot pipeline for sparse-view 3D scene reconstruction via hierarchical extrapolation of latent features. Our key insight is that disentangling information extrapolation from primitive decoding allows efficient transfer of structural patterns across training scenes. Our method: (1) learns cross-scene priors to generate intermediate virtual views that extrapolate to unobserved regions, (2) offers a two-stage network design separating virtual view generation from 3D primitive decoding for efficient training and modular model design, and (3) integrates a pre-trained foundation model for joint inference of latent features and texture, improving scene understanding and generalization. sshELF can reconstruct 360° scenes from six sparse input views and achieves competitive results on synthetic and real-world datasets. We find that sshELF faithfully reconstructs occluded regions, supports real-time rendering, and provides rich latent features for downstream applications. The code will be released.
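To make the two-stage design concrete, below is a minimal sketch of the dataflow the abstract describes: latent features are first extracted from the sparse input views, virtual-view latents are then extrapolated (stage 1), and only afterwards are all latents decoded into 3D primitives (stage 2). All module names, tensor shapes, the number of virtual views, and the use of Gaussian-style primitive parameters are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def extract_features(views, dim=32, seed=0):
    """Foundation-model stand-in: map each input view to a latent vector.
    (In sshELF this role is played by a pre-trained vision foundation model.)"""
    rng = np.random.default_rng(seed)
    W = rng.standard_normal((views.shape[-1], dim))
    return views @ W  # (n_views, dim)

def extrapolate_virtual_views(latents, n_virtual=6, seed=1):
    """Stage 1 (assumed form): synthesize virtual-view latents as learned
    mixtures of observed latents, extrapolating toward unobserved regions.
    A random convex mixture stands in for the trained extrapolation network."""
    rng = np.random.default_rng(seed)
    mix = rng.dirichlet(np.ones(latents.shape[0]), size=n_virtual)
    return mix @ latents  # (n_virtual, dim)

def decode_primitives(latents, n_params=14, seed=2):
    """Stage 2 (assumed form): decode each observed + virtual latent into
    per-primitive parameters (e.g. position/scale/rotation/opacity/color)."""
    rng = np.random.default_rng(seed)
    W = rng.standard_normal((latents.shape[1], n_params))
    return latents @ W  # (n_latents, n_params)

# Six sparse outward-facing input views, flattened to 64-dim placeholders.
views = np.random.default_rng(3).standard_normal((6, 64))
obs = extract_features(views)                       # (6, 32)
virt = extrapolate_virtual_views(obs)               # (6, 32)
prims = decode_primitives(np.vstack([obs, virt]))   # (12, 14)
print(prims.shape)
```

The key structural point the sketch preserves is the decoupling: the extrapolation step operates purely on latents and never touches primitive parameters, which is what allows cross-scene structural priors to be learned independently of the decoder.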