🤖 AI Summary
This work addresses novel view synthesis from sparse, unposed 2D images, without camera pose annotations, 3D priors, or explicit geometric representations (e.g., NeRF or 3D Gaussian Splatting). The authors propose the first purely 2D end-to-end neural rendering framework, which learns implicit geometry and appearance via large-scale self-supervised image reconstruction, eliminating 3D inductive biases and pose dependencies entirely. The key contribution is the empirical discovery of an inverse relationship: *the less a method relies on 3D knowledge, the greater its performance gain from data scaling*, establishing a new "de-3Dified" paradigm for view synthesis. Remarkably, the approach achieves high-fidelity, geometrically consistent novel views without any input pose information, matching state-of-the-art methods that require precise camera poses. This constitutes the first rigorous empirical validation of fully data-driven, pose-free novel view synthesis.
📝 Abstract
We consider the problem of generalizable novel view synthesis (NVS), which aims to generate photorealistic novel views from sparse or even unposed 2D images without per-scene optimization. This task remains fundamentally challenging, as it requires inferring 3D structure from incomplete and ambiguous 2D observations. Early approaches typically rely on strong 3D knowledge, including architectural 3D inductive biases (e.g., embedding explicit 3D representations such as NeRF or 3DGS into the network design) and ground-truth camera poses for both input and target views. While recent efforts have sought to reduce the 3D inductive bias or the dependence on known camera poses for input views, critical questions regarding the role of 3D knowledge, and whether it can be dispensed with entirely, remain under-explored. In this work, we conduct a systematic analysis of 3D knowledge and uncover a critical trend: the performance of methods that require less 3D knowledge improves faster as data scales, eventually reaching parity with their 3D knowledge-driven counterparts. This highlights the increasing importance of reducing dependence on 3D knowledge in the era of large-scale data. Motivated by this trend, we propose a novel NVS framework that minimizes 3D inductive bias and pose dependence for both input and target views. By eliminating this 3D knowledge, our method fully leverages data scaling and learns implicit 3D awareness directly from sparse 2D images, without any 3D inductive bias or pose annotation during training. Extensive experiments demonstrate that our model generates photorealistic and 3D-consistent novel views, achieving performance comparable even to methods that rely on posed inputs, thereby validating the feasibility and effectiveness of our data-centric paradigm. Project page: https://pku-vcl-geometry.github.io/Less3Depend/ .