AI Summary
This paper addresses the challenging problem of scene reconstruction and novel view synthesis (NVS) from sparse, uncalibrated views, where neither camera poses nor intrinsic parameters are known a priori. We propose an end-to-end unified framework that jointly optimizes 3D Gaussian splatting representations and full camera parameters. Our key contributions are: (1) treating camera parameters and 3D Gaussians as parallel, differentiable learnable queries; (2) introducing Camera-aware Multi-view Deformable Cross-Attention (CaMDFA) to strengthen geometry-rendering coupling; and (3) enforcing pose stability via Ray Reference Point (RayRef)-guided RQ decomposition constraints on camera extrinsics. Crucially, our method requires no initial pose estimates, external Structure-from-Motion (SfM), or pose priors. On RealEstate10K and ACID benchmarks, it significantly outperforms both pose-initialized and pose-free state-of-the-art methods, achieving superior 3D reconstruction accuracy and photorealistic novel view rendering quality.
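The coupling described above hinges on a standard geometric operation: projecting each 3D Gaussian center through a view's camera parameters to obtain a 2D point in that image. Below is a minimal numpy sketch of that projection step; the function name, toy intrinsics, and test point are illustrative assumptions, not details from the paper.

```python
import numpy as np

def project_centers(centers, K, R, t):
    """Project 3D Gaussian centers into a view to obtain 2D image points.

    centers: (N, 3) Gaussian means in world coordinates.
    K: (3, 3) intrinsics; R: (3, 3) rotation; t: (3,) translation.
    Returns (N, 2) pixel coordinates.
    """
    cam = centers @ R.T + t            # world -> camera coordinates
    proj = cam @ K.T                   # apply intrinsics
    return proj[:, :2] / proj[:, 2:3]  # perspective divide

# Toy check: a point on the optical axis lands at the principal point.
K = np.array([[500.0, 0.0, 320.0],
              [0.0, 500.0, 240.0],
              [0.0, 0.0, 1.0]])
R, t = np.eye(3), np.zeros(3)
pts = project_centers(np.array([[0.0, 0.0, 2.0]]), K, R, t)
# pts -> [[320., 240.]]
```

Because this projection is differentiable in both the Gaussian centers and the camera parameters, gradients from a rendering loss can flow back into either set of queries.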
Abstract
In this work, we introduce Coca-Splat, a novel approach to sparse-view, pose-free scene reconstruction and novel view synthesis (NVS) that jointly optimizes camera parameters with 3D Gaussians. Inspired by the deformable DEtection TRansformer, we design separate queries for 3D Gaussians and camera parameters and update them layer by layer through deformable Transformer layers, enabling joint optimization in a single network. This design performs better because rendering views that closely approximate ground-truth images relies on precise estimation of both 3D Gaussians and camera parameters. In this design, the centers of the 3D Gaussians are projected onto each view by the camera parameters, and the projected points serve as 2D reference points in deformable cross-attention. Through camera-aware multi-view deformable cross-attention (CaMDFA), 3D Gaussians and camera parameters are intrinsically connected by sharing these 2D reference points. Additionally, rays determined by the 2D reference points (RayRef), defined from the camera centers to the reference points, further constrain the relationship between 3D Gaussians and camera parameters via RQ decomposition of an overdetermined system of equations derived from the rays. Extensive evaluation shows that our approach outperforms previous methods, both pose-required and pose-free, on RealEstate10K and ACID within the same pose-free setting.
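The RQ decomposition mentioned above is the classical factorization of a projection matrix P = K[R | t] into upper-triangular intrinsics K and a rotation R. The sketch below illustrates that factorization with `scipy.linalg.rq`; it shows the standard decomposition only, not the paper's specific RayRef-derived system of equations, and the function name and sign-fixing convention are our own assumptions.

```python
import numpy as np
from scipy.linalg import rq

def decompose_projection(P):
    """Factor a 3x4 projection matrix P = K [R | t] via RQ decomposition.

    Returns intrinsics K (upper triangular, positive diagonal, K[2,2] = 1),
    rotation R, and translation t.
    """
    M = P[:, :3]
    K, R = rq(M)                        # M = K @ R, K upper triangular
    # Fix signs so K has a positive diagonal (R remains a rotation).
    S = np.diag(np.sign(np.diag(K)))
    K, R = K @ S, S @ R
    t = np.linalg.solve(K, P[:, 3])     # P[:, 3] = K @ t
    return K / K[2, 2], R, t
```

Constraining the estimated extrinsics through such a factorization keeps the predicted rotation on the orthogonal manifold, which is one way a decomposition-based constraint can stabilize pose estimates during joint optimization.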