🤖 AI Summary
Existing feed-forward 3D reconstruction methods often show weak multi-view consistency because they rely on deep decoder stacks with no structured mechanism for geometric refinement. Inspired by bundle adjustment (BA), this work proposes BA-T, an iterative Transformer that repeatedly applies a single lightweight layer in an implicit token space, progressively refining camera poses and geometry through latent residual corrections. By embedding BA’s iterative optimization mechanism into the Transformer framework, BA-T replaces deep attention stacks with a reusable lightweight layer and uses cross-view attention for efficient geometric reasoning. Experiments show that BA-T improves reconstruction accuracy and cross-view consistency while using only 16% of the decoder parameters of conventional models, matching or even surpassing substantially larger architectures.
📝 Abstract
Feed-forward models for 3D reconstruction have achieved strong performance by using deep cross-view attention to exchange information across images. However, these approaches often depend on heavy decoder stacks and lack a structured mechanism for geometry refinement, which results in poor multi-view consistency. We address this by drawing on classical bundle adjustment (BA), which can be viewed as an iterative process that propagates information between poses and local geometry. We propose BA-T, an iterative Transformer that implements BA-style structured updates as a repeatable layer in implicit token space. Instead of relying on deep attention stacks, BA-T refines its predictions with a single lightweight layer that is applied repeatedly and operates on latent residuals. Experiments demonstrate that BA-T progressively improves pose and reconstruction accuracy across iterations, achieves stronger cross-view consistency than conventional decoders, and matches or surpasses substantially larger models while using only 16% of their decoder parameters. BA-T provides a compact, efficient, and structured alternative to depth-heavy attention, enabling accurate 3D reconstruction within a lightweight architecture. The code will be made publicly available at https://github.com/zhangganlin/BA-T.
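The idea of reusing one layer across iterations, rather than stacking many distinct layers, can be illustrated with a small sketch. This is not the BA-T implementation: the single-head attention layer, the names `make_layer`, `shared_update`, `refine`, and `count_params`, and the random untrained weights are all assumptions chosen for illustration. It only shows that a weight-shared residual update can be applied for any number of iterations without adding parameters.

```python
import math
import random


def matvec(W, x):
    """Multiply matrix W (list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def make_layer(dim, seed=0, scale=0.1):
    """Hypothetical single-head attention layer with random (untrained) weights."""
    rng = random.Random(seed)

    def mat():
        return [[rng.gauss(0.0, scale) for _ in range(dim)] for _ in range(dim)]

    return {"Wq": mat(), "Wk": mat(), "Wv": mat(), "Wo": mat()}


def shared_update(layer, tokens):
    """One residual correction: every token attends to all tokens (all views)."""
    dim = len(tokens[0])
    q = [matvec(layer["Wq"], t) for t in tokens]
    k = [matvec(layer["Wk"], t) for t in tokens]
    v = [matvec(layer["Wv"], t) for t in tokens]
    updated = []
    for i, t in enumerate(tokens):
        scores = [sum(a * b for a, b in zip(q[i], kj)) / math.sqrt(dim) for kj in k]
        weights = softmax(scores)
        mixed = [sum(weights[j] * v[j][d] for j in range(len(tokens)))
                 for d in range(dim)]
        delta = matvec(layer["Wo"], mixed)  # latent residual correction
        updated.append([ti + di for ti, di in zip(t, delta)])
    return updated


def refine(layer, tokens, num_iters):
    """Apply the same layer num_iters times; the weights are shared across iterations."""
    for _ in range(num_iters):
        tokens = shared_update(layer, tokens)
    return tokens


def count_params(layer):
    """Total number of scalar weights in the layer."""
    return sum(len(W) * len(W[0]) for W in layer.values())


# Usage: three tokens of width 4, refined for five iterations with one layer.
layer = make_layer(dim=4, seed=1)
tokens = [[0.1 * i + 0.01 * d for d in range(4)] for i in range(3)]
refined = refine(layer, tokens, num_iters=5)
print(count_params(layer))  # the count does not depend on num_iters
```

In this sketch, the iteration count is a runtime choice rather than an architectural one. A conventional stack of N distinct layers of the same shape would hold N times as many weights.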