SIU3R: Simultaneous Scene Understanding and 3D Reconstruction Beyond Feature Alignment

📅 2025-07-03
📈 Citations: 0
✨ Influential: 0
🤖 AI Summary
Existing approaches to simultaneous scene understanding and 3D reconstruction, a key capability for end-to-end embodied intelligent systems, predominantly rely on 2D-to-3D feature alignment, which limits 3D understanding capability and risks losing semantic information. To address this, we propose SIU3R, the first alignment-free framework for generalizable simultaneous understanding and 3D reconstruction from unposed images. Instead of aligning with 2D models, it adopts a pixel-aligned 3D representation and a set of unified learnable queries to natively support 3D understanding. Building on this shared representation, two lightweight modules let the reconstruction and understanding tasks benefit each other. Extensive experiments demonstrate state-of-the-art performance both on the individual tasks (3D reconstruction and understanding) and on their joint execution, supporting the alignment-free design and the mutual-benefit modules.

šŸ“ Abstract
Simultaneous understanding and 3D reconstruction plays an important role in developing end-to-end embodied intelligent systems. To achieve this, recent approaches resort to a 2D-to-3D feature alignment paradigm, which leads to limited 3D understanding capability and potential semantic information loss. In light of this, we propose SIU3R, the first alignment-free framework for generalizable simultaneous understanding and 3D reconstruction from unposed images. Specifically, SIU3R bridges the reconstruction and understanding tasks via a pixel-aligned 3D representation and unifies multiple understanding tasks into a set of unified learnable queries, enabling native 3D understanding without the need for alignment with 2D models. To encourage collaboration between the two tasks through the shared representation, we further conduct in-depth analyses of their mutual benefits and propose two lightweight modules to facilitate their interaction. Extensive experiments demonstrate that our method achieves state-of-the-art performance not only on the individual tasks of 3D reconstruction and understanding, but also on the task of simultaneous understanding and 3D reconstruction, highlighting the advantages of our alignment-free framework and the effectiveness of the mutual-benefit designs.
Problem

Research questions and friction points this paper is trying to address.

Eliminates 2D-to-3D feature alignment limitations
Unifies scene understanding and 3D reconstruction tasks
Enables native 3D understanding without 2D model alignment
Innovation

Methods, ideas, or system contributions that make the work stand out.

Alignment-free framework for 3D understanding
Pixel-aligned 3D representation bridges tasks
Unified learnable queries for multiple tasks
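As a rough illustration of how learnable queries and a pixel-aligned 3D representation can fit together, the toy sketch below scores each pixel feature against a small set of query vectors and gives the winning query index to the 3D point tied to that pixel. This is not the SIU3R implementation: the function names, the dot-product scoring rule, and all values are assumptions chosen for illustration only.

```python
# Hypothetical sketch: names, shapes, scoring rule, and values are
# illustrative assumptions, not taken from the SIU3R paper or code.
from typing import List, Tuple

Vector = List[float]
Point3D = Tuple[float, float, float]


def dot(a: Vector, b: Vector) -> float:
    """Plain dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def assign_queries(queries: List[Vector], pixel_features: List[Vector]) -> List[int]:
    """For each pixel, return the index of the query with the highest score."""
    labels = []
    for feat in pixel_features:
        scores = [dot(q, feat) for q in queries]
        labels.append(max(range(len(scores)), key=scores.__getitem__))
    return labels


def label_points(points: List[Point3D], labels: List[int]) -> List[Tuple[Point3D, int]]:
    """Pair each pixel-aligned 3D point with the label of its source pixel."""
    if len(points) != len(labels):
        raise ValueError("expected exactly one 3D point per pixel")
    return list(zip(points, labels))


# Toy input: 2 queries, 3 pixels, one 3D point per pixel.
queries = [[1.0, 0.0], [0.0, 1.0]]
pixel_features = [[0.9, 0.1], [0.2, 0.8], [0.7, 0.3]]
points = [(0.0, 0.0, 1.0), (0.1, 0.0, 1.2), (0.0, 0.1, 0.9)]

labels = assign_queries(queries, pixel_features)
labeled_points = label_points(points, labels)
print(labeled_points)
```

Because every 3D point corresponds to exactly one pixel in this toy setup, the per-pixel decision carries over to 3D without a separate 2D-to-3D feature alignment step, which is the intuition the list above points at.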
Qi Xu
Westlake University, Wuhan University
Dongxu Wei
Westlake University, Westlake Institute for Advanced Study
Lingzhe Zhao
PhD student, Westlake University
Computer vision, 3D vision
Wenpu Li
Westlake University
Zhangchi Huang
Westlake University, Zhejiang University
Shunping Ji
Wuhan University
Peidong Liu
Westlake University
3D computer vision, Robotics