IGGT: Instance-Grounded Geometry Transformer for Semantic 3D Reconstruction

📅 2025-10-26
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing 3D scene analysis methods typically decouple geometric reconstruction from semantic understanding, limiting generalization. This paper proposes the first end-to-end unified Transformer framework that takes only 2D images as input and jointly produces geometrically consistent 3D scenes and instance-level semantic segmentation. Key contributions include: (1) a 3D-consistent contrastive learning strategy with instance-anchored clustering, which learns a joint representation of geometric structure and object instances from only 2D visual inputs; (2) InsScene-15K, a large-scale, high-consistency dataset with dense instance annotations built via a novel data curation pipeline; and (3) a depth-mask alignment mechanism that keeps semantics coherent across the 2D-to-3D mapping. Experiments demonstrate significant improvements over state-of-the-art methods in both instance separation accuracy and geometric fidelity, substantially enhancing generalization and adaptability for downstream 3D understanding tasks.

📝 Abstract
Humans naturally perceive the geometric structure and semantic content of a 3D world as intertwined dimensions, enabling coherent and accurate understanding of complex scenes. However, most prior approaches prioritize training large geometry models for low-level 3D reconstruction and treat high-level spatial understanding in isolation, overlooking the crucial interplay between these two fundamental aspects of 3D-scene analysis, thereby limiting generalization and leading to poor performance in downstream 3D understanding tasks. Recent attempts have mitigated this issue by simply aligning 3D models with specific language models, thus restricting perception to the aligned model's capacity and limiting adaptability to downstream tasks. In this paper, we propose Instance-Grounded Geometry Transformer (IGGT), an end-to-end large unified transformer to unify the knowledge for both spatial reconstruction and instance-level contextual understanding. Specifically, we design a 3D-Consistent Contrastive Learning strategy that guides IGGT to encode a unified representation with geometric structures and instance-grounded clustering through only 2D visual inputs. This representation supports consistent lifting of 2D visual inputs into a coherent 3D scene with explicitly distinct object instances. To facilitate this task, we further construct InsScene-15K, a large-scale dataset with high-quality RGB images, poses, depth maps, and 3D-consistent instance-level mask annotations with a novel data curation pipeline.
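The "lifting of 2D visual inputs into a coherent 3D scene" described above amounts, at its core, to standard pinhole back-projection: each pixel with known depth and camera pose maps to a world-space point that carries its instance label along. A minimal sketch of that lifting step (the helper name and signature are hypothetical illustrations, not the paper's implementation):

```python
import numpy as np

def lift_to_3d(depth, instance_mask, K, cam_to_world):
    """Unproject a per-pixel depth map into world-space 3D points,
    carrying each pixel's instance ID along.

    depth:         (H, W) metric depth map
    instance_mask: (H, W) integer instance IDs per pixel
    K:             (3, 3) camera intrinsics
    cam_to_world:  (4, 4) camera pose

    Returns (N, 3) world points and (N,) instance IDs for valid pixels.
    """
    H, W = depth.shape
    u, v = np.meshgrid(np.arange(W), np.arange(H))
    valid = depth > 0  # skip pixels with no depth
    z = depth[valid]
    # Pinhole back-projection: x = (u - cx) * z / fx, y = (v - cy) * z / fy
    x = (u[valid] - K[0, 2]) * z / K[0, 0]
    y = (v[valid] - K[1, 2]) * z / K[1, 1]
    pts_cam = np.stack([x, y, z, np.ones_like(z)], axis=1)  # homogeneous coords
    pts_world = (cam_to_world @ pts_cam.T).T[:, :3]
    return pts_world, instance_mask[valid]
```

Because the instance masks in InsScene-15K are 3D-consistent across views, points lifted from different images with the same instance ID land on the same physical object, which is what makes cross-view instance grounding possible.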
Problem

Research questions and friction points this paper is trying to address.

Geometric reconstruction and semantic understanding are typically trained in isolation, missing their interplay
Decoupled pipelines generalize poorly to downstream 3D understanding tasks
Consistent 3D instance segmentation must be recovered from 2D visual inputs only
Innovation

Methods, ideas, or system contributions that make the work stand out.

Unified transformer for geometry and semantics
3D-consistent contrastive learning from 2D inputs
Instance-grounded clustering for object separation
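The 3D-consistent contrastive objective can be illustrated with a supervised InfoNCE-style loss over pixel embeddings sampled across views, where pixels sharing a 3D-consistent instance ID are pulled together and pixels from different instances are pushed apart. This is a sketch of the general technique only; the function name, temperature value, and exact loss form are assumptions, not the paper's implementation:

```python
import numpy as np

def instance_contrastive_loss(features, instance_ids, temperature=0.1):
    """Supervised InfoNCE-style loss: embeddings of pixels from the same
    instance (possibly in different views) are positives; all others are
    negatives.

    features:     (N, D) L2-normalized pixel embeddings
    instance_ids: (N,) 3D-consistent instance labels
    """
    sim = features @ features.T / temperature  # scaled cosine similarities
    sim -= sim.max(axis=1, keepdims=True)      # numerical stability
    exp_sim = np.exp(sim)
    np.fill_diagonal(exp_sim, 0.0)             # exclude self-pairs
    pos_mask = instance_ids[:, None] == instance_ids[None, :]
    np.fill_diagonal(pos_mask, False)
    denom = exp_sim.sum(axis=1)
    losses = []
    for i in range(len(features)):
        pos = exp_sim[i][pos_mask[i]]
        if pos.size == 0:
            continue  # anchor has no positive pair in this batch
        # average -log(p / denom) over this anchor's positives
        losses.append(-np.log(pos / denom[i]).mean())
    return float(np.mean(losses))
```

Under this objective, embeddings that cluster by instance yield a low loss, while embeddings that mix instances yield a high one, so a simple clustering of the learned features separates objects in the lifted 3D scene.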