🤖 AI Summary
This paper introduces the novel “Map and Locate” task, unifying open-vocabulary instance segmentation with monocular video-based 3D reconstruction for the first time: generating a dense, semantically annotated point-cloud map from uncalibrated videos and enabling natural-language-guided object localization and segmentation. Our method extends MASt3R with a lightweight semantic distillation module, enabling end-to-end joint learning of pixel-level semantic features and geometrically consistent 3D point clouds, without freezing the pretrained weights, and producing multimodally consistent outputs in a single forward pass. Key innovations include dense CLIP/DINOv2 feature transfer, differentiable optimization of the 3D reconstruction backbone, and explicit cross-modal consistency modeling. Evaluated on our newly established Map and Locate benchmark, our approach significantly outperforms the baseline MASt3R+CLIP pipeline while simultaneously improving both 2D semantic segmentation and 3D reconstruction accuracy, demonstrating the effectiveness and generalizability of unified semantic-geometric representation learning.
📝 Abstract
We introduce a new task, Map and Locate, which unifies the traditionally distinct objectives of open-vocabulary segmentation (detecting and segmenting object instances based on natural language queries) and 3D reconstruction (estimating a scene's 3D structure from visual inputs). Specifically, Map and Locate involves generating a point cloud from an unposed video and segmenting object instances based on open-vocabulary queries. The task is a critical step toward real-world embodied AI applications, bridging reconstruction, recognition, and reorganization in a single practical setting. To tackle it, we introduce a simple yet effective baseline, which we denote SAB3R. Our approach builds upon MASt3R, a recent breakthrough in 3D computer vision, and incorporates a lightweight distillation strategy that transfers dense, per-pixel semantic features from 2D vision backbones (e.g., CLIP and DINOv2) to enhance MASt3R's capabilities. Without introducing any auxiliary frozen networks, our model generates per-pixel semantic features and constructs cohesive point maps in a single forward pass. Compared to deploying MASt3R and CLIP separately, our unified model, SAB3R, achieves superior performance on the Map and Locate benchmark. We further evaluate SAB3R on both 2D semantic segmentation and 3D tasks to comprehensively validate its effectiveness.
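The abstract does not spell out the distillation objective, but a common choice for transferring dense per-pixel features from a frozen 2D teacher (e.g., CLIP or DINOv2) to a student head is a per-pixel cosine-similarity loss. The sketch below is a minimal NumPy illustration of that idea; the function name, shapes, and loss form are assumptions for illustration, not the paper's actual implementation:

```python
import numpy as np

def distill_loss(student, teacher, eps=1e-8):
    """Hypothetical per-pixel cosine-similarity distillation loss.

    student: (H, W, C) semantic features predicted by the 3D backbone's head.
    teacher: (H, W, C) dense features from a frozen 2D backbone (assumed teacher).
    Returns the mean of (1 - cosine similarity) over all pixels,
    ranging from 0 (identical directions) to 2 (opposite directions).
    """
    # L2-normalize each pixel's feature vector along the channel axis
    s = student / (np.linalg.norm(student, axis=-1, keepdims=True) + eps)
    t = teacher / (np.linalg.norm(teacher, axis=-1, keepdims=True) + eps)
    cos = np.sum(s * t, axis=-1)          # (H, W) per-pixel cosine similarity
    return float(np.mean(1.0 - cos))
```

In a training loop, this loss would be added to the geometric (point-map) losses so the backbone learns both objectives jointly; only the direction of each feature matters here, which is why the vectors are normalized before comparison.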