V²-SAM: Marrying SAM2 with Multi-Prompt Experts for Cross-View Object Correspondence

📅 2025-11-25
📈 Citations: 0
Influential: 0
🤖 AI Summary
Cross-view object correspondence (e.g., egocentric-to-exocentric) remains challenging due to large geometric and appearance discrepancies, rendering standard segmentation models—such as SAM2—ineffective without adaptation. To address this, we propose V²-SAM: a novel framework featuring dual prompt generators (V²-Anchor and V²-Visual) that enable the first coordinate-based cross-view prompting for SAM2. Our method introduces feature-structure dual-alignment matching, multi-prompt expert fusion, and post-hoc cyclic-consistency selection to adaptively identify optimal correspondences. Leveraging DINOv3 as the feature backbone, V²-SAM jointly enforces geometric constraints and semantic coherence. Evaluated on Ego-Exo4D, DAVIS-2017, and HANDAL-X, V²-SAM achieves state-of-the-art performance, significantly improving correspondence accuracy and robustness. This work establishes a new paradigm for multi-view understanding in robotics and embodied AI.

📝 Abstract
Cross-view object correspondence, exemplified by the representative task of ego-exo object correspondence, aims to establish consistent associations of the same object across different viewpoints (e.g., ego-centric and exo-centric). This task poses significant challenges due to drastic viewpoint and appearance variations, making existing segmentation models, such as SAM2, non-trivial to apply directly. To address this, we present V²-SAM, a unified cross-view object correspondence framework that adapts SAM2 from single-view segmentation to cross-view correspondence through two complementary prompt generators. Specifically, the Cross-View Anchor Prompt Generator (V²-Anchor), built upon DINOv3 features, establishes geometry-aware correspondences and, for the first time, unlocks coordinate-based prompting for SAM2 in cross-view scenarios, while the Cross-View Visual Prompt Generator (V²-Visual) enhances appearance-guided cues via a novel visual prompt matcher that aligns ego-exo representations from both feature and structural perspectives. To effectively exploit the strengths of both prompts, we further adopt a multi-expert design and introduce a Post-hoc Cyclic Consistency Selector (PCCS) that adaptively selects the most reliable expert based on cyclic consistency. Extensive experiments validate the effectiveness of V²-SAM, achieving new state-of-the-art performance on Ego-Exo4D (ego-exo object correspondence), DAVIS-2017 (video object tracking), and HANDAL-X (robotic-ready cross-view correspondence).
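The coordinate-based prompting idea behind V²-Anchor can be illustrated with a minimal sketch: pool the object's patch features in the source view, find the most similar patch in the target view by cosine similarity, and hand that coordinate to a promptable segmenter such as SAM2 as a point prompt. The function name and the use of plain mean-pooled cosine matching are illustrative assumptions, not the paper's actual matcher (which adds feature-structure dual alignment).

```python
import numpy as np

def anchor_point_from_features(src_feats, src_mask, tgt_feats):
    """Hypothetical sketch of geometry-aware anchor prompting.

    src_feats: (H, W, D) patch features of the source (ego) view
    src_mask:  (H, W) boolean object mask in the source view
    tgt_feats: (H, W, D) patch features of the target (exo) view
    Returns a (row, col) patch coordinate in the target view, usable
    as a point prompt for a promptable segmenter such as SAM2.
    """
    # Mean-pool the object's patch features in the source view.
    obj = src_feats[src_mask].mean(axis=0)
    obj = obj / np.linalg.norm(obj)

    # Cosine similarity against every target-view patch.
    flat = tgt_feats.reshape(-1, tgt_feats.shape[-1])
    flat = flat / np.linalg.norm(flat, axis=1, keepdims=True)
    sim = flat @ obj

    # The best-matching patch becomes the cross-view point prompt.
    return divmod(int(sim.argmax()), tgt_feats.shape[1])
```

In the full method the features come from DINOv3 and the matching additionally enforces structural consistency; this sketch only shows why a dense-feature match yields a coordinate prompt at all.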
Problem

Research questions and friction points this paper is trying to address.

Establishing object correspondence across different viewpoints with appearance variations
Adapting single-view segmentation models for cross-view correspondence tasks
Addressing geometry and appearance challenges in ego-exo object matching
Innovation

Methods, ideas, or system contributions that make the work stand out.

Adapts SAM2 with dual prompt generators
Uses geometry-aware and appearance-guided correspondence modules
Implements multi-expert selection via cyclic consistency
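The cyclic-consistency selection in the last point can be sketched as follows: each expert produces a candidate target-view mask, the mask is projected back to the source view, and the expert whose round trip best recovers the original source mask wins. The helper names and the use of IoU as the consistency score are illustrative assumptions; the paper's PCCS module may score cycles differently.

```python
import numpy as np

def iou(a, b):
    """Intersection-over-union of two boolean masks."""
    inter = np.logical_and(a, b).sum()
    union = np.logical_or(a, b).sum()
    return inter / union if union else 0.0

def select_expert(src_mask, expert_masks, back_project):
    """Pick the expert whose prediction survives the round trip best.

    src_mask:     (H, W) boolean object mask in the source view.
    expert_masks: list of candidate target-view masks, one per expert.
    back_project: callable mapping a target-view mask back to the
                  source view (a stand-in for re-running the
                  correspondence pipeline in the reverse direction).
    Returns the index of the selected expert.
    """
    scores = [iou(src_mask, back_project(m)) for m in expert_masks]
    return int(np.argmax(scores))
```

Because the score is computed after all experts have run, the selection is post hoc and needs no extra training, which is the appeal of this design.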