PhraseStereo: The First Open-Vocabulary Stereo Image Segmentation Dataset

📅 2025-10-01
📈 Citations: 0
Influential: 0
🤖 AI Summary
Problem: Aligning natural language phrases with the corresponding regions of stereo images remains challenging in multimodal semantic segmentation. Method: The paper extends phrase grounding to stereo vision so that depth-related geometric cues can support cross-modal alignment. It introduces PhraseStereo, described as the first open-vocabulary stereo image dataset for semantic segmentation, built by using GenStereo to generate right-view images from single-view PhraseCut images and by providing segmentation masks and phrase annotations aligned with the resulting stereo pairs. Contribution/Results: According to the abstract, PhraseStereo will be released online upon acceptance of the paper. It is intended as a foundation for models that reason jointly over language, 2D visual perception, and 3D geometry.

📝 Abstract
Understanding how natural language phrases correspond to specific regions in images is a key challenge in multimodal semantic segmentation. Recent advances in phrase grounding are largely limited to single-view images, neglecting the rich geometric cues available in stereo vision. To address this gap, we introduce PhraseStereo, the first dataset that brings phrase-region segmentation to stereo image pairs. PhraseStereo builds upon the PhraseCut dataset by leveraging GenStereo to generate accurate right-view images from existing single-view data, enabling the extension of phrase grounding into the stereo domain. This new setting introduces unique challenges and opportunities for multimodal learning, particularly in leveraging depth cues for more precise and context-aware grounding. By providing stereo image pairs with aligned segmentation masks and phrase annotations, PhraseStereo lays the foundation for future research at the intersection of language, vision, and 3D perception, encouraging the development of models that can reason jointly over semantics and geometry. The PhraseStereo dataset will be released online upon acceptance of this work.
Problem

Research questions and friction points this paper is trying to address.

Extends phrase grounding from single-view to stereo images
Leverages depth cues for precise multimodal semantic segmentation
Enables joint reasoning over language semantics and 3D geometry
Innovation

Methods, ideas, or system contributions that make the work stand out.

Generates stereo images from single-view data
Extends phrase grounding to stereo image pairs
Leverages depth cues for precise semantic segmentation
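
To make the idea of "aligned segmentation masks" across a stereo pair more concrete, the sketch below forward-warps a binary left-view mask into the right view using a per-pixel disparity map. It is only an illustration under stated assumptions: the pair is rectified, disparity is given in pixels, and a left pixel at column x lands at column x - d in the right image. The function name `warp_mask_to_right` and its inputs are hypothetical. The summary does not describe how PhraseStereo actually aligns its masks, and GenStereo itself (a generative model) is not reproduced here.

```python
from typing import List

Mask = List[List[int]]          # binary mask, 1 = pixel belongs to the phrase region
Disparity = List[List[float]]   # assumed per-pixel horizontal disparity in pixels


def warp_mask_to_right(left_mask: Mask, disparity: Disparity) -> Mask:
    """Forward-warp a left-view mask into the right view (illustrative only).

    Assumes a rectified stereo pair, so corresponding pixels share a row and
    a left pixel at column x maps to column x - disparity in the right view.
    """
    height = len(left_mask)
    width = len(left_mask[0]) if height else 0
    right_mask = [[0] * width for _ in range(height)]

    for y in range(height):
        for x_left in range(width):
            if not left_mask[y][x_left]:
                continue
            # Shift the pixel left by its disparity and snap to the pixel grid.
            x_right = int(round(x_left - disparity[y][x_left]))
            # Pixels that fall outside the right image are simply dropped.
            if 0 <= x_right < width:
                right_mask[y][x_right] = 1

    return right_mask


# Tiny example: a 2-pixel region shifted by a constant disparity of 2 pixels.
left = [[0, 0, 0, 0, 0, 0],
        [0, 0, 0, 1, 1, 0]]
disp = [[2.0] * 6 for _ in range(2)]
right = warp_mask_to_right(left, disp)
```

With a constant disparity the region moves as a block. With a real, spatially varying disparity map, forward warping can leave small holes and does not handle occlusions, so a practical pipeline would likely need hole filling or a more careful alignment step.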
Thomas Campagnolo
Centre Inria d’Université Côte d’Azur, France
Ezio Malis
Inria
computer vision, robotics
Philippe Martinet
Centre Inria d’Université Côte d’Azur, France
Gaetan Bahl
NXP Semiconductors, France