🤖 AI Summary
Aligning natural language phrases with the corresponding regions in stereo images remains challenging in multimodal semantic segmentation. Method: This paper extends phrase grounding to stereo vision, bringing depth-guided geometric cues to cross-modal alignment. We introduce PhraseStereo, the first open-vocabulary stereo image dataset for semantic segmentation. It is constructed by synthesizing right-view images from monocular PhraseCut images with GenStereo, and by aligning segmentation masks and phrase annotations across both views to ensure semantic consistency. Contribution/Results: PhraseStereo, to be released publicly upon acceptance of the paper, establishes a benchmark for jointly modeling linguistic understanding, 2D visual perception, and 3D geometric reasoning. It enables systematic evaluation of models that combine fine-grained semantic comprehension with spatial-geometric awareness, supporting the development of geometrically grounded multimodal segmentation systems.
📝 Abstract
Understanding how natural language phrases correspond to specific regions in images is a key challenge in multimodal semantic segmentation. Recent advances in phrase grounding are largely limited to single-view images, neglecting the rich geometric cues available in stereo vision. To address this gap, we introduce PhraseStereo, the first dataset that brings phrase-region segmentation to stereo image pairs. PhraseStereo builds upon the PhraseCut dataset by using GenStereo to generate accurate right-view images from existing single-view data, extending phrase grounding into the stereo domain. This new setting introduces unique challenges and opportunities for multimodal learning, particularly in using depth cues for more precise and context-aware grounding. By providing stereo image pairs with aligned segmentation masks and phrase annotations, PhraseStereo lays the foundation for future research at the intersection of language, vision, and 3D perception, encouraging the development of models that can reason jointly over semantics and geometry. The PhraseStereo dataset will be released online upon acceptance of this work.
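To make concrete what aligning a segmentation mask across the two views can involve, here is a minimal Python sketch that forward-warps a binary left-view mask into the right view. It is an illustration under stated assumptions, not the PhraseStereo or GenStereo pipeline: it assumes a rectified stereo pair and a per-pixel disparity map for the left image, and the function name `warp_mask_to_right` is hypothetical.

```python
import numpy as np


def warp_mask_to_right(mask_left: np.ndarray, disparity: np.ndarray) -> np.ndarray:
    """Forward-warp a binary left-view mask into the right view.

    Assumption: the pair is rectified, so a left pixel at column x appears
    at column x - d in the right image, where d >= 0 is its disparity in
    pixels. Occlusion by other objects and hole filling are not handled.
    """
    if mask_left.shape != disparity.shape:
        raise ValueError("mask and disparity must have the same shape")

    height, width = mask_left.shape
    mask_right = np.zeros((height, width), dtype=bool)

    # Coordinates of all foreground pixels in the left mask.
    ys, xs = np.nonzero(mask_left)

    # Shift each foreground pixel left by its rounded disparity.
    xr = xs - np.rint(disparity[ys, xs]).astype(int)

    # Drop pixels that fall outside the right image.
    valid = (xr >= 0) & (xr < width)
    mask_right[ys[valid], xr[valid]] = True
    return mask_right


if __name__ == "__main__":
    # Toy example: a 2-pixel region shifted by a constant disparity of 2.
    mask = np.zeros((4, 6), dtype=bool)
    mask[1, 3:5] = True
    disp = np.full((4, 6), 2.0)
    print(warp_mask_to_right(mask, disp).astype(int))
```

A forward warp of this kind can leave small holes where neighbouring pixels have different disparities, and it ignores pixels hidden by nearer objects in the right view, so a practical pipeline would need additional handling for both cases.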