Multi-task Visual Grounding with Coarse-to-Fine Consistency Constraints

📅 2025-01-12
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address semantic inconsistency between referring expression comprehension (REC) and referring image segmentation (RIS) in multi-task visual grounding, this paper proposes a coarse-to-fine two-stage framework. In the first stage, a multimodal Transformer generates semantically aware coarse predictions. In the second stage, a Mask-guided Interaction Module (MIM) unifies the REC and RIS representations, while an explicit bidirectional consistency loss enforces semantic alignment among the text, the bounding box, and the pixel-level mask. The method integrates a pre-trained vision-language model, a query decoder, and a pixel decoder into a single end-to-end trainable architecture. Evaluated on the RefCOCO, RefCOCO+, and RefCOCOg benchmarks, the approach achieves state-of-the-art performance on both REC and RIS, demonstrating the effectiveness and generalizability of the proposed consistency modeling mechanism.

📝 Abstract
Multi-task visual grounding involves the simultaneous execution of localization and segmentation in images based on textual expressions. The majority of advanced methods predominantly focus on transformer-based multimodal fusion, aiming to extract robust multimodal representations. However, ambiguity between referring expression comprehension (REC) and referring image segmentation (RIS) is error-prone, leading to inconsistencies between multi-task predictions. Besides, insufficient multimodal understanding directly contributes to biased target perception. To overcome these challenges, we propose a Coarse-to-fine Consistency Constraints Visual Grounding architecture (C³VG), which integrates implicit and explicit modeling approaches within a two-stage framework. Initially, query and pixel decoders are employed to generate preliminary detection and segmentation outputs, a process referred to as the Rough Semantic Perception (RSP) stage. These coarse predictions are subsequently refined through the proposed Mask-guided Interaction Module (MIM) and a novel explicit bidirectional consistency constraint loss to ensure consistent representations across tasks, which we term the Refined Consistency Interaction (RCI) stage. Furthermore, to address the challenge of insufficient multimodal understanding, we leverage pre-trained models based on visual-linguistic fusion representations. Empirical evaluations on the RefCOCO, RefCOCO+, and RefCOCOg datasets demonstrate the efficacy and soundness of C³VG, which significantly outperforms state-of-the-art REC and RIS methods by a substantial margin. Code and model will be available at https://github.com/Dmmm1997/C3VG.
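The listing does not spell out the form of the bidirectional consistency constraint between the box and mask predictions. Below is a minimal, hypothetical sketch of one plausible box–mask consistency term in PyTorch: the box-to-mask direction penalizes predicted mask mass falling outside the predicted box, and the mask-to-box direction penalizes the gap between the predicted box and the mask's tight enclosing box. All function names and the exact loss form here are assumptions for illustration, not the paper's actual loss.

```python
import torch

def box_to_mask(box: torch.Tensor, size: tuple) -> torch.Tensor:
    """Rasterize a normalized [x1, y1, x2, y2] box into a binary (H, W) mask."""
    H, W = size
    region = torch.zeros(H, W)
    x1, y1, x2, y2 = (box * torch.tensor([W, H, W, H])).long()
    region[y1:y2, x1:x2] = 1.0
    return region

def mask_to_box(mask: torch.Tensor) -> torch.Tensor:
    """Tightest normalized box enclosing the thresholded mask."""
    H, W = mask.shape
    ys, xs = torch.nonzero(mask > 0.5, as_tuple=True)
    if ys.numel() == 0:  # empty mask: degenerate box
        return torch.zeros(4)
    return torch.stack([xs.min() / W, ys.min() / H,
                        (xs.max() + 1) / W, (ys.max() + 1) / H])

def bidirectional_consistency(pred_box: torch.Tensor,
                              pred_mask: torch.Tensor) -> torch.Tensor:
    """Hypothetical two-way box/mask agreement term (not the paper's loss)."""
    # Box -> mask: fraction of predicted mask mass outside the box region.
    box_region = box_to_mask(pred_box, pred_mask.shape)
    box2mask = (pred_mask * (1 - box_region)).sum() / (pred_mask.sum() + 1e-6)
    # Mask -> box: L1 gap between the predicted box and the mask's tight box.
    mask2box = (pred_box - mask_to_box(pred_mask)).abs().mean()
    return box2mask + mask2box
```

When box and mask agree, both directions vanish; a mask drifting outside the box, or a box not hugging the mask, raises the loss, which is the kind of cross-task signal the RCI stage's explicit constraint is described as providing.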
Problem

Research questions and friction points this paper is trying to address.

Multi-task Visual Grounding
Text-Description Consistency
Image Segmentation Accuracy
Innovation

Methods, ideas, or system contributions that make the work stand out.

C³VG Architecture
Mask Guidance
Bidirectional Consistency Constraint
Ming Dai
Southeast University
MLLM · Visual Grounding · Image Retrieval
Jian Li
Youtu Lab, Tencent, China
Jiedong Zhuang
College of Information Science and Electronic Engineering, Zhejiang University, China
Xian Zhang
School of Automation, Southeast University, China
Wankou Yang
School of Automation, Southeast University, China; Advanced Ocean Institute of Southeast University, Nantong, China