Referencing Where to Focus: Improving Visual Grounding with Referential Query

📅 2024-12-26
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
In referring expression grounding, misalignment between linguistic expressions and visual targets arises primarily from decoder initialization lacking semantic guidance and ineffective utilization of multi-level image features. This paper proposes RefFormer to address these challenges. First, it introduces a plug-and-play CLIP-driven query adaptation module that generates semantically guided reference queries, enhancing decoder focus on target regions. Second, it pioneers multi-level image feature fusion within the DETR architecture while keeping the CLIP backbone frozen—thereby preserving strong visual representation capability while enabling lightweight adaptation. Crucially, RefFormer avoids fine-tuning the visual backbone, significantly improving training efficiency. Extensive experiments demonstrate state-of-the-art performance across five standard benchmarks—including RefCOCO and RefCOCO+—with consistent improvements in both localization accuracy and convergence speed.

📝 Abstract
Visual Grounding aims to localize the referring object in an image given a natural language expression. Recent advancements in DETR-based visual grounding methods have attracted considerable attention, as they directly predict the coordinates of the target object without relying on additional efforts, such as pre-generated proposal candidates or pre-defined anchor boxes. However, existing research primarily focuses on designing stronger multi-modal decoders, which typically generate learnable queries by random initialization or by using linguistic embeddings. This vanilla query generation approach inevitably increases the learning difficulty for the model, as it does not involve any target-related information at the beginning of decoding. Furthermore, these methods only use the deepest image feature during the query learning process, overlooking the importance of features from other levels. To address these issues, we propose a novel approach, called RefFormer. It consists of a query adaptation module that can be seamlessly integrated into CLIP to generate the referential query, providing prior context for the decoder, along with a task-specific decoder. By incorporating the referential query into the decoder, we can effectively mitigate the learning difficulty of the decoder and accurately concentrate on the target object. Additionally, our proposed query adaptation module can also act as an adapter, preserving the rich knowledge within CLIP without the need to tune the parameters of the backbone network. Extensive experiments demonstrate the effectiveness and efficiency of our proposed method, outperforming state-of-the-art approaches on five visual grounding benchmarks.
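The core idea in the abstract can be illustrated with a minimal sketch: instead of initializing the decoder query randomly or from the text embedding alone, attend the linguistic embedding over the (frozen) image patch features and pool them into a target-aware "referential query". This is a hypothetical simplification for intuition only, not the authors' implementation; the function name and the single-head dot-product attention are assumptions.

```python
import numpy as np

def referential_query(image_feats, text_embed, tau=1.0):
    """Sketch of the query-adaptation idea: score every image patch
    feature against the expression embedding, softmax the scores, and
    attention-pool the patches into one query vector that carries
    target-related context into the decoder.

    image_feats: (num_patches, dim) frozen CLIP patch features
    text_embed:  (dim,) linguistic embedding of the expression
    tau:         softmax temperature (assumed hyperparameter)
    """
    # similarity between the expression and every patch
    logits = image_feats @ text_embed / tau          # (num_patches,)
    logits = logits - logits.max()                   # numerical stability
    attn = np.exp(logits) / np.exp(logits).sum()     # softmax weights
    # referential query: convex combination of patch features,
    # biased toward patches that match the expression
    return attn @ image_feats                        # (dim,)
```

Because the query is a convex combination of patch features weighted by text-image similarity, it already points at the likely target region before decoding starts, which is the "prior context" the abstract refers to.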
Problem

Research questions and friction points this paper is trying to address.

Visual Grounding
Model Learning Efficiency
Multi-level Image Features
Innovation

Methods, ideas, or system contributions that make the work stand out.

RefFormer
CLIP Integration
Task-Optimized Decoder
Yabing Wang
Xi’an Jiaotong University
multimodal learning
Zhuotao Tian
Professor, Harbin Institute of Technology (Shenzhen)
Vision-language Model, Multi-modal Perception, Computer Vision
Qingpei Guo
Ant Group
Multimodal LLMs, Vision-Language Models
Zheng Qin
National Key Laboratory of Human-Machine Hybrid Augmented Intelligence, National Engineering Research Center for Visual Information and Applications, and Institute of Artificial Intelligence and Robotics, Xi’an Jiaotong University
Sanping Zhou
Xi'an Jiaotong University
Computer Vision, Machine Learning
Ming Yang
Ant Group
Le Wang
National Key Laboratory of Human-Machine Hybrid Augmented Intelligence, National Engineering Research Center for Visual Information and Applications, and Institute of Artificial Intelligence and Robotics, Xi’an Jiaotong University