A Simple and Better Baseline for Visual Grounding

📅 2025-10-12
📈 Citations: 0
Influential: 0
🤖 AI Summary
Visual grounding aims to localize target objects in an image given a natural language description. Existing approaches typically rely on iterative language-vision interaction, which requires repeated caching of multimodal features and leads to high computational overhead and slow inference. This paper proposes an efficient end-to-end baseline that eliminates iteration by introducing a **language-guided parallel multimodal interaction mechanism**, where language serves as a concurrent steering signal for visual feature processing. Additionally, we design a **similarity-driven visual feature selection module**, which dynamically identifies and retains only language-relevant image regions, substantially reducing redundant computation. By fusing multimodal features under parallel linguistic guidance, our method achieves fast and accurate localization. Evaluated on mainstream benchmarks, including RefCOCO, RefCOCO+, and RefCOCOg, our approach achieves superior grounding accuracy while significantly accelerating inference, attaining a favorable trade-off between accuracy and efficiency.

📝 Abstract
Visual grounding aims to predict the locations of target objects specified by textual descriptions. For this task involving linguistic and visual modalities, a recent line of research focuses on selecting only the language-relevant visual regions for object localization to reduce computational overhead. Albeit achieving impressive performance, such methods are applied iteratively across different image scales, and at every iteration the linguistic and visual features must be stored in a cache, incurring extra overhead. To simplify the implementation, in this paper we propose a feature selection-based, simple yet effective baseline for visual grounding, called FSVG. Specifically, we directly encapsulate the linguistic and visual modalities into an overall network architecture without complicated iterative procedures, and use the language in parallel as guidance to facilitate the interaction between the linguistic and visual modalities for extracting effective visual features. Furthermore, to reduce the computational cost during visual feature learning, we introduce a similarity-based feature selection mechanism that exploits only language-related visual features for faster prediction. Extensive experiments conducted on several benchmark datasets comprehensively substantiate that the proposed FSVG achieves a better balance between accuracy and efficiency than current state-of-the-art methods. Code is available at https://github.com/jcwang0602/FSVG.
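The similarity-based feature selection described in the abstract could be sketched as follows. This is a minimal illustration, not the paper's exact formulation: the function name, the `keep_ratio` hyperparameter, and the choice of cosine similarity against a pooled sentence embedding are all assumptions made for the sake of the example.

```python
import numpy as np

def select_language_relevant_tokens(visual_tokens, lang_embedding, keep_ratio=0.5):
    """Retain only the visual tokens most similar to the language embedding.

    visual_tokens:  (N, D) array of visual token features.
    lang_embedding: (D,) pooled sentence embedding.
    keep_ratio:     fraction of tokens to keep (hypothetical hyperparameter).
    Returns the selected tokens and their original (sorted) indices.
    """
    # Cosine similarity between each visual token and the sentence embedding.
    v = visual_tokens / np.linalg.norm(visual_tokens, axis=1, keepdims=True)
    l = lang_embedding / np.linalg.norm(lang_embedding)
    sims = v @ l                               # (N,) similarity scores

    # Keep the top-k most language-relevant tokens, in their original order,
    # so later layers operate on a much smaller token set.
    k = max(1, int(round(len(visual_tokens) * keep_ratio)))
    idx = np.sort(np.argsort(sims)[-k:])
    return visual_tokens[idx], idx
```

Discarding low-similarity tokens early is what would shrink the workload of all subsequent layers, which is the source of the speedup the abstract claims.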
Problem

Research questions and friction points this paper is trying to address.

Reducing computational overhead in visual grounding tasks
Eliminating iterative procedures for multimodal feature interaction
Selecting language-relevant visual features for faster prediction
Innovation

Methods, ideas, or system contributions that make the work stand out.

Direct multimodal integration without iterative procedures
Similarity-based feature selection for computational efficiency
Language-guided parallel interaction between linguistic and visual modalities
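The language-guided parallel interaction listed above could be illustrated with a single-pass cross-attention sketch, in which every visual token attends to the language tokens once rather than over repeated iterations. The identity projections and residual fusion here are simplifying assumptions, not the paper's actual architecture.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def language_guided_fusion(visual_tokens, lang_tokens):
    """Single-pass cross-attention: visual tokens attend to language tokens
    once, so language guides visual feature extraction in parallel instead
    of through an iterative loop with cached features.
    Learned query/key/value projections are omitted (identity assumed)."""
    d = lang_tokens.shape[1]
    scores = visual_tokens @ lang_tokens.T / np.sqrt(d)   # (N, M) attention logits
    attn = softmax(scores, axis=-1)                       # each row sums to 1
    guided = attn @ lang_tokens                           # language context per visual token
    return visual_tokens + guided                         # residual fusion
```

Because the whole interaction is one matrix product plus a softmax, nothing needs to be cached between steps, which is the overhead the iterative baselines incur.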
Jingchao Wang
East China Normal University

Wenlong Zhang
OpenScience Lab, Shanghai AI Laboratory, Shanghai, China

Dingjiang Huang
School of Data Science and Engineering, East China Normal University, Shanghai, China

Hong Wang
School of Life Science and Technology, Xi’an Jiaotong University, Xi’an, China

Yefeng Zheng
Professor, Westlake University, Hangzhou, China; IEEE Fellow; AIMBE Fellow
Research interests: AI in Health, Medical Imaging, Computer Vision, Natural Language Processing, Large Language Model