🤖 AI Summary
Visual grounding aims to localize target objects in an image given a natural language description. Existing approaches typically rely on iterative language-vision interaction, requiring repeated caching of multimodal features, which leads to high computational overhead and slow inference. This paper proposes an efficient end-to-end baseline that eliminates iteration by introducing a **language-guided parallel multimodal interaction mechanism**, where language serves as a concurrent steering signal for visual feature processing. Additionally, the authors design a **similarity-driven visual feature selection module** that dynamically identifies and retains only language-relevant image regions, substantially reducing redundant computation. By fusing multimodal features under parallel linguistic guidance, the method achieves fast and accurate localization. On mainstream benchmarks, including RefCOCO, RefCOCO+, and RefCOCOg, the approach delivers strong grounding accuracy while significantly accelerating inference, striking a favorable trade-off between precision and efficiency.
📝 Abstract
Visual grounding aims to predict the locations of target objects specified by textual descriptions. For this task involving linguistic and visual modalities, a recent line of research focuses on selecting only the language-relevant visual regions for object localization, thereby reducing computational overhead. Although this approach achieves impressive performance, it operates iteratively over different image scales, and at every iteration the linguistic and visual features must be stored in a cache, incurring extra overhead. To simplify the implementation, in this paper we propose a simple yet effective feature-selection-based baseline for visual grounding, called FSVG. Specifically, we directly encapsulate the linguistic and visual modalities into an overall network architecture without complicated iterative procedures, and use the language in parallel as guidance to facilitate the interaction between the linguistic and visual modalities for extracting effective visual features. Furthermore, to reduce the computational cost, we introduce a similarity-based feature selection mechanism during visual feature learning that exploits only language-related visual features for faster prediction. Extensive experiments conducted on several benchmark datasets comprehensively substantiate that the proposed FSVG achieves a better balance between accuracy and efficiency than current state-of-the-art methods. Code is available at https://github.com/jcwang0602/FSVG.
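To make the similarity-based feature selection idea concrete, here is a minimal sketch of one plausible realization: score each visual token by cosine similarity to a pooled language embedding and keep only the top fraction. The function name, the top-k cosine criterion, and the `keep_ratio` parameter are assumptions for illustration, not the paper's actual implementation.

```python
import numpy as np

def select_language_relevant_tokens(visual_tokens, lang_embedding, keep_ratio=0.5):
    """Keep only the visual tokens most similar to the pooled language embedding.

    visual_tokens:  (N, D) array of visual token features.
    lang_embedding: (D,) pooled language feature vector.
    Returns the kept tokens and their indices (in original spatial order).
    """
    # Cosine similarity between each visual token and the language embedding.
    v = visual_tokens / (np.linalg.norm(visual_tokens, axis=1, keepdims=True) + 1e-8)
    l = lang_embedding / (np.linalg.norm(lang_embedding) + 1e-8)
    sim = v @ l                                    # (N,) similarity scores

    # Retain the top-k most language-relevant tokens, preserving their order.
    k = max(1, int(round(keep_ratio * len(sim))))
    kept = np.sort(np.argsort(sim)[::-1][:k])
    return visual_tokens[kept], kept
```

Downstream fusion and box prediction would then operate only on the kept tokens, which is where the claimed speedup comes from: the cost of later interaction layers scales with the number of retained tokens rather than the full feature map.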