🤖 AI Summary
Visual grounding aims to localize target objects in an image given a natural language description. Existing approaches typically rely on iterative language-vision interaction, requiring repeated caching of multimodal features, which leads to high computational overhead and slow inference. This paper proposes an efficient end-to-end baseline that eliminates iteration by introducing a **language-guided parallel multimodal interaction mechanism**, where language serves as a concurrent steering signal for visual feature processing. Additionally, the authors design a **similarity-driven visual feature selection module** that dynamically identifies and retains only language-relevant image regions, substantially reducing redundant computation. By fusing multimodal features under parallel linguistic guidance, the method achieves fast and accurate localization. On mainstream benchmarks, including RefCOCO, RefCOCO+, and RefCOCOg, the approach delivers strong grounding accuracy while significantly accelerating inference, striking a favorable trade-off between precision and efficiency.
📝 Abstract
Visual grounding aims to predict the locations of target objects specified by textual descriptions. For this task involving linguistic and visual modalities, a recent line of research focuses on selecting only the language-relevant visual regions for object localization, thereby reducing computational overhead. Although this approach achieves impressive performance, it operates iteratively over different image scales, and at every iteration the linguistic and visual features must be stored in a cache, incurring extra overhead. To simplify the implementation, in this paper we propose a simple yet effective feature-selection-based baseline for visual grounding, called FSVG. Specifically, we directly encapsulate the linguistic and visual modalities into an overall network architecture without complicated iterative procedures, and use the language in parallel as guidance to facilitate the interaction between the linguistic and visual modalities for extracting effective visual features. Furthermore, to reduce the computational cost, we introduce a similarity-based feature selection mechanism during visual feature learning that exploits only language-related visual features for faster prediction. Extensive experiments conducted on several benchmark datasets comprehensively substantiate that the proposed FSVG achieves a better balance between accuracy and efficiency than current state-of-the-art methods. Code is available at https://github.com/jcwang0602/FSVG.
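To make the similarity-based feature selection idea concrete, here is a minimal sketch of one plausible realization: score each visual token by cosine similarity to a pooled language embedding and keep only the top fraction. The function name, the top-k cosine criterion, and the `keep_ratio` parameter are assumptions for illustration, not the paper's actual implementation.

```python
import numpy as np

def select_language_relevant_tokens(visual_tokens, lang_embedding, keep_ratio=0.5):
    """Keep only the visual tokens most similar to the pooled language embedding.

    visual_tokens:  (N, D) array of visual token features.
    lang_embedding: (D,) pooled language feature vector.
    Returns the kept tokens and their indices (in original spatial order).
    """
    # Cosine similarity between each visual token and the language embedding.
    v = visual_tokens / (np.linalg.norm(visual_tokens, axis=1, keepdims=True) + 1e-8)
    l = lang_embedding / (np.linalg.norm(lang_embedding) + 1e-8)
    sim = v @ l                                    # (N,) similarity scores

    # Retain the top-k most language-relevant tokens, preserving their order.
    k = max(1, int(round(keep_ratio * len(sim))))
    kept = np.sort(np.argsort(sim)[::-1][:k])
    return visual_tokens[kept], kept
```

Downstream fusion and box prediction would then operate only on the kept tokens, which is where the claimed speedup comes from: the cost of later interaction layers scales with the number of retained tokens rather than the full feature map.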