Investigating the Design Space of Visual Grounding in Multimodal Large Language Model

📅 2025-08-11
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Prior work lacks systematic empirical validation of design choices for multimodal large language models (MLLMs) in visual grounding (VG), particularly regarding architectural paradigms, training strategies, and data construction. Method: Building upon LLaVA-1.5, we conduct comprehensive ablation studies and multi-paradigm comparisons—including bounding-box generation versus coordinate regression—and introduce fine-grained vision-language alignment via targeted data curation and instruction tuning. Contribution/Results: This work presents the first holistic empirical analysis of the VG design space for MLLMs, yielding an optimized VG-specific architecture and data construction pipeline. Our approach jointly enhances fine-grained localization accuracy and cross-dataset generalization. On RefCOCO, RefCOCO+, and RefCOCOg, it achieves absolute improvements of +5.6%, +6.9%, and +7.0% over LLaVA-1.5, respectively, substantially advancing the state of the art in open-domain visual grounding.

Technology Category

Application Category

📝 Abstract
Fine-grained multimodal capability in Multimodal Large Language Models (MLLMs) has emerged as a critical research direction, particularly for tackling the visual grounding (VG) problem. Despite the strong performance achieved by existing approaches, they often employ disparate design choices when fine-tuning MLLMs for VG, lacking systematic verification to support these designs. To bridge this gap, this paper presents a comprehensive study of various design choices that impact the VG performance of MLLMs. We conduct our analysis using LLaVA-1.5, which has been widely adopted in prior empirical studies of MLLMs. While more recent models exist, we follow this convention to ensure our findings remain broadly applicable and extendable to other architectures. We cover two key aspects: (1) exploring different visual grounding paradigms in MLLMs, identifying the most effective design, and providing our insights; and (2) conducting ablation studies on the design of grounding data to optimize MLLMs' fine-tuning for the VG task. Finally, our findings contribute to a stronger MLLM for VG, achieving improvements of +5.6% / +6.9% / +7.0% on RefCOCO/+/g over the LLaVA-1.5.
Problem

Research questions and friction points this paper is trying to address.

Systematically analyzing design choices for visual grounding in MLLMs
Identifying optimal visual grounding paradigms in MLLMs
Optimizing grounding data design to enhance MLLM fine-tuning
Innovation

Methods, ideas, or system contributions that make the work stand out.

Comprehensive study of MLLM design choices
Exploration of visual grounding paradigms
Ablation studies on grounding data design