Beyond Referring Expressions: Scenario Comprehension Visual Grounding

πŸ“… 2026-04-02
πŸ“ˆ Citations: 0
✨ Influential: 0
πŸ€– AI Summary
This work addresses the limitations of existing visual grounding benchmarks, which predominantly rely on explicit referring expressions and thus fail to evaluate models’ deep understanding of characters, intentions, and relational context. To this end, the authors propose a novel scene-understanding-driven paradigm for visual grounding and introduce RSC, the first benchmark built upon paragraph-level scene descriptions. RSC features distractors, interpretable difficulty labels, and an out-of-domain test set. They further present ScenGround, a curriculum reasoning framework that integrates supervised warm-up with difficulty-aware reinforcement learning to enhance model generalization and robustness. Experiments reveal a significant performance drop in current models under scene-level queries, whereas ScenGround achieves state-of-the-art results on both the challenging subsets of RSC and standard benchmarks, while enabling fine-grained failure analysis and cross-benchmark transfer.
πŸ“ Abstract
Existing visual grounding benchmarks primarily evaluate alignment between image regions and literal referring expressions, where models can often succeed by matching a prominent named category. We explore a complementary and more challenging setting of scenario-based visual grounding, where the target must be inferred from roles, intentions, and relational context rather than explicit naming. We introduce Referring Scenario Comprehension (RSC), a benchmark designed for this setting. The queries in this benchmark are paragraph-length texts that describe object roles, user goals, and contextual cues, including deliberate references to distractor objects that often require deep understanding to resolve. Each instance is annotated with interpretable difficulty tags for uniqueness, clutter, size, overlap, and position, which expose distinct failure modes and support fine-grained analysis. RSC contains approximately 31k training examples, 4k in-domain test examples, and a 3k out-of-distribution split with unseen object categories. We further propose ScenGround, a curriculum reasoning method serving as a reference point for this setting, combining supervised warm-starting with difficulty-aware reinforcement learning. Experiments show that scenario-based queries expose systematic failures in current models that standard benchmarks do not reveal, and that curriculum training improves performance on challenging slices and transfers to standard benchmarks.
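The abstract does not specify how ScenGround's difficulty-aware curriculum is implemented. As a purely hypothetical sketch of the general idea, the snippet below weights training examples by their difficulty tags (uniqueness, clutter, size, overlap, position) and anneals sampling from easy toward hard examples as training progresses; all function and field names here are illustrative assumptions, not the paper's actual code.

```python
import random

# Hypothetical sketch of difficulty-aware curriculum sampling.
# The real ScenGround training procedure is not detailed in this summary;
# tag names follow the paper's interpretable difficulty labels.

DIFFICULTY_TAGS = ("uniqueness", "clutter", "size", "overlap", "position")

def difficulty(example):
    # Map binary difficulty tags to a scalar in [0, 1].
    return sum(example.get(t, 0) for t in DIFFICULTY_TAGS) / len(DIFFICULTY_TAGS)

def sample_batch(dataset, progress, batch_size, rng=random):
    # Weight each example by closeness of its difficulty to the current
    # curriculum target: easy examples early (progress=0), hard late (progress=1).
    weights = [1.0 - abs(difficulty(ex) - progress) for ex in dataset]
    return rng.choices(dataset, weights=weights, k=batch_size)

data = [
    {"query": "easy example", "clutter": 0},
    {"query": "hard example", "clutter": 1, "overlap": 1, "position": 1},
]
early = sample_batch(data, progress=0.0, batch_size=4, rng=random.Random(0))
late = sample_batch(data, progress=1.0, batch_size=4, rng=random.Random(0))
```

At `progress=1.0` the easy example receives weight zero, so late-stage batches consist only of hard examples; a real schedule would combine this with the RL reward rather than plain sampling.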
Problem

Research questions and friction points this paper is trying to address.

visual grounding
scenario comprehension
referring expressions
contextual reasoning
object inference
Innovation

Methods, ideas, or system contributions that make the work stand out.

scenario-based visual grounding
Referring Scenario Comprehension
curriculum reasoning
difficulty-aware reinforcement learning
relational context understanding
Ruozhen He, Rice University
Nisarg A. Shah, Johns Hopkins University
Qihua Dong, Northeastern University
Zilin Xiao, Rice University
Jaywon Koo, Rice University (Computer Vision, Natural Language Processing)
Vicente Ordonez, Rice University