Beyond Referring Expressions: Scenario Comprehension Visual Grounding

πŸ“… 2026-04-02
πŸ“ˆ Citations: 0
✨ Influential: 0
πŸ€– AI Summary
This work addresses the limitations of existing visual grounding benchmarks, which predominantly rely on explicit referring expressions and thus fail to evaluate models’ deep understanding of characters, intentions, and relational context. To this end, the authors propose a novel scene-understanding-driven paradigm for visual grounding and introduce RSC, the first benchmark built upon paragraph-level scene descriptions. RSC features distractors, interpretable difficulty labels, and an out-of-domain test set. They further present ScenGround, a curriculum reasoning framework that integrates supervised warm-up with difficulty-aware reinforcement learning to enhance model generalization and robustness. Experiments reveal a significant performance drop in current models under scene-level queries, whereas ScenGround achieves state-of-the-art results on both the challenging subsets of RSC and standard benchmarks, while enabling fine-grained failure analysis and cross-benchmark transfer.
πŸ“ Abstract
Existing visual grounding benchmarks primarily evaluate alignment between image regions and literal referring expressions, where models can often succeed by matching a prominent named category. We explore a complementary and more challenging setting of scenario-based visual grounding, where the target must be inferred from roles, intentions, and relational context rather than explicit naming. We introduce Referring Scenario Comprehension (RSC), a benchmark designed for this setting. The queries in this benchmark are paragraph-length texts that describe object roles, user goals, and contextual cues, including deliberate references to distractor objects that often require deep understanding to resolve. Each instance is annotated with interpretable difficulty tags for uniqueness, clutter, size, overlap, and position, which expose distinct failure modes and support fine-grained analysis. RSC contains approximately 31k training examples, 4k in-domain test examples, and a 3k out-of-distribution split with unseen object categories. We further propose ScenGround, a curriculum reasoning method serving as a reference point for this setting, combining supervised warm-starting with difficulty-aware reinforcement learning. Experiments show that scenario-based queries expose systematic failures in current models that standard benchmarks do not reveal, and that curriculum training improves performance on challenging slices and transfers to standard benchmarks.
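The abstract does not specify how ScenGround's difficulty-aware curriculum is implemented. As a purely hypothetical sketch of the general idea, the snippet below weights training examples by their difficulty tags (uniqueness, clutter, size, overlap, position) and anneals sampling from easy toward hard examples as training progresses; all function and field names here are illustrative assumptions, not the paper's actual code.

```python
import random

# Hypothetical sketch of difficulty-aware curriculum sampling.
# The real ScenGround training procedure is not detailed in this summary;
# tag names follow the paper's interpretable difficulty labels.

DIFFICULTY_TAGS = ("uniqueness", "clutter", "size", "overlap", "position")

def difficulty(example):
    # Map binary difficulty tags to a scalar in [0, 1].
    return sum(example.get(t, 0) for t in DIFFICULTY_TAGS) / len(DIFFICULTY_TAGS)

def sample_batch(dataset, progress, batch_size, rng=random):
    # Weight each example by closeness of its difficulty to the current
    # curriculum target: easy examples early (progress=0), hard late (progress=1).
    weights = [1.0 - abs(difficulty(ex) - progress) for ex in dataset]
    return rng.choices(dataset, weights=weights, k=batch_size)

data = [
    {"query": "easy example", "clutter": 0},
    {"query": "hard example", "clutter": 1, "overlap": 1, "position": 1},
]
early = sample_batch(data, progress=0.0, batch_size=4, rng=random.Random(0))
late = sample_batch(data, progress=1.0, batch_size=4, rng=random.Random(0))
```

At `progress=1.0` the easy example receives weight zero, so late-stage batches consist only of hard examples; a real schedule would combine this with the RL reward rather than plain sampling.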
Problem

Research questions and friction points this paper is trying to address.

visual grounding
scenario comprehension
referring expressions
contextual reasoning
object inference
Innovation

Methods, ideas, or system contributions that make the work stand out.

scenario-based visual grounding
Referring Scenario Comprehension
curriculum reasoning
difficulty-aware reinforcement learning
relational context understanding
Ruozhen He, Rice University
Nisarg A. Shah, Johns Hopkins University
Qihua Dong, Northeastern University
Zilin Xiao, Rice University
Jaywon Koo, Rice University (Computer Vision, Natural Language Processing)
Vicente Ordonez, Rice University