preprint

Beyond Referring Expressions: Scenario Comprehension Visual Grounding

Ruozhen He, Nisarg A. Shah, Qihua Dong, Zilin Xiao, Jaywon Koo, Vicente Ordonez
arXiv:2604.02323
Lab News Desk

News Release Summary

This section is intentionally written in a reporter-style news release voice for general readers.

Researchers from Rice University, Johns Hopkins University, and Northeastern University have identified a significant gap in how visual AI systems are tested: standard benchmarks for "visual grounding" (the ability to match a text description to a region in an image) typically use short, literal phrases like "the brown leather glove held by the catcher," which models can often solve simply by recognizing a named object category.

To stress-test whether models can handle more realistic, roundabout language, the team built a new benchmark called Referring Scenario Comprehension (RSC), in which each query is a paragraph-length description written from a user's perspective, for example describing someone trying to check the time at a bus stop without ever mentioning the word "clock." The benchmark contains roughly 38,000 annotated examples drawn from MS-COCO and LVIS images, includes a held-out test set with entirely unseen object categories, and tags each instance along five difficulty axes covering clutter, object size, overlap, position, and whether the target category appears multiple times in the scene.

When the team evaluated a range of current vision-language models on RSC, including GPT-4o, Claude 3.7, and several open-source systems, all of them struggled: even the best off-the-shelf model scored well below 30% localization accuracy, compared to over 60% for the authors' purpose-built system. That system, called ScenGround, combines supervised fine-tuning on easier examples, which establishes a reasoning schema, with a reinforcement learning stage that progressively feeds the model harder, more ambiguous cases.

The work matters because it demonstrates that impressive scores on existing grounding benchmarks can mask a model's near-total inability to handle the kind of indirect, goal-driven language people naturally use when describing what they need.
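
To make the benchmark's structure concrete, here is a minimal sketch of what a single RSC instance might look like as a data record. This is an illustrative assumption, not the dataset's published schema: the field names (image_id, split, query, bbox, difficulty), the tag values, and the query text (a paraphrase of the bus-stop example above) are all hypothetical.

```python
# Hypothetical shape of one RSC instance; all field names and tag values
# are illustrative assumptions, not the dataset's published schema.
example_instance = {
    "image_id": "coco_000000139",       # images are drawn from MS-COCO and LVIS
    "split": "train",                   # ~31k train / ~4k in-domain test / ~3k OOD test
    "query": (
        "I just got off the bus and I am worried I am late for my meeting. "
        "My phone is dead, so I scan the shelter and the storefronts for "
        "anything that could tell me the time."
    ),                                  # paragraph-length and goal-driven; never says "clock"
    "target_category": "clock",         # the OOD split uses entirely unseen categories
    "bbox": [312.0, 45.5, 58.0, 58.0],  # ground-truth region, e.g. [x, y, w, h]
    "difficulty": {                     # the five interpretable difficulty axes
        "uniqueness": "multi_instance", # target category appears more than once
        "clutter": "high",
        "size": "small",
        "overlap": "partial",
        "position": "peripheral",
    },
}
```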

abstract

Existing visual grounding benchmarks primarily evaluate alignment between image regions and literal referring expressions, where models can often succeed by matching a prominent named category. We explore a complementary and more challenging setting, scenario-based visual grounding, where the target must be inferred from roles, intentions, and relational context rather than explicit naming. We introduce Referring Scenario Comprehension (RSC), a benchmark designed for this setting. Its queries are paragraph-length texts that describe object roles, user goals, and contextual cues, including deliberate references to distractor objects that often require deep understanding to resolve. Each instance is annotated with interpretable difficulty tags for uniqueness, clutter, size, overlap, and position, which expose distinct failure modes and support fine-grained analysis. RSC contains approximately 31k training examples, 4k in-domain test examples, and a 3k out-of-distribution split with unseen object categories. We further propose ScenGround, a curriculum reasoning method that combines supervised warm-starting with difficulty-aware reinforcement learning and serves as a reference point for this setting. Experiments show that scenario-based queries expose systematic failures in current models that standard benchmarks do not reveal, and that curriculum training improves performance on challenging slices and transfers to standard benchmarks.
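
The localization accuracies quoted above presumably follow the standard grounding protocol: a prediction counts as correct when its box overlaps the ground-truth box with intersection-over-union (IoU) at or above a threshold, conventionally 0.5. The sketch below implements that standard metric; the exact threshold used by the paper is an assumption here.

```python
def iou(box_a, box_b):
    """Intersection-over-union of two [x, y, w, h] boxes."""
    ax1, ay1, aw, ah = box_a
    bx1, by1, bw, bh = box_b
    ix1, iy1 = max(ax1, bx1), max(ay1, by1)
    ix2, iy2 = min(ax1 + aw, bx1 + bw), min(ay1 + ah, by1 + bh)
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = aw * ah + bw * bh - inter
    return inter / union if union > 0 else 0.0

def localization_accuracy(predicted, ground_truth, threshold=0.5):
    """Fraction of examples whose predicted box matches at IoU >= threshold."""
    hits = sum(iou(p, g) >= threshold for p, g in zip(predicted, ground_truth))
    return hits / len(ground_truth)
```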
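ScenGround's difficulty-aware reinforcement learning stage is described here only at a high level (progressively feeding the model harder, more ambiguous cases), so the following is a speculative sketch of one way such a curriculum could be scheduled, assuming each instance carries the difficulty tags sketched earlier and that they can be collapsed into a scalar score. The functions difficulty_score and curriculum_batches, and the tag weighting, are hypothetical, not the authors' algorithm.

```python
import random

def difficulty_score(tags):
    """Hypothetical scalar difficulty (0 = easy .. 5 = hard) from the five
    tags; the paper's actual scoring rule, if any, is not given here."""
    hard_values = {"multi_instance", "high", "small", "partial", "peripheral"}
    return sum(value in hard_values for value in tags.values())

def curriculum_batches(instances, num_stages=3, batch_size=32):
    """Yield training batches whose difficulty ceiling rises stage by stage,
    so the RL phase sees easy, unambiguous cases before hard ones."""
    ranked = sorted(instances, key=lambda ex: difficulty_score(ex["difficulty"]))
    for stage in range(1, num_stages + 1):
        # each stage unlocks a larger, harder prefix of the ranked pool
        pool = ranked[: len(ranked) * stage // num_stages]
        random.shuffle(pool)
        for i in range(0, len(pool), batch_size):
            yield pool[i : i + batch_size]
```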

details

comment
20 pages, 18 figures, Project Page: https://catherine-r-he.github.io/RSC/

citation

@article{hebeyond,
  title = {Beyond Referring Expressions: Scenario Comprehension Visual Grounding},
  author = {He, Ruozhen and Shah, Nisarg A. and Dong, Qihua and Xiao, Zilin and Koo, Jaywon and Ordonez, Vicente},
  journal = {arXiv preprint arXiv:2604.02323},
  url = {https://arxiv.org/abs/2604.02323},
}