Beyond Referring Expressions: Scenario Comprehension Visual Grounding
News Release Summary
Researchers from Rice University, Johns Hopkins University, and Northeastern University have identified a significant gap in how visual AI systems are tested: standard benchmarks for "visual grounding" — the ability to match a text description to a region in an image — typically use short, literal phrases like "the brown leather glove held by the catcher," which models can often solve simply by recognizing a named object category. To stress-test whether models can handle more realistic, roundabout language, the team built a new benchmark called Referring Scenario Comprehension (RSC), where each query is a paragraph-length description written from a user's perspective — for example, describing someone trying to check the time at a bus stop without ever mentioning the word "clock." The benchmark contains roughly 38,000 annotated examples drawn from MS-COCO and LVIS images, includes a held-out test set with entirely unseen object categories, and tags each instance along five difficulty axes covering clutter, object size, overlap, position, and whether the target category appears multiple times in the scene. When the team evaluated a range of current vision-language models on RSC — including GPT-4o, Claude 3.7, and several open-source systems — all struggled badly, with even the best off-the-shelf model scoring well below 30% localization accuracy, compared to over 60% for the authors' purpose-built system. That system, called ScenGround, combines supervised fine-tuning on easier examples to establish a reasoning schema with a reinforcement learning stage that progressively feeds the model harder, more ambiguous cases. The work matters because it demonstrates that impressive scores on existing grounding benchmarks can mask a model's near-total inability to handle the kind of indirect, goal-driven language people naturally use when describing what they need.
Abstract
Existing visual grounding benchmarks primarily evaluate alignment between image regions and literal referring expressions, where models can often succeed by matching a prominent named category. We explore a complementary and more challenging setting of scenario-based visual grounding, where the target must be inferred from roles, intentions, and relational context rather than explicit naming. We introduce Referring Scenario Comprehension (RSC), a benchmark designed for this setting. The queries in this benchmark are paragraph-length texts that describe object roles, user goals, and contextual cues, including deliberate references to distractor objects that often require deep understanding to resolve. Each instance is annotated with interpretable difficulty tags for uniqueness, clutter, size, overlap, and position which expose distinct failure modes and support fine-grained analysis. RSC contains approximately 31k training examples, 4k in-domain test examples, and a 3k out-of-distribution split with unseen object categories. We further propose ScenGround, a curriculum reasoning method serving as a reference point for this setting, combining supervised warm-starting with difficulty-aware reinforcement learning. Experiments show that scenario-based queries expose systematic failures in current models that standard benchmarks do not reveal, and that curriculum training improves performance on challenging slices and transfers to standard benchmarks.
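To make the instance structure described above concrete, one RSC-style record could be represented roughly as follows. This is a minimal sketch, not the released schema: the class name `ScenarioInstance`, every field name, the binary encoding of the tags, and the example scenario text are assumptions made for illustration. Only the five axis names come from the abstract.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple

# The five difficulty axes named in the abstract.
DIFFICULTY_AXES = ("uniqueness", "clutter", "size", "overlap", "position")


@dataclass
class ScenarioInstance:
    """Hypothetical record layout; field names are illustrative only."""
    image_id: str
    scenario: str          # paragraph-length query that never names the target
    target_category: str   # ground-truth category the model must infer
    target_box: Tuple[float, float, float, float]  # assumed (x1, y1, x2, y2) in pixels
    # Assumed encoding: axis name -> True if the instance is hard on that axis.
    tags: Dict[str, bool] = field(default_factory=dict)

    def hard_axes(self) -> List[str]:
        """Return the axes on which this instance is flagged as difficult."""
        return [axis for axis in DIFFICULTY_AXES if self.tags.get(axis, False)]


# Invented example in the spirit of the bus-stop scenario from the summary.
example = ScenarioInstance(
    image_id="demo-0001",
    scenario=(
        "I am waiting at the bus stop and need to know whether I still have "
        "time to buy a coffee. I am not looking at the timetable poster or "
        "the ticket machine, but at the round object mounted above the shelter."
    ),
    target_category="clock",
    target_box=(212.0, 40.0, 260.0, 88.0),
    tags={"uniqueness": False, "clutter": True, "size": True},
)
```

Keeping the tags as a per-axis mapping makes it straightforward to slice evaluation results by axis, which is the kind of fine-grained analysis the abstract describes.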
Citation
@article{hebeyond,
  title   = {Beyond Referring Expressions: Scenario Comprehension Visual Grounding},
  author  = {He, Ruozhen and Shah, Nisarg A. and Dong, Qihua and Xiao, Zilin and Koo, Jaywon and Ordonez, Vicente},
  journal = {arXiv:2604.02323},
  url     = {https://arxiv.org/abs/2604.02323},
}
Automatically generated questions, main contributions, and limitations of this paper
Questions this paper helps answer
- What is RSC and how does it differ from benchmarks like RefCOCO? RSC replaces short literal referring phrases with paragraph-length scenario queries that describe a user role, goal, and at least three disambiguating cues, and deliberately name distractor objects; models must predict both the target category and a bounding box without being told the category name in the query.
- How do current state-of-the-art models perform on RSC? Closed-source models like GPT-4o and Claude 3.7 achieve high category accuracy but very low localization accuracy on RSC, with GPT-4o reaching only 13.23 percent Acc@0.5 on the in-domain split, while the proposed ScenGround method reaches 60.90 percent Acc@0.5 on the same split.
- What is ScenGround and how does it work? ScenGround is a two-stage curriculum training method built on Qwen2.5-VL-7B: Stage 1 is a supervised fine-tuning step on easier RSC slices to align the model to the reasoning schema, and Stage 2 applies difficulty-aware GRPO reinforcement learning with shaped IoU and alias-aware category rewards, progressively sampling harder instances.
- Does training on RSC transfer to standard referring expression benchmarks? Yes, ScenGround's GRPO stage improves Acc@0.5 on RefCOCO+ validation from 52.54 to 70.16 percent and on RefCOCOg validation from 52.46 to 78.19 percent when using the same custom prompt, suggesting the curriculum develops transferable disambiguation skills.
- What does the out-of-distribution split test and what do results show? The OOD split uses LVIS categories with no overlap with COCO training categories, testing cross-category generalization; ScenGround achieves 38.11 percent Acc@0.5 on OOD compared to 15.88 percent for the base Qwen2.5-VL model, but OOD category naming accuracy remains close to baseline, indicating that spatial grounding generalizes better than semantic naming under category shift.
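The localization numbers quoted above (Acc@0.5 and, later, mIoU) are both derived from the intersection-over-union between a predicted box and the ground-truth box. The sketch below shows the standard computation under stated assumptions: boxes are in `(x1, y1, x2, y2)` corner format, a prediction counts as correct when IoU is at least the threshold, and both metrics are reported as percentages. The function names are illustrative and are not taken from the paper's evaluation code.

```python
from typing import Sequence, Tuple

Box = Tuple[float, float, float, float]  # assumed (x1, y1, x2, y2) corner format


def iou(a: Box, b: Box) -> float:
    """Intersection-over-union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = max(0.0, a[2] - a[0]) * max(0.0, a[3] - a[1])
    area_b = max(0.0, b[2] - b[0]) * max(0.0, b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def acc_at(preds: Sequence[Box], gts: Sequence[Box], threshold: float = 0.5) -> float:
    """Percentage of predictions whose IoU with the ground truth is >= threshold."""
    if not gts or len(preds) != len(gts):
        raise ValueError("preds and gts must be non-empty and the same length")
    hits = sum(iou(p, g) >= threshold for p, g in zip(preds, gts))
    return 100.0 * hits / len(gts)


def mean_iou(preds: Sequence[Box], gts: Sequence[Box]) -> float:
    """Mean IoU over all instances, as a percentage."""
    if not gts or len(preds) != len(gts):
        raise ValueError("preds and gts must be non-empty and the same length")
    return 100.0 * sum(iou(p, g) for p, g in zip(preds, gts)) / len(gts)


# Toy data: one exact match and one box shifted by half its width.
preds = [(0.0, 0.0, 10.0, 10.0), (5.0, 0.0, 15.0, 10.0)]
gts = [(0.0, 0.0, 10.0, 10.0), (0.0, 0.0, 10.0, 10.0)]
print(acc_at(preds, gts), mean_iou(preds, gts))
```

In the toy data the shifted box has an IoU of one third, so it counts toward mIoU but not toward Acc@0.5. A model can therefore identify roughly the right region and still score zero on the thresholded metric.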
Main contributions
- RSC introduces scenario-based visual grounding queries averaging 52.7 words, more than six times longer than RefCOCO queries, with per-instance difficulty tags across five axes, per-instance reasoning trace annotations, and a strictly disjoint out-of-distribution test split drawn from LVIS.
- The benchmark exposes a systematic failure mode in current vision-language models: models with strong category understanding tend to localize poorly, and models with strong detection capabilities lack the semantic reasoning needed for scenario-based queries.
- ScenGround demonstrates that a tag-aware curriculum combining supervised warm-starting with difficulty-progressive reinforcement learning substantially improves both in-domain and out-of-distribution localization, raising the base model's mIoU on RSC-ID from 30.31 to 55.68.
- A human audit of 300 instances by three annotators yielded 95.7 percent majority-vote accuracy with a Fleiss' kappa of 0.94, supporting the reliability of the benchmark annotations.
- The paper provides a controlled ablation showing that curriculum ordering matters: mixing easy and hard instances in a single GRPO stage yields lower performance than the two-stage easy-to-hard curriculum, consistent with the reward sparsity explanation offered by the authors.
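The easy-to-hard idea behind the curriculum can be illustrated with a small sampler whose preferences shift as training progresses. This is a hypothetical sketch, not the authors' procedure: the scalar difficulty score in [0, 1] (for instance, the fraction of difficulty tags flagged as hard), the exponential weighting, and the `sharpness` parameter are all assumptions introduced here.

```python
import math
import random
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def curriculum_weights(
    difficulties: Sequence[float], progress: float, sharpness: float = 4.0
) -> List[float]:
    """Sampling weights that favor easy items at progress=0 and hard items at progress=1.

    `difficulties` are assumed to lie in [0, 1]; `progress` is the fraction
    of training completed. Both conventions are illustrative assumptions.
    """
    # Items whose difficulty is close to the current progress get the most weight.
    return [math.exp(-sharpness * abs(d - progress)) for d in difficulties]


def sample_batch(
    items: Sequence[T],
    difficulties: Sequence[float],
    progress: float,
    batch_size: int,
    rng: random.Random,
) -> List[T]:
    """Draw a batch (with replacement) according to the curriculum weights."""
    weights = curriculum_weights(difficulties, progress)
    return rng.choices(list(items), weights=weights, k=batch_size)


items = ["easy_a", "easy_b", "medium", "hard_a", "hard_b"]
difficulties = [0.0, 0.2, 0.5, 0.8, 1.0]
rng = random.Random(0)
early_batch = sample_batch(items, difficulties, progress=0.0, batch_size=8, rng=rng)
late_batch = sample_batch(items, difficulties, progress=1.0, batch_size=8, rng=rng)
```

Early batches drawn this way are dominated by easy instances, where a reinforcement learning policy is more likely to receive non-zero reward. That is one way to read the reward sparsity explanation mentioned above. The mixed single-stage baseline from the ablation would correspond to uniform weights throughout.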
Limitations and cautions
- ScenGround's out-of-distribution category naming accuracy is still close to the untuned baseline, which usefully separates semantic naming from spatial grounding; the strong localization gains suggest the curriculum is already improving an important part of the harder scenario-comprehension problem.
- RSC uses GPT-4o to generate scenarios and Gemini-2.5-Pro as a quality judge, with a human audit validating a sampled subset; broader human review could further strengthen the benchmark, but the reported 95.7 percent majority-vote accuracy and high agreement provide reassuring evidence that the annotations are reliable.
- RSC currently focuses on static, single-object, exocentric grounding, which makes the benchmark precise and analyzable; multi-object, temporal, and interactive grounding are natural extensions that build on the same scenario-comprehension idea.
- The Grounding DINO comparison uses oracle category inputs, so it is best read as an informative upper-bound reference rather than a direct deployment comparison; this still helps clarify how much of the challenge comes from scenario understanding versus object localization.
- The benchmark is built from MS-COCO and LVIS natural images, leaving other domains such as medical images, GUIs, and satellite imagery for future study; within its chosen domain, the in-domain and out-of-distribution splits already reveal a meaningful evaluation gap.
How to read this result
This paper is best read as a strong and timely contribution to visual grounding: it defines a realistic scenario-comprehension challenge, backs it with a carefully validated benchmark and controlled experiments, and shows that curriculum reasoning can substantially improve localization even while leaving rich opportunities for future generalization work.