SportR: A Benchmark for Multimodal Large Language Model Reasoning in Sports

Haotian Xia; Haonan Ge; Junbo Zou; Hyun Woo Choi; Xuebin Zhang; Danny Suradja; Botao Rui; Ethan Tran; Wendy Jin; Zhen Ye; Xiyang Lin; Christopher Lai; Shengjie Zhang; Junwen Miao; Shichao Chen; Rhys Tracy; Vicente Ordonez; Weining Shen; Hanjie Chen

publication

SportR: A Benchmark for Multimodal Large Language Model Reasoning in Sports

Haotian Xia, Haonan Ge, Junbo Zou, Hyun Woo Choi, Xuebin Zhang, Danny Suradja, Botao Rui, Ethan Tran, Wendy Jin, Zhen Ye, Xiyang Lin, Christopher Lai, Shengjie Zhang, Junwen Miao, Shichao Chen, Rhys Tracy, Vicente Ordonez, Weining Shen, Hanjie Chen.

International Conference on Learning Representations. ICLR 2026.

paper pdf raw bibtex

Lab News Desk

News Release Summary

This section is intentionally written in a reporter-style news release voice for general readers.

Researchers from Rice University, UC Irvine, Georgia Tech, Johns Hopkins, and UC Santa Barbara have released SportR, a large-scale benchmark designed to test how well AI systems can reason about sports rules and tactics — not just identify what sport is being played. The benchmark addresses a gap left by existing datasets, which either cover only a single sport, rely too heavily on multiple-choice questions, or lack the fine-grained reasoning annotations needed to train models to think step by step. SportR includes 4,789 images and 2,052 video clips spanning basketball, soccer, table tennis, badminton, and American football, covering 50 foul types and 12 tactical categories. Its most distinctive feature is a set of 6,841 fully human-written Chain-of-Thought explanations — produced by a team of 16 domain experts, including former Division I athletes — that walk through the logic behind rule calls in the style of an experienced referee. The benchmark asks models to do progressively harder things: spot whether a foul occurred, classify it, predict the penalty, explain the reasoning, and, uniquely, output the exact bounding box coordinates of the infraction in a static image. When the team tested leading AI models including GPT-5, Claude 4, and Gemini 2.5 Pro, performance on the hardest tasks was consistently poor, with visual grounding scores below 7% IoU across all baselines. Fine-tuning an open-source model on SportR data improved those scores, but even after supervised fine-tuning and reinforcement learning the grounding metric reached only about 10% — a result the authors say underscores how far current models remain from reliably connecting visual evidence to abstract sports knowledge.

abstract

Deeply understanding sports requires an intricate blend of fine-grained visual perception and rule-based reasoning - a challenge that pushes the limits of current multimodal models. To succeed, models must master three critical capabilities: perceiving nuanced visual details, applying abstract sport rule knowledge, and grounding that knowledge in specific visual evidence. Current sports benchmarks either cover single sports or lack the detailed reasoning chains and precise visual grounding needed to robustly evaluate these core capabilities in a multi-sport context. To address this gap, we introduce SportR, the first multi-sports large-scale benchmark designed to train and evaluate MLLMs on the fundamental reasoning required for sports intelligence. Our benchmark provides a dataset of 4,789 images and 2,052 videos. To enable granular evaluation, we structure our benchmark around a progressive hierarchy of question-answer pairs designed to probe reasoning at increasing depths - from simple infraction identification to complex penalty prediction. For the most advanced tasks requiring multi-step reasoning, such as determining penalties or explaining tactics, we provide 6,841 high-quality, human-authored Chain of Thought annotations. In addition, our benchmark incorporates both image and video modalities and provides manual bounding box annotations to test visual grounding in the image part directly. Extensive experiments demonstrate the profound difficulty of our benchmark. State-of-the-art baseline models perform poorly on our most challenging tasks. While training on our data via Supervised Fine-Tuning and Reinforcement Learning improves these scores, they remain relatively low, highlighting a significant gap in current model capabilities. SportR presents a new challenge for the community, providing a critical resource to drive future research in multimodal sports reasoning.

citation

@inproceedings{xia2026sportr,
  title = {SportR: A Benchmark for Multimodal Large Language Model Reasoning in Sports},
  author = {Xia, Haotian and Ge, Haonan and Zou, Junbo and Choi, Hyun Woo and Zhang, Xuebin and Suradja, Danny and Rui, Botao and Tran, Ethan and Jin, Wendy and Ye, Zhen and Lin, Xiyang and Lai, Christopher and Zhang, Shengjie and Miao, Junwen and Chen, Shichao and Tracy, Rhys and Ordonez, Vicente and Shen, Weining and Chen, Hanjie},
  year = {2026},
  booktitle = {International Conference on Learning Representations. ICLR 2026},
  url = {https://arxiv.org/abs/2511.06499},
}