RefereeBench: Are Video MLLMs Ready to be Multi-Sport Referees?
Pith reviewed 2026-05-10 08:36 UTC · model grok-4.3
The pith
RefereeBench shows that even the strongest video MLLMs reach only around 60% accuracy on multi-sport refereeing tasks and struggle with rule application and temporal grounding.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Extensive evaluations of state-of-the-art MLLMs show that even the strongest models, such as Doubao-Seed-1.8 and Gemini-3-Pro, achieve only around 60% accuracy, while the strongest open-source model, Qwen3-VL, reaches only 47%. These results indicate that current models remain far from being reliable sports referees.
Load-bearing premise
The benchmark is fully human-annotated to ensure high-quality annotations grounded in authentic officiating logic and multimodal evidence (abstract). If the selected videos or question design do not fully capture real referee decision processes or contain annotation biases, the performance numbers would not reliably indicate readiness for officiating.
Original abstract
While Multimodal Large Language Models (MLLMs) excel at generic video understanding, their ability to support specialized, rule-grounded decision-making remains insufficiently explored. In this paper, we introduce RefereeBench, the first large-scale benchmark for evaluating MLLMs as automatic sports referees. Spanning 11 sports with 925 curated videos and 6,475 QA pairs, RefereeBench evaluates five core officiating abilities: foul existence, foul and penalty classification, foul and penalty reasoning, entity perception, and temporal grounding. The benchmark is fully human-annotated to ensure high-quality annotations grounded in authentic officiating logic and multimodal evidence. Extensive evaluations of state-of-the-art MLLMs show that even the strongest models, such as Doubao-Seed-1.8 and Gemini-3-Pro, achieve only around 60% accuracy, while the strongest open-source model, Qwen3-VL, reaches only 47%. These results indicate that current models remain far from being reliable sports referees. Further analysis shows that while models can often identify incidents and involved entities, they struggle with rule application and temporal grounding, and frequently over-call fouls on normal clips. Our benchmark highlights the need for future MLLMs that better integrate domain knowledge and multimodal understanding, advancing trustworthy AI-assisted officiating and broader multimodal decision-making.
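The abstract describes scoring models on QA pairs grouped by five officiating abilities. As a minimal sketch of that kind of per-ability aggregation (the record format, ability names, and answers below are invented for illustration; RefereeBench's actual data schema is not given in this review):

```python
from collections import defaultdict

# Hypothetical QA-result records: (ability, model_answer, gold_answer).
# These example values are made up; they only illustrate how per-ability
# exact-match accuracy of the kind the paper reports could be computed.
results = [
    ("foul_existence",      "yes",         "yes"),
    ("foul_classification", "holding",     "tripping"),
    ("penalty_reasoning",   "free kick",   "free kick"),
    ("entity_perception",   "player 7",    "player 7"),
    ("temporal_grounding",  "00:12-00:15", "00:41-00:44"),
]

def per_ability_accuracy(records):
    """Group records by officiating ability and compute exact-match accuracy."""
    correct, total = defaultdict(int), defaultdict(int)
    for ability, pred, gold in records:
        total[ability] += 1
        correct[ability] += int(pred == gold)
    return {a: correct[a] / total[a] for a in total}

print(per_ability_accuracy(results))
```

An overall score in this scheme would average either over all QA pairs or over the five ability buckets; which averaging the paper uses is not stated in the review.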
Editorial analysis
A structured set of objections, weighed in public.
Circularity Check
No circularity: purely empirical benchmark with independent human annotations and model evaluations
Full rationale
The paper introduces RefereeBench as a new dataset of 925 videos and 6,475 human-annotated QA pairs across 11 sports, then reports direct accuracy measurements on existing MLLMs (e.g., Doubao-Seed-1.8 and Gemini-3-Pro at ~60%, Qwen3-VL at 47%). No derivations, fitted parameters, equations, or self-referential claims appear; the central conclusion follows from the observed scores relative to the benchmark design. The absence of a human expert baseline affects interpretability but does not create circularity, as the reported numbers are not forced by any internal fit or prior self-citation. The work is self-contained, depending only on external model checkpoints and the newly collected annotations.
Axiom & Free-Parameter Ledger
Axioms (1)
- Domain assumption: human annotations accurately reflect authentic officiating logic and multimodal evidence.