RefereeBench: Are Video MLLMs Ready to be Multi-Sport Referees?
Pith reviewed 2026-05-10 08:36 UTC · model grok-4.3
The pith
RefereeBench shows that even the strongest video MLLMs reach only around 60% accuracy on multi-sport refereeing tasks and struggle with rule application and temporal grounding.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Extensive evaluations of state-of-the-art MLLMs show that even the strongest models, such as Doubao-Seed-1.8 and Gemini-3-Pro, achieve only around 60% accuracy, while the strongest open-source model, Qwen3-VL, reaches only 47%. These results indicate that current models remain far from being reliable sports referees.
Load-bearing premise
The benchmark is fully human-annotated to ensure high-quality annotations grounded in authentic officiating logic and multimodal evidence (abstract). If the selected videos or question design do not fully capture real referee decision processes or contain annotation biases, the performance numbers would not reliably indicate readiness for officiating.
Original abstract
While Multimodal Large Language Models (MLLMs) excel at generic video understanding, their ability to support specialized, rule-grounded decision-making remains insufficiently explored. In this paper, we introduce RefereeBench, the first large-scale benchmark for evaluating MLLMs as automatic sports referees. Spanning 11 sports with 925 curated videos and 6,475 QA pairs, RefereeBench evaluates five core officiating abilities: foul existence, foul and penalty classification, foul and penalty reasoning, entity perception, and temporal grounding. The benchmark is fully human-annotated to ensure high-quality annotations grounded in authentic officiating logic and multimodal evidence. Extensive evaluations of state-of-the-art MLLMs show that even the strongest models, such as Doubao-Seed-1.8 and Gemini-3-Pro, achieve only around 60% accuracy, while the strongest open-source model, Qwen3-VL, reaches only 47%. These results indicate that current models remain far from being reliable sports referees. Further analysis shows that while models can often identify incidents and involved entities, they struggle with rule application and temporal grounding, and frequently over-call fouls on normal clips. Our benchmark highlights the need for future MLLMs that better integrate domain knowledge and multimodal understanding, advancing trustworthy AI-assisted officiating and broader multimodal decision-making.
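The abstract describes scoring models on QA pairs grouped by five officiating abilities. As a minimal sketch of that kind of per-ability aggregation (the record format, ability names, and answers below are invented for illustration; RefereeBench's actual data schema is not given in this review):

```python
from collections import defaultdict

# Hypothetical QA-result records: (ability, model_answer, gold_answer).
# These example values are made up; they only illustrate how per-ability
# exact-match accuracy of the kind the paper reports could be computed.
results = [
    ("foul_existence",      "yes",         "yes"),
    ("foul_classification", "holding",     "tripping"),
    ("penalty_reasoning",   "free kick",   "free kick"),
    ("entity_perception",   "player 7",    "player 7"),
    ("temporal_grounding",  "00:12-00:15", "00:41-00:44"),
]

def per_ability_accuracy(records):
    """Group records by officiating ability and compute exact-match accuracy."""
    correct, total = defaultdict(int), defaultdict(int)
    for ability, pred, gold in records:
        total[ability] += 1
        correct[ability] += int(pred == gold)
    return {a: correct[a] / total[a] for a in total}

print(per_ability_accuracy(results))
```

An overall score in this scheme would average either over all QA pairs or over the five ability buckets; which averaging the paper uses is not stated in the review.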
Editorial analysis
A structured set of objections, weighed in public.
Circularity Check
No circularity: purely empirical benchmark with independent human annotations and model evaluations
Full rationale
The paper introduces RefereeBench as a new dataset of 925 videos and 6,475 human-annotated QA pairs across 11 sports, then reports direct accuracy measurements on existing MLLMs (e.g., Doubao-Seed-1.8 and Gemini-3-Pro at ~60%, Qwen3-VL at 47%). No derivations, fitted parameters, equations, or self-referential claims appear; the central conclusion follows from the observed scores relative to the benchmark design. The absence of a human expert baseline affects interpretability but does not create circularity, as the reported numbers are not forced by any internal fit or prior self-citation. The work is self-contained, depending only on external model checkpoints and the newly collected annotations.
Axiom & Free-Parameter Ledger
Axioms (1)
- Domain assumption: human annotations accurately reflect authentic officiating logic and multimodal evidence.