pith. machine review for the scientific record.

arxiv: 2604.22226 · v1 · submitted 2026-04-24 · 💻 cs.CV


Towards Temporal Compositional Reasoning in Long-Form Sports Videos

Lu Zhang, Ruizhe Zeng, Siyu Cao, Zhi-yong Liu


Pith reviewed 2026-05-08 12:42 UTC · model grok-4.3

classification 💻 cs.CV
keywords temporal compositional reasoning · long-form sports video · multimodal large language models · temporal evidence grounding · Chain-of-Time Reasoning · step-wise video QA

The pith

Chain-of-Time Reasoning with temporal-reward training and an evidence-seeking loop lets multimodal models compose answers from sparse time-stamped evidence in long sports videos.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper argues that long-form sports video reasoning fails mainly because models lack explicit supervision for locating dispersed evidence and because standard inference does not force them to identify, verify, and compose that evidence step by step. It supplies SportsTime, a benchmark of more than 14,000 open-ended questions paired with more than 50,000 step-wise temporal evidence annotations, and introduces Chain-of-Time Reasoning, which adds a temporal-reward GRPO objective during training and runs an anchor-observe-infer loop at inference time. A sympathetic reader would care because these changes directly target the two bottlenecks the authors identify, producing measurable gains in both final-answer accuracy and the quality of the intermediate temporal grounding that supports each answer.

Core claim

Treating reasoning as temporally grounded evidence composition, implemented by training with a temporal-reward GRPO objective and running an anchor-observe-infer loop that iteratively localizes, verifies, and composes evidence, consistently raises both temporal compositional reasoning accuracy and step-wise grounding quality above strong multimodal LLM baselines on the SportsTime benchmark.
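
The abstract does not spell out the reward itself; the sketch below shows one plausible shape for a temporal-reward GRPO signal, blending answer correctness with span IoU against reference evidence windows and then normalizing rewards within a rollout group, as GRPO does. All function names and the mixing weight alpha are hypothetical, not taken from the paper.

```python
# Hypothetical sketch of a temporal-reward term for GRPO-style training.
# Assumes each rollout emits an answer plus predicted evidence spans; the
# paper's actual reward definition is not given in the extract.

def span_iou(pred, ref):
    """IoU between two (start, end) time windows in seconds."""
    inter = max(0.0, min(pred[1], ref[1]) - max(pred[0], ref[0]))
    union = (pred[1] - pred[0]) + (ref[1] - ref[0]) - inter
    return inter / union if union > 0 else 0.0

def temporal_reward(answer_correct, pred_spans, ref_spans, alpha=0.5):
    """Blend answer accuracy with mean best-match IoU over reference spans."""
    if not ref_spans:
        return float(answer_correct)
    grounding = sum(
        max((span_iou(p, r) for p in pred_spans), default=0.0)
        for r in ref_spans
    ) / len(ref_spans)
    return (1 - alpha) * float(answer_correct) + alpha * grounding

def group_advantages(rewards, eps=1e-6):
    """GRPO normalizes rewards within a rollout group to get advantages."""
    mu = sum(rewards) / len(rewards)
    std = (sum((r - mu) ** 2 for r in rewards) / len(rewards)) ** 0.5
    return [(r - mu) / (std + eps) for r in rewards]
```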

What carries the argument

The anchor-observe-infer evidence-seeking loop, which forces the model to localize candidate time segments, verify their relevance, and compose them before emitting a final answer.
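
The paper names the loop's three phases, but the extract gives no control flow; here is a minimal sketch under assumed interfaces. The calls propose_anchors, observe_clip, and compose_answer, and the confidence-based stopping rule, are invented stand-ins for MLLM prompts, not the authors' API.

```python
# Hypothetical control flow for an anchor-observe-infer evidence-seeking loop.

def anchor_observe_infer(video, question, model, max_rounds=4):
    evidence = []  # verified (time span, description) pairs
    for _ in range(max_rounds):
        # Anchor: model proposes candidate time segments for missing evidence.
        anchors = model.propose_anchors(video, question, evidence)
        if not anchors:
            break
        # Observe: re-sample frames densely inside each anchor and verify it.
        for span in anchors:
            clip = video.sample_frames(span, fps=2)
            finding = model.observe_clip(clip, question)
            if finding.is_relevant:
                evidence.append((span, finding.description))
        # Infer: draft an answer; stop once the model judges evidence sufficient.
        draft = model.compose_answer(question, evidence)
        if draft.confident:
            return draft.answer, evidence
    return model.compose_answer(question, evidence).answer, evidence
```

Returning the evidence list alongside the answer is what makes the output verifiable: each answer arrives with the time spans that supposedly support it.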

If this is right

  • Models will produce answers that are accompanied by explicit, verifiable temporal evidence segments.
  • Step-wise grounding quality will rise, making error analysis and debugging of video reasoning easier.
  • The same training and inference pattern can be applied to any long video domain that requires sparse evidence integration.
  • Benchmark scores on SportsTime will become a stricter test of genuine temporal composition rather than pattern matching.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The method could be tested on instructional or surveillance videos to check whether the sports-specific dynamics are necessary for the gains.
  • Replacing the GRPO reward with a simpler temporal supervision signal might reveal how much of the benefit is due to the reinforcement-learning formulation.
  • Integrating the anchor-observe-infer loop into open-ended video chat systems would add built-in evidence citation without extra post-processing.

Load-bearing premise

The performance gains come chiefly from the temporal-reward GRPO and the anchor-observe-infer loop rather than from dataset construction details or baseline implementation choices.

What would settle it

A controlled ablation on SportsTime in which the temporal-reward term is removed from training and the inference loop is replaced by direct answer generation, with all other factors held fixed, showing whether the reported improvements disappear.
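
Spelled out, that control is a small factorial grid over the two components, with the backbone, SportsTime splits, and decoding held fixed. A sketch; the variant names and flags are hypothetical:

```python
# Hypothetical ablation grid matching the proposed control. Only the two
# CoTR components vary; train(...) / evaluate(...) are placeholders.
ABLATIONS = {
    "full_cotr":       {"temporal_reward": True,  "evidence_loop": True},
    "no_temp_reward":  {"temporal_reward": False, "evidence_loop": True},
    "single_pass":     {"temporal_reward": True,  "evidence_loop": False},
    "plain_baseline":  {"temporal_reward": False, "evidence_loop": False},
}

for name, cfg in ABLATIONS.items():
    # Each variant would report final-answer accuracy and step-wise
    # grounding (mIoU, H@tau) on identical SportsTime test splits.
    print(name, cfg)
```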

Figures

Figures reproduced from arXiv: 2604.22226 by Lu Zhang, Ruizhe Zeng, Siyu Cao, Zhi-yong Liu.

Figure 1: Chain-of-Time reasoning enables more reliable and verifiable answers.
Figure 2: Overview of the SportsTime benchmark covering five sports and five reasoning types, with Chain-of-Time examples.
Figure 3: Overview of the expert-guided semi-automatic annotation pipeline.
Figure 4: Statistics of SportsTime. From left to right: video-length distribution, word-length distributions of reasoning chains and answers, and Chain-of-Time statistics.
Figure 5: Overview of the Chain-of-Time Reasoning (CoTR) framework, combining temporally grounded reasoning (Sec. 4.3) with anchor-triggered interactive observation (Sec. 4.4), which iteratively verifies and revises reasoning via anchor-based local clip retrieval.
Figure 6: SGA evaluation. "mIoU" denotes the mean span IoU between predicted and reference time windows; "H@τ" reports the fraction of examples whose span IoU exceeds threshold τ.
Figure 7: Video setting ablation studies. (a) Accuracy as a function of frame budget. (b) Accuracy as a function of video length.
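
Figure 6 names the grounding metrics without formulas; below is a minimal sketch of how span IoU, mIoU, and H@τ are conventionally computed. The one-to-one pairing of predicted and reference windows is an assumption, since the extract does not give the matching protocol.

```python
# Sketch of the span-grounding metrics named in Figure 6.

def span_iou(pred, ref):
    """IoU between two (start, end) time windows (same helper as above)."""
    inter = max(0.0, min(pred[1], ref[1]) - max(pred[0], ref[0]))
    union = (pred[1] - pred[0]) + (ref[1] - ref[0]) - inter
    return inter / union if union > 0 else 0.0

def miou_and_hit_rates(pairs, thresholds=(0.3, 0.5, 0.7)):
    """pairs: list of (predicted_window, reference_window) tuples."""
    ious = [span_iou(p, r) for p, r in pairs]
    miou = sum(ious) / len(ious)
    # H@t: fraction of examples whose span IoU exceeds threshold t.
    hits = {t: sum(i > t for i in ious) / len(ious) for t in thresholds}
    return miou, hits

# Example: two predictions against reference windows (seconds).
miou, hits = miou_and_hit_rates([((10, 20), (12, 22)), ((50, 55), (48, 60))])
```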
read the original abstract

Sports videos are a challenging domain for multimodal understanding because they involve complex and dynamic human activities. Despite rapid progress in Multimodal Large Language Models (MLLMs), long-horizon reasoning in sports videos remains difficult, as answering questions requires both locating temporally sparse evidence and integrating it into reasoning. We attribute this limitation to two closely coupled factors: insufficient supervision over temporally dispersed evidence, and the lack of methods that require models to identify, localize, and justify temporal evidence. To address these gaps, we introduce SportsTime, a large-scale benchmark for long-form sports video understanding, comprising 14K+ open-ended QA pairs and 50K+ step-wise temporal evidence annotations. Building on SportsTime, we propose Chain-of-Time Reasoning (CoTR), which treats reasoning as a process of temporally grounded evidence composition. Specifically, during training, CoTR introduces a temporal-reward GRPO to encourage temporally grounded reasoning. During inference, it employs an anchor-observe-infer evidence-seeking loop to iteratively localize, verify, and compose temporal evidence before producing the final answer. Experiments demonstrate the usefulness of SportsTime as a benchmark and the effectiveness of CoTR, which consistently improves temporal compositional reasoning and step-wise grounding quality over strong MLLM baselines.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces SportsTime, a benchmark of 14K+ open-ended QA pairs and 50K+ step-wise temporal evidence annotations for long-form sports video understanding, and proposes Chain-of-Time Reasoning (CoTR). CoTR applies temporal-reward GRPO during training to encourage grounded reasoning and an anchor-observe-infer iterative loop at inference to localize, verify, and compose temporal evidence. Experiments claim consistent gains in temporal compositional reasoning and step-wise grounding over strong MLLM baselines.

Significance. If the central attribution holds, the work would supply a much-needed large-scale benchmark and a concrete training/inference recipe for improving temporal evidence handling in MLLMs on dynamic, long-horizon video domains. The combination of a new dataset with explicit step-wise annotations and a method that explicitly couples reward shaping with iterative evidence seeking could serve as a template for other video reasoning tasks.

major comments (2)
  1. [Experiments] The headline claim that CoTR (via temporal-reward GRPO and the anchor-observe-infer loop) drives the reported gains is not supported by any ablation that isolates either component while holding the underlying MLLM and SportsTime data fixed. Without such controls, it remains possible that the gains arise from dataset artifacts, prompt engineering, or baseline implementation details rather than the proposed mechanisms.
  2. [§4, Experiments] The paper asserts that the temporal-reward GRPO and the anchor-observe-infer loop are the primary drivers of improved step-wise grounding, yet no quantitative comparison removes the temporal reward term or the iterative loop while measuring grounding quality and final-answer accuracy on the same splits.
minor comments (2)
  1. [Abstract] Quantitative metrics, ablation tables, and statistical significance tests are absent, making it impossible to assess the magnitude or reliability of the claimed improvements from the abstract alone.
  2. [Benchmark] The paper should clarify how the 50K+ step-wise temporal evidence annotations were collected and validated (e.g., inter-annotator agreement, quality control) to establish that SportsTime is a reliable testbed.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed and constructive review. The comments correctly identify a gap in the experimental validation of CoTR's components. We address each point below and will revise the manuscript accordingly.

read point-by-point responses
  1. Referee: [Experiments] The headline claim that CoTR (via temporal-reward GRPO and the anchor-observe-infer loop) drives the reported gains is not supported by any ablation that isolates either component while holding the underlying MLLM and SportsTime data fixed. Without such controls, it remains possible that the gains arise from dataset artifacts, prompt engineering, or baseline implementation details rather than the proposed mechanisms.

    Authors: We agree that the current experiments do not contain ablations that isolate the temporal-reward GRPO and the anchor-observe-infer loop while holding the base MLLM and SportsTime dataset fixed. The reported results compare full CoTR against strong MLLM baselines, but do not disentangle the two proposed mechanisms from other factors. In the revised manuscript we will add the requested controls: we will train and evaluate (1) CoTR with standard GRPO (temporal reward removed), (2) CoTR with single-pass inference (iterative loop removed), and (3) the full CoTR pipeline. All variants will share the identical MLLM backbone and be evaluated on the same SportsTime train/validation/test splits, reporting both final-answer accuracy and step-wise grounding metrics. revision: yes

  2. Referee: [§4, Experiments] The paper asserts that the temporal-reward GRPO and the anchor-observe-infer loop are the primary drivers of improved step-wise grounding, yet no quantitative comparison removes the temporal reward term or the iterative loop while measuring grounding quality and final-answer accuracy on the same splits.

    Authors: We acknowledge the absence of the direct quantitative comparisons requested. The manuscript currently shows overall gains of CoTR over baselines but does not quantify the marginal contribution of each component to grounding quality and accuracy on identical splits. As stated in the response to the first comment, the revision will include exactly these ablations: variants with the temporal reward term ablated and variants with the iterative loop replaced by single-pass inference. Grounding quality (step-wise evidence localization) and final answer accuracy will be reported side-by-side for all variants on the same data partitions. revision: yes

Circularity Check

0 steps flagged

No significant circularity; new benchmark and method evaluated against external baselines.

full rationale

The paper introduces SportsTime as an independent benchmark with 14K+ QA pairs and 50K+ annotations, and proposes CoTR with temporal-reward GRPO training and anchor-observe-infer inference. These are tested empirically against strong external MLLM baselines, with claims of improvement presented as experimental outcomes rather than derivations by construction. No equations, self-definitional loops, fitted inputs renamed as predictions, or load-bearing self-citations appear in the abstract or described structure. The central claims rest on dataset creation and method application, not tautological reduction to inputs.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

This is an applied empirical ML paper; the abstract describes no mathematical axioms, free parameters, or invented entities.

pith-pipeline@v0.9.0 · 5520 in / 1018 out tokens · 39996 ms · 2026-05-08T12:42:42.676065+00:00 · methodology

discussion (0)

