pith. machine review for the scientific record.

arxiv: 2604.22226 · v1 · submitted 2026-04-24 · 💻 cs.CV


Towards Temporal Compositional Reasoning in Long-Form Sports Videos

Lu Zhang, Ruizhe Zeng, Siyu Cao, Zhi-yong Liu


Pith reviewed 2026-05-08 12:42 UTC · model grok-4.3

classification 💻 cs.CV
keywords temporal compositional reasoning · long-form sports video · multimodal large language models · temporal evidence grounding · Chain-of-Time Reasoning · step-wise video QA

The pith

Chain-of-Time Reasoning with temporal-reward training and an evidence-seeking loop lets multimodal models compose answers from sparse time-stamped evidence in long sports videos.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper argues that long-form sports video reasoning fails mainly because models lack explicit supervision for locating dispersed evidence and because standard inference does not force them to identify, verify, and compose that evidence step by step. It supplies SportsTime, a benchmark of more than 14,000 open-ended questions paired with more than 50,000 step-wise temporal evidence annotations, and introduces Chain-of-Time Reasoning, which adds a temporal-reward GRPO objective during training and runs an anchor-observe-infer loop at inference time. A sympathetic reader would care because these changes directly target the two bottlenecks the authors identify, producing measurable gains in both final-answer accuracy and the quality of the intermediate temporal grounding that supports each answer.

Core claim

Treating reasoning as temporally grounded evidence composition, implemented by training with a temporal-reward GRPO objective and running an anchor-observe-infer loop that iteratively localizes, verifies, and composes evidence, consistently raises both temporal compositional reasoning accuracy and step-wise grounding quality above strong multimodal LLM baselines on the SportsTime benchmark.
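
The abstract does not spell out the reward itself; the sketch below shows one plausible shape for a temporal-reward GRPO signal, blending answer correctness with span IoU against reference evidence windows and then normalizing rewards within a rollout group, as GRPO does. All function names and the mixing weight alpha are hypothetical, not taken from the paper.

```python
# Hypothetical sketch of a temporal-reward term for GRPO-style training.
# Assumes each rollout emits an answer plus predicted evidence spans; the
# paper's actual reward definition is not given in the extract.

def span_iou(pred, ref):
    """IoU between two (start, end) time windows in seconds."""
    inter = max(0.0, min(pred[1], ref[1]) - max(pred[0], ref[0]))
    union = (pred[1] - pred[0]) + (ref[1] - ref[0]) - inter
    return inter / union if union > 0 else 0.0

def temporal_reward(answer_correct, pred_spans, ref_spans, alpha=0.5):
    """Blend answer accuracy with mean best-match IoU over reference spans."""
    if not ref_spans:
        return float(answer_correct)
    grounding = sum(
        max((span_iou(p, r) for p in pred_spans), default=0.0)
        for r in ref_spans
    ) / len(ref_spans)
    return (1 - alpha) * float(answer_correct) + alpha * grounding

def group_advantages(rewards, eps=1e-6):
    """GRPO normalizes rewards within a rollout group to get advantages."""
    mu = sum(rewards) / len(rewards)
    std = (sum((r - mu) ** 2 for r in rewards) / len(rewards)) ** 0.5
    return [(r - mu) / (std + eps) for r in rewards]
```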

What carries the argument

The anchor-observe-infer evidence-seeking loop, which forces the model to localize candidate time segments, verify their relevance, and compose them before emitting a final answer.
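
The paper names the loop's three phases, but the extract gives no control flow; here is a minimal sketch under assumed interfaces. The calls propose_anchors, observe_clip, and compose_answer, and the confidence-based stopping rule, are invented stand-ins for MLLM prompts, not the authors' API.

```python
# Hypothetical control flow for an anchor-observe-infer evidence-seeking loop.

def anchor_observe_infer(video, question, model, max_rounds=4):
    evidence = []  # verified (time span, description) pairs
    for _ in range(max_rounds):
        # Anchor: model proposes candidate time segments for missing evidence.
        anchors = model.propose_anchors(video, question, evidence)
        if not anchors:
            break
        # Observe: re-sample frames densely inside each anchor and verify it.
        for span in anchors:
            clip = video.sample_frames(span, fps=2)
            finding = model.observe_clip(clip, question)
            if finding.is_relevant:
                evidence.append((span, finding.description))
        # Infer: draft an answer; stop once the model judges evidence sufficient.
        draft = model.compose_answer(question, evidence)
        if draft.confident:
            return draft.answer, evidence
    return model.compose_answer(question, evidence).answer, evidence
```

Returning the evidence list alongside the answer is what makes the output verifiable: each answer arrives with the time spans that supposedly support it.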

If this is right

  • Models will produce answers that are accompanied by explicit, verifiable temporal evidence segments.
  • Step-wise grounding quality will rise, making error analysis and debugging of video reasoning easier.
  • The same training and inference pattern can be applied to any long video domain that requires sparse evidence integration.
  • Benchmark scores on SportsTime will become a stricter test of genuine temporal composition rather than pattern matching.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The method could be tested on instructional or surveillance videos to check whether the sports-specific dynamics are necessary for the gains.
  • Replacing the GRPO reward with a simpler temporal supervision signal might reveal how much of the benefit is due to the reinforcement-learning formulation.
  • Integrating the anchor-observe-infer loop into open-ended video chat systems would add built-in evidence citation without extra post-processing.

Load-bearing premise

The performance gains come chiefly from the temporal-reward GRPO and the anchor-observe-infer loop rather than from dataset construction details or baseline implementation choices.

What would settle it

A controlled ablation on SportsTime in which the temporal-reward term is removed from training and the inference loop is replaced by direct answer generation, with all other factors held fixed, showing whether the reported improvements disappear.
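
Spelled out, that control is a small factorial grid over the two components, with the backbone, SportsTime splits, and decoding held fixed. A sketch; the variant names and flags are hypothetical:

```python
# Hypothetical ablation grid matching the proposed control. Only the two
# CoTR components vary; train(...) / evaluate(...) are placeholders.
ABLATIONS = {
    "full_cotr":       {"temporal_reward": True,  "evidence_loop": True},
    "no_temp_reward":  {"temporal_reward": False, "evidence_loop": True},
    "single_pass":     {"temporal_reward": True,  "evidence_loop": False},
    "plain_baseline":  {"temporal_reward": False, "evidence_loop": False},
}

for name, cfg in ABLATIONS.items():
    # Each variant would report final-answer accuracy and step-wise
    # grounding (mIoU, H@tau) on identical SportsTime test splits.
    print(name, cfg)
```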

Figures

Figures reproduced from arXiv: 2604.22226 by Lu Zhang, Ruizhe Zeng, Siyu Cao, Zhi-yong Liu.

Figure 1: Chain-of-Time reasoning enables more reliable and verifiable answers.
Figure 2: Overview of the SportsTime benchmark covering five sports and five reasoning types, with Chain-of-Time examples.
Figure 3: Overview of the expert-guided semi-automatic annotation pipeline.
Figure 4: Statistics of SportsTime. From left to right: video-length distribution, word-length distributions of reasoning chains and answers, and Chain-of-Time statistics.
Figure 5: Overview of the Chain-of-Time Reasoning (CoTR) framework, combining temporally grounded reasoning (Sec. 4.3) with anchor-triggered interactive observation (Sec. 4.4), which iteratively verifies and revises reasoning via anchor-based local clip retrieval.
Figure 6: SGA evaluation. "mIoU" denotes the mean span IoU between predicted and reference time windows; "H@τ" reports the fraction of examples whose span IoU exceeds threshold τ.
Figure 7: Video setting ablation studies. (a) Accuracy as a function of frame budget. (b) Accuracy as a function of video length.
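
Figure 6 names the grounding metrics without formulas; below is a minimal sketch of how span IoU, mIoU, and H@τ are conventionally computed. The one-to-one pairing of predicted and reference windows is an assumption, since the extract does not give the matching protocol.

```python
# Sketch of the span-grounding metrics named in Figure 6.

def span_iou(pred, ref):
    """IoU between two (start, end) time windows (same helper as above)."""
    inter = max(0.0, min(pred[1], ref[1]) - max(pred[0], ref[0]))
    union = (pred[1] - pred[0]) + (ref[1] - ref[0]) - inter
    return inter / union if union > 0 else 0.0

def miou_and_hit_rates(pairs, thresholds=(0.3, 0.5, 0.7)):
    """pairs: list of (predicted_window, reference_window) tuples."""
    ious = [span_iou(p, r) for p, r in pairs]
    miou = sum(ious) / len(ious)
    # H@t: fraction of examples whose span IoU exceeds threshold t.
    hits = {t: sum(i > t for i in ious) / len(ious) for t in thresholds}
    return miou, hits

# Example: two predictions against reference windows (seconds).
miou, hits = miou_and_hit_rates([((10, 20), (12, 22)), ((50, 55), (48, 60))])
```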
read the original abstract

Sports videos are a challenging domain for multimodal understanding because they involve complex and dynamic human activities. Despite rapid progress in Multimodal Large Language Models (MLLMs), long-horizon reasoning in sports videos remains difficult, as answering questions requires both locating temporally sparse evidence and integrating it into reasoning. We attribute this limitation to two closely coupled factors: insufficient supervision over temporally dispersed evidence, and the lack of methods that require models to identify, localize, and justify temporal evidence. To address these gaps, we introduce SportsTime, a large-scale benchmark for long-form sports video understanding, comprising 14K+ open-ended QA pairs and 50K+ step-wise temporal evidence annotations. Building on SportsTime, we propose Chain-of-Time Reasoning (CoTR), which treats reasoning as a process of temporally grounded evidence composition. Specifically, during training, CoTR introduces a temporal-reward GRPO to encourage temporally grounded reasoning. During inference, it employs an anchor-observe-infer evidence-seeking loop to iteratively localize, verify, and compose temporal evidence before producing the final answer. Experiments demonstrate the usefulness of SportsTime as a benchmark and the effectiveness of CoTR, which consistently improves temporal compositional reasoning and step-wise grounding quality over strong MLLM baselines.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces SportsTime, a benchmark of 14K+ open-ended QA pairs and 50K+ step-wise temporal evidence annotations for long-form sports video understanding, and proposes Chain-of-Time Reasoning (CoTR). CoTR applies temporal-reward GRPO during training to encourage grounded reasoning and an anchor-observe-infer iterative loop at inference to localize, verify, and compose temporal evidence. Experiments claim consistent gains in temporal compositional reasoning and step-wise grounding over strong MLLM baselines.

Significance. If the central attribution holds, the work would supply a much-needed large-scale benchmark and a concrete training/inference recipe for improving temporal evidence handling in MLLMs on dynamic, long-horizon video domains. The combination of a new dataset with explicit step-wise annotations and a method that explicitly couples reward shaping with iterative evidence seeking could serve as a template for other video reasoning tasks.

major comments (2)
  1. [Experiments] The headline claim that CoTR (via temporal-reward GRPO and the anchor-observe-infer loop) drives the reported gains is not supported by any ablation that isolates either component while holding the underlying MLLM and SportsTime data fixed. Without such controls, it remains possible that the gains arise from dataset artifacts, prompt engineering, or baseline implementation details rather than the proposed mechanisms.
  2. [§4, Experiments] The paper asserts that the temporal-reward GRPO and the anchor-observe-infer loop are the primary drivers of improved step-wise grounding, yet no quantitative comparison removes the temporal reward term or the iterative loop while measuring grounding quality and final-answer accuracy on the same splits.
minor comments (2)
  1. [Abstract] Quantitative metrics, ablation tables, and statistical significance tests are absent, making it impossible to assess the magnitude or reliability of the claimed improvements from the abstract alone.
  2. [Benchmark] The paper should clarify how the 50K+ step-wise temporal evidence annotations were collected and validated (e.g., inter-annotator agreement, quality control) to establish that SportsTime is a reliable testbed.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed and constructive review. The comments correctly identify a gap in the experimental validation of CoTR's components. We address each point below and will revise the manuscript accordingly.

read point-by-point responses
  1. Referee: [Experiments] The headline claim that CoTR (via temporal-reward GRPO and the anchor-observe-infer loop) drives the reported gains is not supported by any ablation that isolates either component while holding the underlying MLLM and SportsTime data fixed. Without such controls, it remains possible that the gains arise from dataset artifacts, prompt engineering, or baseline implementation details rather than the proposed mechanisms.

    Authors: We agree that the current experiments do not contain ablations that isolate the temporal-reward GRPO and the anchor-observe-infer loop while holding the base MLLM and SportsTime dataset fixed. The reported results compare full CoTR against strong MLLM baselines, but do not disentangle the two proposed mechanisms from other factors. In the revised manuscript we will add the requested controls: we will train and evaluate (1) CoTR with standard GRPO (temporal reward removed), (2) CoTR with single-pass inference (iterative loop removed), and (3) the full CoTR pipeline. All variants will share the identical MLLM backbone and be evaluated on the same SportsTime train/validation/test splits, reporting both final-answer accuracy and step-wise grounding metrics. revision: yes

  2. Referee: [§4, Experiments] The paper asserts that the temporal-reward GRPO and the anchor-observe-infer loop are the primary drivers of improved step-wise grounding, yet no quantitative comparison removes the temporal reward term or the iterative loop while measuring grounding quality and final-answer accuracy on the same splits.

    Authors: We acknowledge the absence of the direct quantitative comparisons requested. The manuscript currently shows overall gains of CoTR over baselines but does not quantify the marginal contribution of each component to grounding quality and accuracy on identical splits. As stated in the response to the first comment, the revision will include exactly these ablations: variants with the temporal reward term ablated and variants with the iterative loop replaced by single-pass inference. Grounding quality (step-wise evidence localization) and final answer accuracy will be reported side-by-side for all variants on the same data partitions. revision: yes

Circularity Check

0 steps flagged

No significant circularity; new benchmark and method evaluated against external baselines.

full rationale

The paper introduces SportsTime as an independent benchmark with 14K+ QA pairs and 50K+ annotations, and proposes CoTR with temporal-reward GRPO training and anchor-observe-infer inference. These are tested empirically against strong external MLLM baselines, with claims of improvement presented as experimental outcomes rather than derivations by construction. No equations, self-definitional loops, fitted inputs renamed as predictions, or load-bearing self-citations appear in the abstract or described structure. The central claims rest on dataset creation and method application, not tautological reduction to inputs.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

This is an applied empirical ML paper; the abstract describes no mathematical axioms, free parameters, or invented entities.

pith-pipeline@v0.9.0 · 5520 in / 1018 out tokens · 39996 ms · 2026-05-08T12:42:42.676065+00:00 · methodology

discussion (0)

