Recognition: unknown
Towards Temporal Compositional Reasoning in Long-Form Sports Videos
Pith reviewed 2026-05-08 12:42 UTC · model grok-4.3
The pith
Chain-of-Time Reasoning with temporal-reward training and an evidence-seeking loop lets multimodal models compose answers from sparse time-stamped evidence in long sports videos.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Treating reasoning as temporally grounded evidence composition, implemented by training with a temporal-reward GRPO objective and running an anchor-observe-infer loop that iteratively localizes, verifies, and composes evidence, consistently raises both temporal compositional reasoning accuracy and step-wise grounding quality above strong multimodal LLM baselines on the SportsTime benchmark.
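The abstract names a temporal-reward GRPO but does not spell out the reward. As a minimal sketch of one plausible reading, assuming the reward mixes answer correctness with how well a rollout's cited segments overlap the annotated evidence (the mixing weight `lam` and the exact composition are our assumptions, not the authors' definitions):

```python
import statistics

def temporal_iou(pred, gold):
    """Intersection-over-union of two (start_sec, end_sec) intervals."""
    inter = max(0.0, min(pred[1], gold[1]) - max(pred[0], gold[0]))
    union = max(pred[1], gold[1]) - min(pred[0], gold[0])
    return inter / union if union > 0 else 0.0

def group_relative_advantages(rollouts, gold_segments, lam=0.5):
    """Score a group of sampled rollouts for one question, GRPO-style:
    reward = answer correctness + lam * mean best-IoU of the cited segments,
    then normalize within the group (subtract mean, divide by std).
    Assumes at least one gold evidence segment per question."""
    rewards = []
    for r in rollouts:  # each rollout: {"correct": bool, "segments": [(s, e), ...]}
        ious = [max(temporal_iou(p, g) for g in gold_segments)
                for p in r["segments"]] or [0.0]
        rewards.append(float(r["correct"]) + lam * sum(ious) / len(ious))
    mean = statistics.mean(rewards)
    std = statistics.pstdev(rewards) or 1.0
    return [(rw - mean) / std for rw in rewards]
```

In a GRPO setup these normalized scores would play the role of the per-rollout advantages in the clipped policy update; whether the paper uses IoU, a coverage-style term, or something else entirely is not stated in the abstract.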
What carries the argument
The anchor-observe-infer evidence-seeking loop, which forces the model to localize candidate time segments, verify their relevance, and compose them before emitting a final answer.
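The abstract names the loop but not its control flow. A minimal sketch of one way such a loop could be wired, where every `model.*` helper (`propose_segment`, `describe`, `is_relevant`, `has_enough_evidence`, `compose_answer`) is a hypothetical placeholder rather than an interface the paper defines:

```python
def answer_with_evidence(video, question, model, max_rounds=4):
    """Hypothetical anchor-observe-infer loop: gather verified, time-stamped
    evidence before answering. All model methods are placeholders."""
    evidence = []  # verified (start_sec, end_sec, observation) triples
    for _ in range(max_rounds):
        # Anchor: propose a candidate time segment worth inspecting.
        start, end = model.propose_segment(video, question, evidence)
        # Observe: describe what actually happens in that window.
        observation = model.describe(video.clip(start, end), question)
        # Infer: keep the segment only if the observation bears on the question.
        if model.is_relevant(observation, question):
            evidence.append((start, end, observation))
        # Stop once the accumulated evidence is judged sufficient.
        if model.has_enough_evidence(question, evidence):
            break
    # Compose the final answer from the verified, time-stamped evidence.
    return model.compose_answer(question, evidence), evidence
```

Returning the evidence list alongside the answer is what would make the step-wise grounding auditable, which is the property the predictions below depend on.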
If this is right
- Models will produce answers that are accompanied by explicit, verifiable temporal evidence segments.
- Step-wise grounding quality will rise, making error analysis and debugging of video reasoning easier.
- The same training and inference pattern can be applied to any long video domain that requires sparse evidence integration.
- Benchmark scores on SportsTime will become a stricter test of genuine temporal composition rather than pattern matching.
Where Pith is reading between the lines
- The method could be tested on instructional or surveillance videos to check whether the sports-specific dynamics are necessary for the gains.
- Replacing the GRPO reward with a simpler temporal supervision signal might reveal how much of the benefit is due to the reinforcement-learning formulation.
- Integrating the anchor-observe-infer loop into open-ended video chat systems would add built-in evidence citation without extra post-processing.
Load-bearing premise
The performance gains come chiefly from the temporal-reward GRPO and the anchor-observe-infer loop rather than from dataset construction details or baseline implementation choices.
What would settle it
A controlled ablation on SportsTime in which the temporal-reward term is removed from training and the inference loop is replaced by direct answer generation, with all other factors held fixed, showing whether the reported improvements disappear.
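Concretely, the settling experiment amounts to a two-factor ablation grid. A sketch of the arms it implies, with arm names and flags ours rather than the paper's:

```python
# Illustrative ablation grid; every arm shares the same MLLM backbone, the same
# SportsTime splits, and the same evaluation protocol. Only the two CoTR
# components are toggled, and both metrics are reported for every arm.
ABLATION_ARMS = [
    {"arm": "full CoTR",                    "temporal_reward": True,  "evidence_loop": True},
    {"arm": "GRPO without temporal reward", "temporal_reward": False, "evidence_loop": True},
    {"arm": "single-pass inference",        "temporal_reward": True,  "evidence_loop": False},
    {"arm": "direct answer generation",     "temporal_reward": False, "evidence_loop": False},
]
METRICS = ["final_answer_accuracy", "stepwise_grounding_quality"]
```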
Original abstract
Sports videos are a challenging domain for multimodal understanding because they involve complex and dynamic human activities. Despite rapid progress in Multimodal Large Language Models (MLLMs), long-horizon reasoning in sports videos remains difficult, as answering questions requires both locating temporally sparse evidence and integrating it into reasoning. We attribute this limitation to two closely coupled factors: insufficient supervision over temporally dispersed evidence, and the lack of methods that require models to identify, localize, and justify temporal evidence. To address these gaps, we introduce SportsTime, a large-scale benchmark for long-form sports video understanding, comprising 14K+ open-ended QA pairs and 50K+ step-wise temporal evidence annotations. Building on SportsTime, we propose Chain-of-Time Reasoning (CoTR), which treats reasoning as a process of temporally grounded evidence composition. Specifically, during training, CoTR introduces a temporal-reward GRPO to encourage temporally grounded reasoning. During inference, it employs an anchor-observe-infer evidence-seeking loop to iteratively localize, verify, and compose temporal evidence before producing the final answer. Experiments demonstrate the usefulness of SportsTime as a benchmark and the effectiveness of CoTR, which consistently improves temporal compositional reasoning and step-wise grounding quality over strong MLLM baselines.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces SportsTime, a benchmark of 14K+ open-ended QA pairs and 50K+ step-wise temporal evidence annotations for long-form sports video understanding, and proposes Chain-of-Time Reasoning (CoTR). CoTR applies temporal-reward GRPO during training to encourage grounded reasoning and an anchor-observe-infer iterative loop at inference to localize, verify, and compose temporal evidence. Experiments claim consistent gains in temporal compositional reasoning and step-wise grounding over strong MLLM baselines.
Significance. If the central attribution holds, the work would supply a much-needed large-scale benchmark and a concrete training/inference recipe for improving temporal evidence handling in MLLMs on dynamic, long-horizon video domains. The combination of a new dataset with explicit step-wise annotations and a method that explicitly couples reward shaping with iterative evidence seeking could serve as a template for other video reasoning tasks.
major comments (2)
- [Experiments] Experiments section: the headline claim that CoTR (via temporal-reward GRPO and the anchor-observe-infer loop) drives the reported gains is not supported by any ablation that isolates either component while holding the underlying MLLM and SportsTime data fixed. Without such controls, it remains possible that gains arise from dataset artifacts, prompt engineering, or baseline implementation details rather than the proposed mechanisms.
- [§4 and Experiments] §4 (CoTR description) and Experiments: the paper asserts that the temporal-reward GRPO and anchor-observe-infer loop are the primary drivers of improved step-wise grounding, yet no quantitative comparison is provided that removes the temporal reward term or the iterative loop while measuring grounding quality and final answer accuracy on the same splits.
minor comments (2)
- [Abstract] Abstract: quantitative metrics, ablation tables, and statistical significance tests are absent, making it impossible to assess the magnitude or reliability of the claimed improvements from the abstract alone.
- [Benchmark] Benchmark construction: the paper should clarify how the 50K+ step-wise temporal evidence annotations were collected and validated (e.g., inter-annotator agreement, quality control) to establish that SportsTime is a reliable testbed; a generic sketch of such an agreement check follows this list.
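A generic sketch of the kind of agreement check the last comment asks for, assuming evidence spans are (start, end) pairs in seconds and using a temporal-IoU threshold; this illustrates standard practice, not how SportsTime was actually validated:

```python
def interval_iou(a, b):
    """IoU of two (start_sec, end_sec) spans."""
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = max(a[1], b[1]) - min(a[0], b[0])
    return inter / union if union > 0 else 0.0

def pairwise_agreement(spans_a, spans_b, threshold=0.5):
    """Greedily match annotator A's spans against annotator B's; return the
    fraction of A's spans matched at IoU >= threshold."""
    remaining = list(spans_b)
    matched = 0
    for span in spans_a:
        best = max(remaining, key=lambda s: interval_iou(span, s), default=None)
        if best is not None and interval_iou(span, best) >= threshold:
            matched += 1
            remaining.remove(best)
    return matched / len(spans_a) if spans_a else 0.0

# Example: only the first of A's two spans overlaps B's annotation well enough.
# pairwise_agreement([(12.0, 18.5), (40.0, 44.0)], [(11.5, 19.0), (70.0, 75.0)]) -> 0.5
```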
Simulated Author's Rebuttal
We thank the referee for the detailed and constructive review. The comments correctly identify a gap in the experimental validation of CoTR's components. We address each point below and will revise the manuscript accordingly.
Point-by-point responses
- Referee: [Experiments] Experiments section: the headline claim that CoTR (via temporal-reward GRPO and the anchor-observe-infer loop) drives the reported gains is not supported by any ablation that isolates either component while holding the underlying MLLM and SportsTime data fixed. Without such controls, it remains possible that gains arise from dataset artifacts, prompt engineering, or baseline implementation details rather than the proposed mechanisms.
Authors: We agree that the current experiments do not contain ablations that isolate the temporal-reward GRPO and the anchor-observe-infer loop while holding the base MLLM and SportsTime dataset fixed. The reported results compare full CoTR against strong MLLM baselines, but do not disentangle the two proposed mechanisms from other factors. In the revised manuscript we will add the requested controls: we will train and evaluate (1) CoTR with standard GRPO (temporal reward removed), (2) CoTR with single-pass inference (iterative loop removed), and (3) the full CoTR pipeline. All variants will share the identical MLLM backbone and be evaluated on the same SportsTime train/validation/test splits, reporting both final-answer accuracy and step-wise grounding metrics. revision: yes
- Referee: [§4 and Experiments] §4 (CoTR description) and Experiments: the paper asserts that the temporal-reward GRPO and anchor-observe-infer loop are the primary drivers of improved step-wise grounding, yet no quantitative comparison is provided that removes the temporal reward term or the iterative loop while measuring grounding quality and final answer accuracy on the same splits.
Authors: We acknowledge the absence of the direct quantitative comparisons requested. The manuscript currently shows overall gains of CoTR over baselines but does not quantify the marginal contribution of each component to grounding quality and accuracy on identical splits. As stated in the response to the first comment, the revision will include exactly these ablations: variants with the temporal reward term ablated and variants with the iterative loop replaced by single-pass inference. Grounding quality (step-wise evidence localization) and final answer accuracy will be reported side-by-side for all variants on the same data partitions. revision: yes
Circularity Check
No significant circularity; new benchmark and method evaluated against external baselines.
full rationale
The paper introduces SportsTime as an independent benchmark with 14K+ QA pairs and 50K+ annotations, and proposes CoTR with temporal-reward GRPO training and anchor-observe-infer inference. These are tested empirically against strong external MLLM baselines, with claims of improvement presented as experimental outcomes rather than derivations by construction. No equations, self-definitional loops, fitted inputs renamed as predictions, or load-bearing self-citations appear in the abstract or described structure. The central claims rest on dataset creation and method application, not tautological reduction to inputs.
Axiom & Free-Parameter Ledger
Reference graph
Works this paper leans on
[1] An, X., Xie, Y., Yang, K., Zhang, W., Zhao, X., et al.: LLaVA-OneVision-1.5: Fully open framework for democratized multimodal training (Dec 2025). https://doi.org/10.48550/arXiv.2509.23661
[2] Bai, S., Cai, Y., Chen, R., Chen, K., Chen, X., et al.: Qwen3-VL technical report (2025). https://doi.org/10.48550/arXiv.2511.21631
[3] Chen, G., Liu, Y., Huang, Y., He, Y., Pei, B., et al.: CG-Bench: Clue-grounded question answering benchmark for long video understanding (Dec 2024). https://doi.org/10.48550/arXiv.2412.12075
[4] Chen, Z., Wang, W., Cao, Y., Liu, Y., Gao, Z., et al.: Expanding Performance Boundaries of Open-Source Multimodal Models with Model, Data, and Test-Time Scaling (Sep 2025). https://doi.org/10.48550/arXiv.2412.05271
[5] Clark, C., Zhang, J., Ma, Z., Park, J.S., Salehi, M., et al.: Molmo2: Open Weights and Data for Vision-Language Models with Video Understanding and Grounding (Jan 2026). https://doi.org/10.48550/arXiv.2601.10611
[6] Cui, Y., Zeng, C., Zhao, X., Yang, Y., Wu, G., Wang, L.: SportsMOT: A large multi-object tracking dataset in multiple sports scenes. In: Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV). pp. 9921–9931 (October 2023)
[7] Deliege, A., Cioppa, A., Giancola, S., Seikavandi, M.J., Dueholm, J.V., Nasrollahi, K., Ghanem, B., Moeslund, T.B., Van Droogenbroeck, M.: SoccerNet-v2: A dataset and benchmarks for holistic understanding of broadcast soccer videos. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops. pp. 4508–4519 (June 2021)
[8] Feng, K., Gong, K., Li, B., Guo, Z., Wang, Y., et al.: Video-R1: Reinforcing Video Reasoning in MLLMs (Oct 2025). https://doi.org/10.48550/arXiv.2503.21776
[9] Fu, C., Dai, Y., Luo, Y., Li, L., Ren, S., Zhang, R., Wang, Z., Zhou, C., Shen, Y., Zhang, M., Chen, P., Li, Y., Lin, S., Zhao, S., Li, K., Xu, T., Zheng, X., Chen, E., Shan, C., He, R., Sun, X.: Video-MME: The first-ever comprehensive evaluation benchmark of multi-modal LLMs in video analysis. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (2025)
[10] Ghasemzadeh, S.A., Zandycke, G.V., Istasse, M., Sayez, N., Moshtaghpour, A., et al.: DeepSportLab: A Unified Framework for Ball Detection, Player Instance Segmentation and Pose Estimation in Team Sports Scenes (Dec 2021). https://doi.org/10.48550/arXiv.2112.00627, arXiv:2112.00627 [cs]
[11] Guo, Y., Liu, J., Li, M., Cheng, D., Tang, X., Sui, D., Liu, Q., Chen, X., Zhao, K.: VTG-LLM: Integrating timestamp knowledge into video LLMs for enhanced video temporal grounding. Proceedings of the AAAI Conference on Artificial Intelligence 39(3), 3302–3310 (Apr 2025). https://doi.org/10.1609/aaai.v39i3.32341, https://ojs.aaai.org/index.php/AAAI/articl...
[12] Gupta, A., Roy, A., Chellappa, R., Bastian, N.D., Velasquez, A., et al.: TOGA: Temporally grounded open-ended video QA with weak supervision (Jun 2025). https://doi.org/10.48550/arXiv.2506.09445
[13] Kojima, T., Gu, S.S., Reid, M., Matsuo, Y., Iwasawa, Y.: Large language models are zero-shot reasoners. In: Koyejo, S., Mohamed, S., Agarwal, A., Belgrave, D., Cho, K., Oh, A. (eds.) Advances in Neural Information Processing Systems, vol. 35, pp. 22199–22213. Curran Associates, Inc. (2022). https://proceedings.neurips.cc/paper_files/paper/2022/file/8bb0d291a...
[14] Lee, K., Kim, E., Choi, J., Chang, B.: NOAH: Benchmarking Narrative Prior driven Hallucination and Omission in Video Large Language Models (Nov 2025). https://doi.org/10.48550/arXiv.2511.06475, arXiv:2511.06475 [cs]
[15] Li, C., Im, E.W., Fazli, P.: VidHalluc: Evaluating temporal hallucinations in multimodal large language models for video understanding. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). pp. 13723–13733 (June 2025)
[16] Li, H., Deng, A., Ke, Q., Liu, J., Rahmani, H., et al.: Sports-QA: A large-scale video question answering benchmark for complex and professional sports (Jan 2024). https://doi.org/10.48550/arXiv.2401.01505
[17] Li, R., Wang, X., Zhang, Y., Zohar, O., Wang, Z., et al.: Temporal Preference Optimization for Long-Form Video Understanding (Sep 2025). https://doi.org/10.48550/arXiv.2501.13919
[18] Li, Y., Xiao, J., Feng, C., Wang, X., Chua, T.S.: Discovering spatio-temporal rationales for video question answering. In: 2023 IEEE/CVF International Conference on Computer Vision (ICCV). pp. 13823–13832. IEEE, Paris, France (Oct 2023). https://doi.org/10.1109/ICCV51070.2023.01275
[19] Liu, H., Ma, X., Zhong, C., Zhang, Y., Lin, W.: TimeCraft: Navigate weakly-supervised temporal grounded video question answering via bi-directional reasoning. In: Leonardis, A., Ricci, E., Roth, S., Russakovsky, O., Sattler, T., Varol, G. (eds.) Computer Vision – ECCV 2024, vol. 15063. Springer Nature Switzerland, Cham (2025). https://doi.org/10.1007/97...
[20] Lu, H., Wang, J., Zhang, Y., Wang, R., et al.: ELV-Halluc: Benchmarking Semantic Aggregation Hallucinations in Long Video Understanding (Sep 2025). https://doi.org/10.48550/arXiv.2508.21496, arXiv:2508.21496 [cs]
[21] Lu, S., Li, Y., Xia, Y., Hu, Y., Zhao, S., et al.: Ovis2.5 technical report (Aug 2025). https://doi.org/10.48550/arXiv.2508.11737
[22] Nagrani, A., Menon, S., Iscen, A., Buch, S., Mehran, R., Jha, N., Hauth, A., Zhu, Y., Vondrick, C., Sirotenko, M., Schmid, C., Weyand, T.: Minerva: Evaluating complex video reasoning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV). pp. 23968–23978 (October 2025)
[23] Rao, J., Wu, H., Jiang, H., Zhang, Y., Wang, Y., Xie, W.: Towards universal soccer video understanding. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). pp. 8384–8394 (June 2025)
[24] Rao, J., Wu, H., Liu, C., Wang, Y., Xie, W.: MatchTime: Towards automatic soccer game commentary generation. In: Al-Onaizan, Y., Bansal, M., Chen, Y.N. (eds.) Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing. pp. 1671–1685. Association for Computat...
[25] Rawal, R., Shirkavand, R., Huang, H., Somepalli, G., Goldstein, T.: ARGUS: Hallucination and Omission Evaluation in Video-LLMs (Jun 2025). https://doi.org/10.48550/arXiv.2506.07371, arXiv:2506.07371 [cs] version: 1
[26] Ren, S., Chen, S., Li, S., Sun, X., Hou, L.: TESTA: Temporal-spatial token aggregation for long-form video-language understanding. In: Bouamor, H., Pino, J., Bali, K. (eds.) Findings of the Association for Computational Linguistics: EMNLP 2023. pp. 932–947. Association for Computational Linguistics, Singapore (Dec 2023). https://doi.org/10.18653/v1/2023...
[27] Shao, Z., Wang, P., Zhu, Q., Xu, R., Song, J., et al.: DeepSeekMath: Pushing the limits of mathematical reasoning in open language models (Apr 2024). https://doi.org/10.48550/arXiv.2402.03300
[28] Sugandhika, C., Li, C., Rajan, D., Fernando, B.: Know-Show: Benchmarking video-language models on spatio-temporal grounded reasoning (Dec 2025). https://doi.org/10.48550/arXiv.2512.05513
[29] Team, G., Zeng, A., Lv, X., Zheng, Q., Hou, Z., et al.: GLM-4.5: Agentic, Reasoning, and Coding (ARC) foundation models (Aug 2025). https://doi.org/10.48550/arXiv.2508.06471
[30] Team, V., Hong, W., Yu, W., Gu, X., Wang, G., Gan, G., et al.: GLM-4.5V and GLM-4.1V-Thinking: Towards versatile multimodal reasoning with scalable reinforcement learning (2025). https://arxiv.org/abs/2507.01006
[31] Wang, H., Xu, Z., Cheng, Y., Diao, S., Zhou, Y., et al.: Grounded-VideoLLM: Sharpening fine-grained temporal grounding in video large language models (Aug 2025). https://doi.org/10.48550/arXiv.2410.03290
[32] Wang, S., Chen, G., Huang, D.A., Li, Z., Li, M., et al.: VideoITG: Multimodal video understanding with instructed temporal grounding (Jul 2025). https://doi.org/10.48550/arXiv.2507.13353
[33] Wang, W., He, Z., Hong, W., Cheng, Y., Zhang, X., et al.: LVBench: An extreme long video understanding benchmark. In: Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV). pp. 22958–22967 (October 2025)
[34] Wang, Y., Wang, Z., Xu, B., Du, Y., Lin, K., Xiao, Z., Yue, Z., Ju, J., Zhang, L., Yang, D., et al.: Time-R1: Post-training large vision language model for temporal video grounding. arXiv preprint arXiv:2503.13377 (2025)
[35] Wang, Y., Wang, Y., Zhao, D., Xie, C., Zheng, Z.: VideoHallucer: Evaluating intrinsic and extrinsic hallucinations in large video-language models (Jun 2024). https://doi.org/10.48550/arXiv.2406.16338
[36] Wei, H., Sun, Y., Li, Y.: DeepSeek-OCR: Contexts optical compression (Oct 2025). https://doi.org/10.48550/arXiv.2510.18234
[37] Wu, H., Li, D., Chen, B., Li, J.: LongVideoBench: A benchmark for long-context interleaved video-language understanding. Advances in Neural Information Processing Systems 37, 28828–28857 (2024)
[38] Wu, Y., Hu, X., Sun, Y., Zhou, Y., Zhu, W., et al.: Number it: Temporal grounding videos like flipping manga. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). pp. 13754–13765 (June 2025)
[39] Xia, H., Ge, H., Zou, J., Choi, H.W., Zhang, X., et al.: SportR: A benchmark for multimodal large language model reasoning in sports (Nov 2025). https://doi.org/10.48550/arXiv.2511.06499
[40] Xia, H., Yang, Z., Wang, Y., Tracy, R., Zhao, Y., et al.: SportQA: A benchmark for sports understanding in large language models (Jun 2024). https://doi.org/10.48550/arXiv.2402.15862
[41] Xia, H., Yang, Z., Zou, J., Tracy, R., Wang, Y., et al.: SPORTU: A comprehensive sports understanding benchmark for multimodal large language models (Mar 2025). https://doi.org/10.48550/arXiv.2410.08474
[42] Xu, J., Zhao, G., Yin, S., Zhou, W., Peng, Y.: FineSports: A multi-person hierarchical sports video dataset for fine-grained action understanding. In: 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). pp. 21773–21782. IEEE, Seattle, WA, USA (Jun 2024). https://doi.org/10.1109/CVPR52733.2024.02057
[43] Xu, M., Gao, M., Li, S., Lu, J., Gan, Z., et al.: SlowFast-LLaVA-1.5: A family of token-efficient video large language models for long-form video understanding (Mar 2025). https://doi.org/10.48550/arXiv.2503.18943
[44] Yang, H., Rao, J., Wu, H., Xie, W.: SoccerMaster: A vision foundation model for soccer understanding (Dec 2025). https://doi.org/10.48550/arXiv.2512.11016
[45] Yang, Z., Wang, S., Zhang, K., Wu, K., Leng, S., et al.: LongVT: Incentivizing "thinking with long videos" via native tool calling (Nov 2025). https://doi.org/10.48550/arXiv.2511.20785
[46]
[47] Yao, Y., Yu, T., Zhang, A., Wang, C., Cui, J., et al.: MiniCPM-V: A GPT-4V level MLLM on your phone. arXiv preprint arXiv:2408.01800 (2024)
[48] Ye, J., Wang, Z., Sun, H., Chandrasegaran, K., Durante, Z., et al.: Re-thinking temporal search for long-form video understanding. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). pp. 8579–8591 (June 2025)
[49] Yu, J., Wu, Y., Chu, M., Ren, Z., Huang, Z., et al.: VRBench: A Benchmark for Multi-Step Reasoning in Long Narrative Videos (Aug 2025). https://doi.org/10.48550/arXiv.2506.10857
[50] Zhang, H., Li, X., Bing, L.: Video-LLaMA: An instruction-tuned audio-visual language model for video understanding. In: Feng, Y., Lefever, E. (eds.) Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing: System Demonstrations. pp. 543–553. Association for Computational Linguistics, Singapore (Dec 2023). https://doi.org/10...
[51] Zhang, H., Gu, X., Li, J., Ma, C., Bai, S., et al.: Thinking with videos: Multimodal tool-augmented reinforcement learning for long video reasoning (Sep 2025). https://doi.org/10.48550/arXiv.2508.04416
[52] Zhang, J., Jiao, Y., Chen, S., Zhao, N., Tan, Z., et al.: EventHallusion: Diagnosing event hallucinations in video LLMs. arXiv preprint arXiv:2409.16597 (2024). https://doi.org/10.48550/arXiv.2409.16597
[53] Zhou, B., Andonian, A., Oliva, A., Torralba, A.: Temporal relational reasoning in videos. In: Proceedings of the European Conference on Computer Vision (ECCV) (September 2018)
[54] Zhou, J., Shu, Y., Zhao, B., Wu, B., Liang, Z., et al.: MLVU: Benchmarking multi-task long video understanding. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). pp. 13691–13701 (June 2025)
[55] Zou, J., Xia, H., Ye, Z., Zhang, S., Lai, C., et al.: DeepSport: A multimodal large language model for comprehensive sports video reasoning via agentic reinforcement learning (Nov 2025). https://doi.org/10.48550/arXiv.2511.12908