
arxiv: 2505.20715 · v2 · submitted 2025-05-27 · 💻 cs.CV · cs.CL

MUSEG: Reinforcing Video Temporal Understanding via Timestamp-Aware Multi-Segment Grounding

Pith reviewed 2026-05-19 13:09 UTC · model grok-4.3

classification 💻 cs.CV cs.CL
keywords video temporal understanding · multi-segment grounding · reinforcement learning · multimodal large language models · timestamp awareness · temporal reasoning · video question answering · grounding tasks

The pith

MUSEG uses timestamp-aware multi-segment grounding in reinforcement learning to improve fine-grained temporal reasoning in video MLLMs.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents MUSEG as a way to strengthen multimodal large language models on tasks that require understanding when events occur in videos. Current models often fail at precise timing even after general video training, and prior reinforcement learning attempts have not closed the gap. MUSEG adds a mechanism that lets the model link a query to several relevant video segments marked by timestamps, then trains it with a sequence of rewards that first encourage basic alignment and later demand accurate timing. If this works, models should handle time-sensitive video questions and grounding problems more reliably across different kinds of video content.

Core claim

MUSEG is a reinforcement-learning method that improves video temporal understanding in multimodal large language models. It introduces timestamp-aware multi-segment grounding, which aligns a query with multiple relevant video segments, and a customized training recipe whose phased rewards progressively guide the model toward temporally grounded reasoning. The paper reports higher performance on temporal grounding and time-sensitive video question-answering tasks as a result.
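
The phased-reward idea can be illustrated with a minimal sketch. This page does not give the paper's reward definitions, phase boundaries, or weights, so everything here is an assumption: the function name `phased_reward`, the hard two-phase split, and the default `switch_step` are hypothetical and chosen only to show how a schedule might move from coarse alignment to precise timing.

```python
def phased_reward(step: int, coarse: float, fine: float,
                  switch_step: int = 1000) -> float:
    """Pick the reward signal for a given training step.

    Hypothetical two-phase schedule (not taken from the paper):
    before `switch_step` only the coarse alignment score is used,
    from `switch_step` onward only the fine timing score is used.
    """
    if step < 0:
        raise ValueError("step must be non-negative")
    # Phase 1: reward rough query-to-segment alignment.
    if step < switch_step:
        return coarse
    # Phase 2: reward accurate timestamps.
    return fine
```

A smoother variant could blend the two scores with a weight that changes over training; the hard switch is used here only because it is the simplest form of a phased schedule.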

What carries the argument

Timestamp-aware multi-segment grounding, which lets the model associate a single query with several timestamp-marked video segments at once and is trained through phased rewards that move from coarse to fine temporal alignment.
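
One way to picture scoring a multi-segment prediction is sketched below. The figure captions on this page name a "segment matching reward" with sequential local matching but do not define it, so this is not the paper's formula. As an assumption, the sketch uses temporal intersection-over-union (tIoU) and pairs predicted and ground-truth segments in start-time order; `temporal_iou` and `sequential_match_score` are hypothetical names.

```python
from typing import List, Tuple

Segment = Tuple[float, float]  # (start_seconds, end_seconds)


def temporal_iou(a: Segment, b: Segment) -> float:
    """Intersection-over-union of two time intervals."""
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union > 0 else 0.0


def sequential_match_score(pred: List[Segment], gold: List[Segment]) -> float:
    """Average tIoU after pairing segments in start-time order.

    Assumption for illustration: the i-th predicted segment is compared
    with the i-th gold segment, and any unpaired segment on either side
    contributes zero, so missing or extra segments lower the score.
    """
    if not pred or not gold:
        return 0.0  # nothing can be paired
    pred_sorted = sorted(pred)
    gold_sorted = sorted(gold)
    total = sum(temporal_iou(p, g) for p, g in zip(pred_sorted, gold_sorted))
    return total / max(len(pred_sorted), len(gold_sorted))
```

For example, predicting only the first of two annotated events perfectly gives a score of 0.5 under this sketch, which reflects the intuition that a query tied to several segments is only half answered.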

If this is right

  • Models produce more accurate start and end times when asked to locate multiple events in one video.
  • Performance rises on video question-answering items that depend on order or duration of actions.
  • The same training recipe transfers to other time-sensitive multimodal tasks without task-specific redesign.
  • Error rates drop on queries that require distinguishing overlapping or sequential events.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The method could support longer-form video analysis if the multi-segment mechanism is extended to handle dozens of segments without exploding computation.
  • Combining the grounding signal with audio or text transcripts might further tighten temporal alignment in mixed-media content.
  • Downstream applications such as automatic video editing or surveillance event logging would gain reliability once the approach is shown to work on uncurated real-world footage.

Load-bearing premise

The assumption that the phased reward schedule will steer the model toward genuine temporal reasoning improvements rather than new overfitting patterns or benchmark-specific shortcuts.

What would settle it

A new test set of videos with event timings and lengths outside the training distribution on which MUSEG shows no accuracy gain over standard RL baselines without multi-segment grounding.
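
The test described above could be scored along the following lines. This is a sketch under stated assumptions, not an evaluation protocol from the paper: the record fields `event_len`, `method`, and `baseline`, and the idea of defining "outside the training distribution" by an event-length range, are hypothetical simplifications.

```python
from statistics import mean
from typing import Dict, List


def ood_gain(examples: List[Dict[str, float]],
             train_min: float, train_max: float) -> float:
    """Mean score gain of the method over a baseline on out-of-range items.

    Each example is assumed to hold an event length in seconds
    ("event_len") and a per-item score for the method and the baseline.
    Only items whose event length falls outside [train_min, train_max]
    are counted.
    """
    ood = [e for e in examples
           if not (train_min <= e["event_len"] <= train_max)]
    if not ood:
        raise ValueError("no out-of-distribution examples found")
    return mean(e["method"] - e["baseline"] for e in ood)
```

A gain near zero or below on such a subset would be the outcome that counts against the method; a clearly positive gain would support the generalization claim.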

Figures

Figures reproduced from arXiv: 2505.20715 by Chenliang Li, Chi Chen, Fei Huang, Fuwen Luo, Jiyue Guo, Ji Zhang, Ming Yan, Peng Li, Shengfeng Lou, Weizhou Shen, Yang Liu, Ziyue Wang.

Figure 1: Performance of our MUSEG-7B across multiple …
Figure 3: Overview of MUSEG. (a) Our proposed segment matching reward (up) and timestamp reward (down). (b) RL-based …
Figure 4: Cases of our MUSEG-7B and baselines on multi-segment grounding (in-domain) and referred action recognition …
Figure 5: Segment matching reward (a) w/o local matching, (b) w/ local matching (sequential), and (c) w/ local matching …
Figure 6: (a) Model performance with different training …
Original abstract

Video temporal understanding is crucial for multimodal large language models (MLLMs) to reason over events in videos. Despite recent advances in general video understanding, current MLLMs still struggle with fine-grained temporal reasoning. While reinforcement learning (RL) has been explored to address this issue recently, existing RL approaches remain limited in performance on time-sensitive tasks. In this work, we propose MUSEG, a novel RL-based method that enhances temporal understanding by introducing timestamp-aware multi-segment grounding. MUSEG enables MLLMs to align queries with multiple relevant video segments, promoting more comprehensive temporal reasoning. To facilitate effective learning, we design a customized RL training recipe with phased rewards that progressively guides the model toward temporally grounded reasoning. Extensive experiments on temporal grounding and time-sensitive video question answering (QA) tasks demonstrate that MUSEG significantly outperforms existing methods and generalizes well across diverse temporal understanding scenarios. View our project at https://github.com/THUNLP-MT/MUSEG.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper proposes MUSEG, an RL-based method for multimodal large language models that introduces timestamp-aware multi-segment grounding to align queries with multiple relevant video segments and employs a customized phased-reward training recipe to improve fine-grained temporal reasoning. Experiments on temporal grounding and time-sensitive video QA tasks are reported to show significant outperformance over existing methods with good generalization across scenarios.

Significance. If the central empirical claims hold after addressing robustness concerns, the work would offer a concrete advance in RL training for video MLLMs by moving beyond single-segment or non-phased approaches, potentially influencing how temporal supervision is structured in future models.

major comments (2)
  1. [Abstract and §5] Abstract and §5 (Experiments): the generalization claim that MUSEG 'generalizes well across diverse temporal understanding scenarios' is not supported by any reported evaluation on temporally shifted, out-of-distribution, or adversarially perturbed video distributions; without such tests or an ablation that isolates the timestamp-aware multi-segment component from the phased rewards, the reported gains could be explained by increased effective supervision on the same annotations rather than the proposed novelty.
  2. [Method] Method (phased reward recipe): the description of the phased rewards as 'progressively guid[ing] the model toward temporally grounded reasoning' lacks any analysis or ablation demonstrating that the schedule prevents reward hacking or overfitting to the grounding/QA splits; a concrete test (e.g., reward-component ablation or training-curve comparison) is needed to establish that the phased structure is load-bearing for the claimed improvements.
minor comments (2)
  1. [Method] Notation for the multi-segment grounding loss and timestamp encoding should be made fully explicit (including any hyperparameters) to support reproducibility.
  2. [Experiments] Figure captions and table headers could more clearly distinguish between grounding and QA metrics.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We address the major comments point by point below, agreeing where revisions are needed to strengthen the empirical support and providing clarifications on the current evidence.

Point-by-point responses
  1. Referee: [Abstract and §5] Abstract and §5 (Experiments): the generalization claim that MUSEG 'generalizes well across diverse temporal understanding scenarios' is not supported by any reported evaluation on temporally shifted, out-of-distribution, or adversarially perturbed video distributions; without such tests or an ablation that isolates the timestamp-aware multi-segment component from the phased rewards, the reported gains could be explained by increased effective supervision on the same annotations rather than the proposed novelty.

    Authors: We acknowledge that explicit tests on temporally shifted or adversarially perturbed distributions would provide stronger evidence for the generalization claim. Our current evaluations span multiple benchmarks with varying video durations, event densities, and query types, which we view as supporting generalization across diverse scenarios. To directly address the isolation concern, we will add an ablation study in the revised manuscript that separately evaluates the timestamp-aware multi-segment grounding and the phased-reward components against a baseline with equivalent supervision volume. This will help demonstrate that the gains arise from the proposed mechanisms rather than annotation volume alone. revision: yes

  2. Referee: [Method] Method (phased reward recipe): the description of the phased rewards as 'progressively guid[ing] the model toward temporally grounded reasoning' lacks any analysis or ablation demonstrating that the schedule prevents reward hacking or overfitting to the grounding/QA splits; a concrete test (e.g., reward-component ablation or training-curve comparison) is needed to establish that the phased structure is load-bearing for the claimed improvements.

    Authors: We agree that additional analysis is required to substantiate the role of the phased reward schedule. In the revised version we will include training-curve comparisons between phased and non-phased reward variants as well as a reward-component ablation. These results will show whether the phased structure mitigates reward hacking and overfitting to individual splits, thereby confirming that it is load-bearing for the observed performance gains. revision: yes
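
A training-curve comparison of the kind proposed here could start from a simple divergence check. The sketch below is an assumption about how such an analysis might be set up, not something the authors describe: the function name `divergence_steps`, the per-checkpoint lists, and the tolerance `tol` are hypothetical.

```python
from typing import List


def divergence_steps(train_reward: List[float], heldout: List[float],
                     tol: float = 0.0) -> List[int]:
    """Flag checkpoints where training reward rises but held-out score falls.

    Index i is flagged when the training reward increased from
    checkpoint i-1 to i while the held-out metric dropped by more than
    `tol`. Repeated flags are one possible symptom of reward hacking.
    """
    if len(train_reward) != len(heldout):
        raise ValueError("curves must have the same number of checkpoints")
    flagged = []
    for i in range(1, len(train_reward)):
        reward_up = train_reward[i] > train_reward[i - 1]
        heldout_down = heldout[i] < heldout[i - 1] - tol
        if reward_up and heldout_down:
            flagged.append(i)
    return flagged
```

Running the same check on phased and non-phased runs and comparing how many checkpoints are flagged would give a rough, first-pass view of whether the phased schedule changes this behaviour.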

Circularity Check

0 steps flagged

No circularity: empirical RL method with independent experimental validation

full rationale

The paper introduces MUSEG as a constructive RL-based architecture for timestamp-aware multi-segment grounding together with a phased-reward training schedule. All central claims rest on reported experimental outperformance across temporal grounding and time-sensitive QA benchmarks rather than on any derivation that reduces by construction to fitted parameters, self-citations, or renamed inputs. No equations or training objectives are shown to be tautological with the evaluation metrics, and the method description supplies independent architectural and reward-design choices whose effects are measured externally. The work is therefore self-contained against its own empirical results.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The approach rests on the domain assumption that reinforcement learning with appropriately designed rewards can instill temporal grounding capabilities in MLLMs; no free parameters or invented entities are mentioned in the abstract.

axioms (1)
  • domain assumption: Reinforcement learning with phased rewards can progressively improve temporal reasoning in video MLLMs
    Invoked as the basis for the customized training recipe.



Forward citations

Cited by 4 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. ParaVT: Taming the Tool Prior Paradox for Parallel Tool Use in Agentic Video Reinforcement Learning

    cs.CV 2026-05 unverdicted novelty 7.0

    ParaVT is a parallel video tool-calling RL framework that resolves the Tool Prior Paradox via PARA-GRPO, delivering +7.9% average gains on six long-video benchmarks and raising format compliance from 0.13 to 0.64.

  2. ParaVT: Taming the Tool Prior Paradox for Parallel Tool Use in Agentic Video Reinforcement Learning

    cs.CV 2026-05 unverdicted novelty 6.0

    ParaVT introduces the first multi-agent RL framework for parallel video tool calling in LMMs, using PARA-GRPO to resolve the Tool Prior Paradox and achieve +7.9% average improvement over Qwen3-VL baseline across six b...

  3. GraphThinker: Reinforcing Temporally Grounded Video Reasoning with Event Graph Thinking

    cs.CV 2026-02 unverdicted novelty 6.0

    GraphThinker reduces temporal hallucinations in video reasoning by constructing event-based scene graphs and applying visual attention rewards in reinforcement finetuning.

  4. TempR1: Improving Temporal Understanding of MLLMs via Temporal-Aware Multi-Task Reinforcement Learning

    cs.CV 2025-12 unverdicted novelty 5.0

    TempR1 applies temporal-aware multi-task RL using GRPO and three types of localization rewards to achieve SOTA temporal understanding in MLLMs with synergistic gains from joint optimization.
