Learning Spatiotemporal Sensitivity in Video LLMs via Counterfactual Reinforcement Learning
Pith reviewed 2026-05-22 07:42 UTC · model grok-4.3
The pith
Counterfactual RL trains video LLMs to change answers only when video dynamics actually change.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
CRPO is a dual-branch reinforcement learning procedure in which each training example is paired with a horizontally flipped or temporally reversed counterpart. A Counterfactual Relation Reward is computed between the two branches so that the policy is rewarded only when its answers differ on dynamic questions and match on static questions, thereby reducing the viability of shortcut policies that ignore video dynamics.
What carries the argument
Counterfactual Relation Reward (CRR) applied across original and transformed video branches inside a policy optimization loop.
If this is right
- DyBench pair-accuracy rises because fixed-answer shortcuts are penalized across branches.
- TimeBlind independent accuracy improves as models must use temporal order rather than static cues.
- General video QA performance remains competitive because the method does not discard useful static information.
- Shortcut policies become inconsistent across branches and therefore receive lower average reward.
Where Pith is reading between the lines
- The same relational-reward idea could be tested on audio or 3D scene models that currently exploit static background features.
- Replacing flips and reversals with learned counterfactual generators might produce harder negatives and stronger sensitivity gains.
- The pair-accuracy metric on DyBench could be adapted to measure shortcut robustness in other multimodal benchmarks.
Load-bearing premise
Horizontal flips and temporal reversals cleanly separate dynamic from static content while preserving question meaning and without creating new exploitable artifacts.
What would settle it
Run the trained model on DyBench pairs where the counterfactual video is replaced by an unrelated clip or by a version containing visible editing artifacts; if pair-accuracy collapses to baseline levels the claim is supported.
Figures
read the original abstract
Video large language models (Video LLMs) achieve strong benchmark accuracy, yet often answer video questions through shortcuts such as single-frame cues and language priors rather than by tracking spatiotemporal dynamics. This issue is exacerbated in RL post-training, where correctness-only rewards can further reinforce shortcut policies that obtain high reward without tracking video dynamics. We address this by asking a controlled counterfactual question: if the visual world changed while the question remained fixed, should the answer change or stay the same? Based on this view, we propose \textbf{Counterfactual Relational Policy Optimization (CRPO)}, a dual-branch RL framework for improving \emph{spatiotemporal sensitivity}. CRPO constructs counterfactual videos through horizontal flips and temporal reversals, trains on both original and counterfactual branches, and introduces a \textbf{Counterfactual Relation Reward (CRR)} between their answers. CRR encourages answers to change for dynamic questions and remain unchanged for static questions. This cross-branch constraint makes it difficult for shortcut policies to be consistently rewarded across both branches. To evaluate this property, we introduce \textbf{DyBench}, a paired counterfactual video benchmark with 3,014 videos covering reversible dynamics, moving direction, and event sequence, together with a strict pair-accuracy metric that prevents fixed-answer shortcuts from inflating scores. Experiments show that CRPO outperforms prior RL methods on spatiotemporal-sensitive evaluations while maintaining competitive general video performance. On Qwen3-VL-8B, CRPO improves DyBench P-Acc by +7.7 and TimeBlind I-Acc by +8.2 over the base model, indicating improved spatiotemporal sensitivity rather than stronger reliance on static shortcuts. The project website can be found at https://ddz16.github.io/crpo.github.io/ .
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper claims that Video LLMs often rely on static shortcuts rather than spatiotemporal dynamics, and that standard RL post-training exacerbates this. It proposes Counterfactual Relational Policy Optimization (CRPO), a dual-branch RL method that generates counterfactual videos via horizontal flips and temporal reversals, applies a Counterfactual Relation Reward (CRR) to encourage answer changes only on dynamic questions, and introduces DyBench (3,014 paired videos) with a strict pair-accuracy (P-Acc) metric. On Qwen3-VL-8B, CRPO yields +7.7 DyBench P-Acc and +8.2 TimeBlind I-Acc over the base model, which the authors interpret as evidence of genuine sensitivity gains.
Significance. If the central empirical claims hold after validation, the work would be a meaningful contribution to mitigating shortcut learning in video understanding. The CRR formulation and the paired DyBench benchmark with its P-Acc metric provide a concrete, falsifiable way to measure cross-branch consistency, and the project website offers a route to reproducibility. These elements strengthen the paper beyond typical RL fine-tuning results.
major comments (3)
- [Abstract and §4] Abstract and §4 (Experiments): the reported gains of +7.7 DyBench P-Acc and +8.2 TimeBlind I-Acc are presented without error bars, statistical significance tests, ablation controls on the reward weighting, or details on how the 3,014 DyBench videos were selected and validated; these omissions make it impossible to assess whether the improvements are robust or attributable to the proposed CRR mechanism.
- [§3.2] §3.2 (Counterfactual Construction and CRR): the claim that CRR enforces genuine dynamic tracking rather than branch-specific shortcuts rests on the unverified assumption that horizontal flips and temporal reversals preserve question semantics while altering only the targeted dynamics and introduce no new exploitable artifacts (e.g., mirrored identities or detectable transformation signatures); no ablation or filtering procedure is described that rules out models learning to detect the transformation itself.
- [§4.2] §4.2 (DyBench and TimeBlind): the pair-accuracy metric is designed to penalize fixed-answer shortcuts, yet the paper provides no evidence that the benchmark questions were explicitly filtered for clean separability between dynamic and static cases or that transformation-detection shortcuts were ablated; without this, the +7.7 / +8.2 gains remain compatible with an alternative explanation of asymmetric shortcut exploitation across branches.
minor comments (2)
- [§3.3] Clarify the exact reward weighting hyper-parameter between original and counterfactual branches and report its sensitivity in an ablation; it is listed as a free parameter but its impact on the final numbers is not quantified.
- [Figure 2] Ensure all figures showing example counterfactual pairs include both the original and transformed frames side-by-side with the corresponding question and model answers to allow readers to visually inspect potential artifacts.
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed feedback. The comments highlight important aspects of robustness, verification of assumptions, and potential alternative explanations, which we address point by point below. We have revised the manuscript to incorporate additional analyses, ablations, and clarifications where feasible.
read point-by-point responses
-
Referee: [Abstract and §4] Abstract and §4 (Experiments): the reported gains of +7.7 DyBench P-Acc and +8.2 TimeBlind I-Acc are presented without error bars, statistical significance tests, ablation controls on the reward weighting, or details on how the 3,014 DyBench videos were selected and validated; these omissions make it impossible to assess whether the improvements are robust or attributable to the proposed CRR mechanism.
Authors: We agree that the original presentation omitted key elements for assessing robustness. In the revised manuscript, we now report error bars from three independent runs with different random seeds for all main results in §4 and the abstract. We added paired t-tests confirming statistical significance (p < 0.01) for the +7.7 and +8.2 gains. A new ablation table in the appendix varies the CRR weighting hyperparameter and shows that performance degrades without the cross-branch term, supporting attribution to CRR. For DyBench construction, we expanded §4.2 with details on sourcing from public video datasets, manual review by three annotators for dynamic content, and inter-annotator agreement statistics (Cohen's kappa = 0.82). These changes directly address the concerns. revision: yes
-
Referee: [§3.2] §3.2 (Counterfactual Construction and CRR): the claim that CRR enforces genuine dynamic tracking rather than branch-specific shortcuts rests on the unverified assumption that horizontal flips and temporal reversals preserve question semantics while altering only the targeted dynamics and introduce no new exploitable artifacts (e.g., mirrored identities or detectable transformation signatures); no ablation or filtering procedure is described that rules out models learning to detect the transformation itself.
Authors: This concern is well-taken, as unverified assumptions can weaken causal claims. We maintain that horizontal flips and temporal reversals are appropriate for the question types in DyBench because they alter direction or sequence without changing object identities or event categories, but we acknowledge the need for explicit checks. In the revision, we added an ablation in §3.2 training a lightweight video classifier to predict transformation type (original vs. flipped vs. reversed) from frame features alone, which achieves accuracy near chance (34%). We also describe a lightweight filtering step applied during data preparation that excludes pairs where the transformation visibly affects semantics (e.g., readable text or asymmetric objects). These additions provide supporting evidence, though we note that exhaustive artifact exclusion remains an open challenge for any synthetic counterfactual method. revision: partial
-
Referee: [§4.2] §4.2 (DyBench and TimeBlind): the pair-accuracy metric is designed to penalize fixed-answer shortcuts, yet the paper provides no evidence that the benchmark questions were explicitly filtered for clean separability between dynamic and static cases or that transformation-detection shortcuts were ablated; without this, the +7.7 / +8.2 gains remain compatible with an alternative explanation of asymmetric shortcut exploitation across branches.
Authors: We agree that demonstrating clean separability and ruling out transformation shortcuts strengthens the interpretation. In the revised §4.2, we now explicitly describe how questions were labeled as dynamic or static according to whether the ground-truth answer differs under the counterfactual transformation, with examples provided. We added an analysis comparing answer consistency across branches for the base model versus CRPO, showing that CRPO increases appropriate changes on dynamic pairs while preserving consistency on static ones. The transformation-detection ablation referenced in our response to the §3.2 comment further mitigates the asymmetric shortcut concern. While these revisions make the alternative explanation less plausible, we recognize that no benchmark can eliminate every conceivable shortcut in large models. revision: yes
Circularity Check
No circularity: CRR and DyBench defined independently of reported gains
full rationale
The derivation introduces CRPO as a dual-branch RL method that applies horizontal flips and temporal reversals to create counterfactual videos, then defines Counterfactual Relation Reward (CRR) to penalize inconsistent answers across branches for dynamic questions while rewarding consistency for static ones. DyBench is separately constructed as a paired benchmark with its own pair-accuracy metric. Neither the reward formulation nor the benchmark metric is defined in terms of the final performance numbers on Qwen3-VL-8B; the +7.7 P-Acc and +8.2 I-Acc improvements are presented as empirical outcomes rather than quantities that reduce to the training loop by construction. No self-citation chains, ansatz smuggling, or uniqueness theorems from prior author work are invoked to force the central result. The chain remains self-contained against external benchmarks.
Axiom & Free-Parameter Ledger
free parameters (1)
- Reward weighting between original and counterfactual branches
axioms (1)
- domain assumption Counterfactual transformations (horizontal flip, temporal reversal) preserve question semantics while altering only the targeted dynamics.
Lean theorems connected to this paper
-
IndisputableMonolith/Cost/FunctionalEquation.leanwashburn_uniqueness_aczel unclear?
unclearRelation between the paper passage and the cited Recognition theorem.
CRPO constructs counterfactual videos through horizontal flips and temporal reversals... Counterfactual Relation Reward (CRR) encourages answers to change for dynamic questions and remain unchanged for static questions.
-
IndisputableMonolith/Foundation/AlexanderDuality.leanalexander_duality_circle_linking unclear?
unclearRelation between the paper passage and the cited Recognition theorem.
We introduce DyBench, a paired counterfactual video benchmark... with strict pair-accuracy metric
What do these tags mean?
- matches
- The paper's claim is directly supported by a theorem in the formal canon.
- supports
- The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends
- The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses
- The paper appears to rely on the theorem as machinery.
- contradicts
- The paper's claim conflicts with a theorem or certificate in the canon.
- unclear
- Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Reference graph
Works this paper leans on
-
[1]
STRIVE: Structured Spatiotemporal Exploration for Reinforcement Learning in Video Question Answering
Emad Bahrami, Olga Zatsarynna, Parth Pathak, Sunando Sengupta, Juergen Gall, and Mohsen Fayyaz. Strive: Structured spatiotemporal exploration for reinforcement learning in video question answering.arXiv preprint arXiv:2604.01824, 2026
work page internal anchor Pith review Pith/arXiv arXiv 2026
-
[2]
Shuai Bai, Yuxuan Cai, Ruizhe Chen, Keqin Chen, Xionghui Chen, Zesen Cheng, Lianghao Deng, Wei Ding, Chang Gao, Chunjiang Ge, et al. Qwen3-vl technical report.arXiv preprint arXiv:2511.21631, 2025
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[3]
Shuai Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Sibo Song, Kai Dang, Peng Wang, Shijie Wang, Jun Tang, et al. Qwen2.5-vl technical report.arXiv preprint arXiv:2502.13923, 2025
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[4]
Mu Cai, Reuben Tan, Jianrui Zhang, Bocheng Zou, Kai Zhang, Feng Yao, Fangrui Zhu, Jing Gu, Yiwu Zhong, Yuzhang Shang, et al. Temporalbench: Benchmarking fine-grained temporal understanding for multimodal video models.arXiv preprint arXiv:2410.10818, 2024
-
[5]
Perceptionlm: Open-access data and models for detailed visual understanding.arXiv:2504.13180, 2025
Jang Hyun Cho, Andrea Madotto, Effrosyni Mavroudi, Triantafyllos Afouras, Tushar Nagarajan, Muhammad Maaz, Yale Song, Tengyu Ma, Shuming Hu, Suyog Jain, et al. Perceptionlm: Open- access data and models for detailed visual understanding.arXiv preprint arXiv:2504.13180, 2025
-
[6]
Molmo2: Open Weights and Data for Vision-Language Models with Video Understanding and Grounding
Christopher Clark, Jieyu Zhang, Zixian Ma, Jae Sung Park, Mohammadreza Salehi, Rohun Tripathi, Sangho Lee, Zhongzheng Ren, Chris Dongjoo Kim, Yinuo Yang, et al. Molmo2: Open weights and data for vision-language models with video understanding and grounding.arXiv preprint arXiv:2601.10611, 2026
work page internal anchor Pith review Pith/arXiv arXiv 2026
-
[7]
Lost in time: A new temporal benchmark for videollms.arXiv preprint arXiv:2410.07752, 2024
Daniel Cores, Michael Dorkenwald, Manuel Mucientes, Cees GM Snoek, and Yuki M Asano. Lost in time: A new temporal benchmark for videollms.arXiv preprint arXiv:2410.07752, 2024
-
[8]
Vlmevalkit: An open-source toolkit for evaluating large multi-modality models
Haodong Duan, Junming Yang, Yuxuan Qiao, Xinyu Fang, Lin Chen, Yuan Liu, Xiaoyi Dong, Yuhang Zang, Pan Zhang, Jiaqi Wang, et al. Vlmevalkit: An open-source toolkit for evaluating large multi-modality models. InProceedings of the 32nd ACM international conference on multimedia, pages 11198–11201, 2024
work page 2024
-
[9]
Video-r1: Reinforcing video reasoning in mllms
Kaituo Feng, Kaixiong Gong, Bohao Li, Zonghao Guo, Yibing Wang, Tianshuo Peng, Junfei Wu, Xiaoying Zhang, Benyou Wang, and Xiangyu Yue. Video-r1: Reinforcing video reasoning in mllms. InNeurIPS, 2025. 10
work page 2025
-
[10]
OneThinker: All-in-one Reasoning Model for Image and Video
Kaituo Feng, Manyuan Zhang, Hongyu Li, Kaixuan Fan, Shuang Chen, Yilei Jiang, Dian Zheng, Peiwen Sun, Yiyuan Zhang, Haoze Sun, et al. Onethinker: All-in-one reasoning model for image and video.arXiv preprint arXiv:2512.03043, 2025
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[11]
Video-mme: The first-ever comprehensive evaluation benchmark of multi-modal llms in video analysis
Chaoyou Fu, Yuhan Dai, Yongdong Luo, Lei Li, Shuhuai Ren, Renrui Zhang, Zihan Wang, Chenyu Zhou, Yunhang Shen, Mengdan Zhang, et al. Video-mme: The first-ever comprehensive evaluation benchmark of multi-modal llms in video analysis. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 24108–24118, 2025
work page 2025
-
[12]
Raghav Goyal, Samira Ebrahimi Kahou, Vincent Michalski, Joanna Materzynska, Susanne Westphal, Heuna Kim, Valentin Haenel, Ingo Fruend, Peter Yianilos, Moritz Mueller-Freitag, et al. The" something something" video database for learning and evaluating visual common sense. InProceedings of the IEEE international conference on computer vision, pages 5842– 5850, 2017
work page 2017
-
[13]
Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Peiyi Wang, Qihao Zhu, Runxin Xu, Ruoyu Zhang, Shirong Ma, Xiao Bi, et al. Deepseek-r1 incentivizes reasoning in llms through reinforcement learning.Nature, 645(8081):633–638, 2025
work page 2025
-
[14]
Wenyi Hong, Yean Cheng, Zhuoyi Yang, Weihan Wang, Lefan Wang, Xiaotao Gu, Shiyu Huang, Yuxiao Dong, and Jie Tang. Motionbench: Benchmarking and improving fine-grained video motion understanding for vision language models. InProceedings of the Computer Vision and Pattern Recognition Conference, pages 8450–8460, 2025
work page 2025
-
[15]
REINFORCE++: Stabilizing Critic-Free Policy Optimization with Global Advantage Normalization
Jian Hu, Jason Klein Liu, Haotian Xu, and Wei Shen. Reinforce++: Stabilizing critic-free policy optimization with global normalization.arXiv preprint arXiv:2501.03262, 2025
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[16]
Lianghua Huang, Xin Zhao, and Kaiqi Huang. Got-10k: A large high-diversity benchmark for generic object tracking in the wild.IEEE transactions on pattern analysis and machine intelligence, 43(5):1562–1577, 2019
work page 2019
-
[17]
Mhbench: Demystifying motion hallucination in videollms
Ming Kong, Xianzhou Zeng, Luyuan Chen, Yadong Li, Bo Yan, and Qiang Zhu. Mhbench: Demystifying motion hallucination in videollms. InProceedings of the AAAI Conference on Artificial Intelligence, volume 39, pages 4401–4409, 2025
work page 2025
-
[18]
Ku, M., Chong, T., Leung, J., Shah, K., Yu, A., and Chen, W
Benno Krojer, Mojtaba Komeili, Candace Ross, Quentin Garrido, Koustuv Sinha, Nicolas Ballas, and Mahmoud Assran. A shortcut-aware video-qa benchmark for physical understanding via minimal video pairs.arXiv preprint arXiv:2506.09987, 2025
-
[19]
The language of actions: Recovering the syntax and semantics of goal-directed human activities
Hilde Kuehne, Ali Arslan, and Thomas Serre. The language of actions: Recovering the syntax and semantics of goal-directed human activities. InProceedings of the IEEE conference on computer vision and pattern recognition, pages 780–787, 2014
work page 2014
-
[20]
Revealing single frame bias for video-and-language learning
Jie Lei, Tamara Berg, and Mohit Bansal. Revealing single frame bias for video-and-language learning. InProceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 487–507, 2023
work page 2023
-
[21]
Baiqi Li, Kangyi Zhao, Ce Zhang, Chancharik Mitra, Jean de Dieu Nyandwi, and Gedas Bertasius. Timeblind: A spatio-temporal compositionality benchmark for video llms.arXiv preprint arXiv:2602.00288, 2026
-
[22]
LLaVA-OneVision: Easy Visual Task Transfer
Bo Li, Yuanhan Zhang, Dong Guo, Renrui Zhang, Feng Li, Hao Zhang, Kaichen Zhang, Peiyuan Zhang, Yanwei Li, Ziwei Liu, et al. Llava-onevision: Easy visual task transfer.arXiv preprint arXiv:2408.03326, 2024
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[23]
Chaoyu Li, Eun Woo Im, and Pooyan Fazli. Vidhalluc: Evaluating temporal hallucinations in multimodal large language models for video understanding. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 13723–13733, 2025
work page 2025
-
[24]
Mvbench: A comprehensive multi-modal video understanding benchmark
Kunchang Li, Yali Wang, Yinan He, Yizhuo Li, Yi Wang, Yi Liu, Zun Wang, Jilan Xu, Guo Chen, Ping Luo, et al. Mvbench: A comprehensive multi-modal video understanding benchmark. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 22195–22206, 2024. 11
work page 2024
-
[25]
VideoChat-R1: Enhancing Spatio-Temporal Perception via Reinforcement Fine-Tuning
Xinhao Li, Ziang Yan, Desen Meng, Lu Dong, Xiangyu Zeng, Yinan He, Yali Wang, Yu Qiao, Yi Wang, and Limin Wang. Videochat-r1: Enhancing spatio-temporal perception via reinforce- ment fine-tuning.arXiv preprint arXiv:2504.06958, 2025
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[26]
Yun Li, Yiming Zhang, Tao Lin, XiangRui Liu, Wenxiao Cai, Zheng Liu, and Bo Zhao. Sti- bench: Are mllms ready for precise spatial-temporal world understanding? InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 5622–5632, 2025
work page 2025
-
[27]
Visual instruction tuning.Advances in neural information processing systems, 36:34892–34916, 2023
Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning.Advances in neural information processing systems, 36:34892–34916, 2023
work page 2023
-
[28]
Shuming Liu, Mingchen Zhuge, Changsheng Zhao, Jun Chen, Lemeng Wu, Zechun Liu, Chenchen Zhu, Zhipeng Cai, Chong Zhou, Haozhe Liu, et al. Videoauto-r1: Video auto reasoning via thinking once, answering twice.arXiv preprint arXiv:2601.05175, 2026
-
[29]
Yuanxin Liu, Shicheng Li, Yi Liu, Yuxiang Wang, Shuhuai Ren, Lei Li, Sishuo Chen, Xu Sun, and Lu Hou. Tempcompass: Do video llms really understand videos? InFindings of the Association for Computational Linguistics: ACL 2024, pages 8731–8772, 2024
work page 2024
-
[30]
Part i: Tricks or traps? a deep dive into rl for llm reasoning
Zihe Liu, Jiashun Liu, Yancheng He, Weixun Wang, Jiaheng Liu, Ling Pan, Xinyu Hu, Shaopan Xiong, Ju Huang, Jian Hu, et al. Part i: Tricks or traps? a deep dive into rl for llm reasoning. arXiv preprint arXiv:2508.08221, 2025
-
[31]
Deepvideo-r1: Video reinforce- ment fine-tuning via difficulty-aware regressive grpo
Jinyoung Park, Jeehye Na, Jinyoung Kim, and Hyunwoo J Kim. Deepvideo-r1: Video reinforce- ment fine-tuning via difficulty-aware regressive grpo. InNeurIPS, 2025
work page 2025
-
[32]
A new era of intelligence with gemini 3, 2025
Sundar Pichai, Demis Hassabis, and Koray Kavukcuoglu. A new era of intelligence with gemini 3, 2025. URL https://blog.google/intl/en-africa/company-news/ outreach-and-initiatives/a-new-era-of-intelligence-with-gemini-3/
work page 2025
-
[33]
LongVU: Spatiotemporal Adaptive Compression for Long Video-Language Understanding
Xiaoqian Shen, Yunyang Xiong, Changsheng Zhao, Lemeng Wu, Jun Chen, Chenchen Zhu, Zechun Liu, Fanyi Xiao, Balakrishnan Varadarajan, Florian Bordes, et al. Longvu: Spa- tiotemporal adaptive compression for long video-language understanding.arXiv preprint arXiv:2410.17434, 2024
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[34]
Jiapeng Shi, Junke Wang, Zuyao You, Bo He, and Zuxuan Wu. Videoloom: A video large language model for joint spatial-temporal understanding.arXiv preprint arXiv:2601.07290, 2026
-
[35]
Video-xl: Extra-long vision language model for hour-scale video understanding
Yan Shu, Zheng Liu, Peitian Zhang, Minghao Qin, Junjie Zhou, Zhengyang Liang, Tiejun Huang, and Bo Zhao. Video-xl: Extra-long vision language model for hour-scale video understanding. InProceedings of the Computer Vision and Pattern Recognition Conference, pages 26160–26169, 2025
work page 2025
-
[36]
Aaditya Singh, Adam Fry, Adam Perelman, Adam Tart, Adi Ganesh, Ahmed El-Kishky, Aidan McLaughlin, Aiden Low, AJ Ostrow, Akhila Ananthram, et al. Openai gpt-5 system card.arXiv preprint arXiv:2601.03267, 2025
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[37]
Moviechat: From dense token to sparse memory for long video understanding
Enxin Song, Wenhao Chai, Guanhong Wang, Yucheng Zhang, Haoyang Zhou, Feiyang Wu, Haozhe Chi, Xun Guo, Tian Ye, Yanting Zhang, et al. Moviechat: From dense token to sparse memory for long video understanding. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 18221–18232, 2024
work page 2024
-
[38]
Combining embedded accelerometers with computer vision for recognizing food preparation activities
Sebastian Stein and Stephen J McKenna. Combining embedded accelerometers with computer vision for recognizing food preparation activities. InProceedings of the 2013 ACM international joint conference on Pervasive and ubiquitous computing, pages 729–738, 2013
work page 2013
-
[39]
VL-Rethinker: Incentivizing Self-Reflection of Vision-Language Models with Reinforcement Learning
Haozhe Wang, Chao Qu, Zuming Huang, Wei Chu, Fangzhen Lin, and Wenhu Chen. Vl- rethinker: Incentivizing self-reflection of vision-language models with reinforcement learning. arXiv preprint arXiv:2504.08837, 2025
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[40]
Videorft: Incentivizing video reasoning capability in mllms via reinforced fine-tuning
Qi Wang, Yanrui Yu, Ye Yuan, Rui Mao, and Tianfei Zhou. Videorft: Incentivizing video reasoning capability in mllms via reinforced fine-tuning. InNeurIPS, 2025. 12
work page 2025
-
[41]
Lvbench: An extreme long video understanding benchmark
Weihan Wang, Zehai He, Wenyi Hong, Yean Cheng, Xiaohan Zhang, Ji Qi, Ming Ding, Xiaotao Gu, Shiyu Huang, Bin Xu, et al. Lvbench: An extreme long video understanding benchmark. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 22958– 22967, 2025
work page 2025
-
[42]
InternVL3.5: Advancing Open-Source Multimodal Models in Versatility, Reasoning, and Efficiency
Weiyun Wang, Zhangwei Gao, Lixin Gu, Hengjun Pu, Long Cui, Xingguang Wei, Zhaoyang Liu, Linglin Jing, Shenglong Ye, Jie Shao, et al. Internvl3.5: Advancing open-source multimodal models in versatility, reasoning, and efficiency.arXiv preprint arXiv:2508.18265, 2025
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[43]
Xiyao Wang, Zhengyuan Yang, Chao Feng, Hongjin Lu, Linjie Li, Chung-Ching Lin, Kevin Lin, Furong Huang, and Lijuan Wang. Sota with less: Mcts-guided sample selection for data-efficient visual reasoning self-improvement.arXiv preprint arXiv:2504.07934, 2025
-
[44]
Ziyue Wang, Sheng Jin, Zhongrong Zuo, Jiawei Wu, Han Qiu, Qi She, Hao Zhang, and Xudong Jiang. Video-ktr: Reinforcing video reasoning via key token attribution.arXiv preprint arXiv:2601.19686, 2026
-
[45]
Seeing the arrow of time in large multimodal models
Zihui Xue, Mi Luo, and Kristen Grauman. Seeing the arrow of time in large multimodal models. InNeurIPS, 2025
work page 2025
-
[46]
CLEVRER: CoLlision Events for Video REpresentation and Reasoning
Kexin Yi, Chuang Gan, Yunzhu Li, Pushmeet Kohli, Jiajun Wu, Antonio Torralba, and Joshua B Tenenbaum. Clevrer: Collision events for video representation and reasoning.arXiv preprint arXiv:1910.01442, 2019
work page internal anchor Pith review Pith/arXiv arXiv 1910
-
[47]
Unhackable temporal rewarding for scalable video mllms.arXiv preprint arXiv:2502.12081, 2025
En Yu, Kangheng Lin, Liang Zhao, Yana Wei, Zining Zhu, Haoran Wei, Jianjian Sun, Zheng Ge, Xiangyu Zhang, Jingyu Wang, et al. Unhackable temporal rewarding for scalable video mllms.arXiv preprint arXiv:2502.12081, 2025
-
[48]
DAPO: An Open-Source LLM Reinforcement Learning System at Scale
Qiying Yu, Zheng Zhang, Ruofei Zhu, Yufeng Yuan, Xiaochen Zuo, Yu Yue, Weinan Dai, Tiantian Fan, Gaohong Liu, Lingjun Liu, et al. Dapo: An open-source llm reinforcement learning system at scale.arXiv preprint arXiv:2503.14476, 2025
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[49]
Videorefer suite: Advancing spatial-temporal object understanding with video llm
Yuqian Yuan, Hang Zhang, Wentong Li, Zesen Cheng, Boqiang Zhang, Long Li, Xin Li, Deli Zhao, Wenqiao Zhang, Yueting Zhuang, et al. Videorefer suite: Advancing spatial-temporal object understanding with video llm. InProceedings of the Computer Vision and Pattern Recognition Conference, pages 18970–18980, 2025
work page 2025
-
[50]
VideoLLaMA 3: Frontier Multimodal Foundation Models for Image and Video Understanding
Boqiang Zhang, Kehan Li, Zesen Cheng, Zhiqiang Hu, Yuqian Yuan, Guanzheng Chen, Sicong Leng, Yuming Jiang, Hang Zhang, Xin Li, et al. Videollama 3: Frontier multimodal foundation models for image and video understanding.arXiv preprint arXiv:2501.13106, 2025
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[51]
LLaVA-Video: Video Instruction Tuning With Synthetic Data
Yuanhan Zhang, Jinming Wu, Wei Li, Bo Li, Zejun Ma, Ziwei Liu, and Chunyuan Li. Llava- video: Video instruction tuning with synthetic data.arXiv preprint arXiv:2410.02713, 2024
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[52]
Mmvu: Measuring expert-level multi-discipline video understanding
Yilun Zhao, Haowei Zhang, Lujing Xie, Tongyan Hu, Guo Gan, Yitao Long, Zhiyuan Hu, Weiyuan Chen, Chuhan Li, Zhijian Xu, et al. Mmvu: Measuring expert-level multi-discipline video understanding. InProceedings of the Computer Vision and Pattern Recognition Confer- ence, pages 8475–8489, 2025
work page 2025
-
[53]
Kejian Zhu, Zhuoran Jin, Hongbang Yuan, Jiachun Li, Shangqing Tu, Pengfei Cao, Yubo Chen, Kang Liu, and Jun Zhao. Mmr-v: What’s left unsaid? a benchmark for multimodal deep reasoning in videos.arXiv preprint arXiv:2506.04141, 2025
-
[54]
Which of the following best describes the action in the video?
Orr Zohar, Xiaohan Wang, Yann Dubois, Nikhil Mehta, Tong Xiao, Philippe Hansen-Estruch, Licheng Yu, Xiaofang Wang, Felix Juefei-Xu, Ning Zhang, et al. Apollo: An exploration of video understanding in large multimodal models. InProceedings of the Computer Vision and Pattern Recognition Conference, pages 18891–18901, 2025. 13 A DyBench: Construction, Compar...
work page 2025
-
[55]
Horizontal Flip Test (Spatial Sensitivity).If the video is mirrored horizontally (Left↔Right): • YES (Change): queries about left/right, clockwise/counter-clockwise, or horizontal orientation. • NO (No Change): queries about verticality (top/bottom, high/low, tall/short), color, identity, size, or count
-
[56]
Time-Reversal Test (Temporal Sensitivity).If the video is played in REVERSE (frames played backwards, time flows backward): 18 • YES (Change): directional actions where forward ̸= backward (opening→closing, picking up→putting down, entering →exiting); questions about FIRST/LAST or BEFORE/AFTER; questions about cause/result of a sequence. • NO (No Change):...
-
[57]
The question does not rely on left/right orientation or horizontal positioning.NO (No Change)
Horizontal Flip Test (Spatial Sensitivity).If we mirror the video horizontally (left becomes right), the action of the person putting down the clothes remains exactly the same. The question does not rely on left/right orientation or horizontal positioning.NO (No Change). 19 Q: This is a clip from a movie. What scene changes occur in this video? A: From th...
-
[58]
Time-Reversal Test (Temporal Sensitivity).If we play the video in reverse, “putting down” the clothes would look like the person is picking them up or “taking” them (Option B). Because the direction of time fundamentally changes the nature of this action, the correct answer would change.YES (Change). Fails Flip Test, Passes Time-Reversal Test⇒Temporal. Ex...
-
[59]
Horizontal Flip Test.The question asks about the reason an electric device did not work, specifically whether it was unplugged at the beginning or during the video. Flipping horizontally does not affect the state of the device or the actions performed on it.NO (No Change)
-
[60]
Time-Reversal Test.The question relies on the chronological order of events — distinguishing the state of the device “at the beginning of the video” versus an action occurring “during the video.” Under reversal, the beginning becomes the end, and an action like “unplugging” would look like “plugging in,” which would alter the correct answer.YES (Change). ...
-
[61]
Horizontal Flip Test.If the video is mirrored horizontally, the action of closing the refrigerator remains the same; it does not depend on left/right orientation.NO (No Change)
-
[62]
Time-Reversal Test.If the video is played in reverse, the action of closing the refrigerator would appear as opening the refrigerator. The answer would change from “Closed” to “Opened.”YES (Change). Fails Flip Test, Passes Time-Reversal Test⇒Temporal. 20 Example 4•Source: PerceptionTest•Decision:Spatial Q:In which hand does the person conceal a small obje...
-
[63]
Horizontal Flip Test.If the video is mirrored horizontally, the person’s right hand will appear as their left hand. The answer to “In which hand. . . ” would change from “right” to “left.”YES (Change)
-
[64]
The flow of time does not change the identity of the hand holding the object.NO (No Change)
Time-Reversal Test.If the video is played in reverse, the hand in which the object is concealed remains the same hand. The flow of time does not change the identity of the hand holding the object.NO (No Change). Passes Flip Test, Fails Time-Reversal Test⇒Spatial. Example 5•Source: PerceptionTest•Decision:Spatiotemporal Q:The person uses multiple similar o...
-
[65]
Horizontal Flip Test.The options explicitly rely on left/right spatial orientation (“first object from the left,” “second object from the left”). If the video is mirrored horizontally, the left-to-right order of the objects is reversed, changing the correct answer.YES (Change)
-
[66]
If the video is played in reverse, the original end of the game becomes the new beginning
Time-Reversal Test.The question specifically asks about the location of the hidden objectat the beginning of the game. If the video is played in reverse, the original end of the game becomes the new beginning. Since the object is moved around during the occlusion game, its location at the end is generally different from its location at the start, so the a...
-
[67]
The answer remains the same.NO (No Change)
Horizontal Flip Test.If the video is mirrored horizontally (left becomes right), the type of room the person is in (kitchen, living room/bedroom, bathroom) does not change. The answer remains the same.NO (No Change)
-
[68]
randomize on transformed videos
Time-Reversal Test.If the video is played in reverse, the setting or room where the person is located does not change. The person stays in the same room regardless of the direction of time. NO (No Change). Fails both tests⇒Static. D Reward Analysis D.1 The four CRR reward components Figure 7 visualizes the four reward terms introduced by CRPO. The origina...
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.