pith. machine review for the scientific record.

arxiv: 2605.01324 · v2 · submitted 2026-05-02 · 💻 cs.CV


Beyond Perceptual Shortcuts: Causal-Inspired Debiasing Optimization for Generalizable Video Reasoning in Lightweight MLLMs

Hongbo Chen, Hongfei Suo, Jingze Wu, Quan Zhang, Zeqiang Cai


Pith reviewed 2026-05-09 14:28 UTC · model grok-4.3

classification 💻 cs.CV
keywords video reasoning · lightweight multimodal models · perceptual shortcuts · causal debiasing · reinforcement learning fine-tuning · bias model · policy optimization · generalizable reasoning

The pith

Lightweight video reasoning models develop genuine causal understanding when a dedicated bias model and repulsive optimization push them away from perceptual shortcuts during RL fine-tuning.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper demonstrates that reinforcement learning on small multimodal models causes them to exploit easy perceptual patterns created by biases in video data rather than building true reasoning skills. It counters this by first training a separate bias model to capture those shortcut behaviors, then applying a causal debiasing policy optimization step whose repulsive objective steers the main model toward correct, generalizable answers while distancing it from the bias model's logic. This two-stage process allows the resulting model to reach strong benchmark results without any supervised fine-tuning and with only a fraction of the usual RL training data. A reader would care because it offers a concrete route to reliable video understanding in resource-limited systems that must operate on the edge.

Core claim

Reinforcement learning fine-tuning compels lightweight MLLMs to adopt perceptual shortcuts induced by data biases instead of developing genuine reasoning; a bias-aware training stage creates a dedicated bias model that embodies these shortcuts, after which the causal debiasing policy optimization algorithm applies a repulsive objective that actively distances the primary model from the bias model's flawed logic while aligning it with correct solutions, thereby producing more generalizable video reasoning.

What carries the argument

The causal debiasing policy optimization (CDPO) step, which pairs a bias model with a repulsive objective to drive the primary model away from shortcut logic toward generalizable reasoning.
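The paper's equations are not reproduced on this page, so the mechanism can only be sketched. In the toy sketch below, the function names, signatures, and the exact functional form are editorial assumptions, not the paper's objective: an attractive policy-gradient term pulls the primary model toward rewarded answers, while a repulsive term discounts any answer the bias model also finds likely, so shortcut-compatible successes earn less credit.

```python
import math

def shaped_advantage(reward, logp_bias, repulsion_weight=0.5):
    """Repulsive reward shaping (a toy sketch; this form is an
    assumption, not the paper's equation). Responses the bias model
    also finds likely earn a discounted advantage, so the primary
    model gains less from shortcut-compatible answers."""
    bias_likelihood = math.exp(logp_bias)  # bias model's probability of this response
    return reward - repulsion_weight * bias_likelihood

def cdpo_surrogate(logp_primary, reward, logp_bias):
    """Policy-gradient surrogate on the shaped advantage: minimizing
    it pushes the primary model toward correct answers and away from
    answers the bias model would also produce."""
    return -shaped_advantage(reward, logp_bias) * logp_primary

# A correct answer the bias model cannot produce (logp_bias = -8.0)
# keeps nearly its full advantage; a shortcut-compatible correct
# answer (logp_bias = -0.1) is discounted.
genuine = shaped_advantage(1.0, logp_bias=-8.0)
shortcut = shaped_advantage(1.0, logp_bias=-0.1)
```

The design choice being illustrated is that repulsion here acts through the reward channel rather than through an explicit divergence penalty; either reading is consistent with the abstract's "push away / pull toward" description.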

If this is right

  • The primary model reaches higher accuracy on standard video reasoning benchmarks than prior lightweight and even some larger models.
  • Only a small fraction of typical RL training data is required, with no supervised fine-tuning stage needed beforehand.
  • The approach reduces reliance on surface-level perceptual cues in favor of causal reasoning chains across video tasks.
  • Efficiency gains make robust video reasoning feasible for deployment on edge devices with limited compute.
  • Similar debiasing could be applied to other multimodal reasoning domains that suffer from data-induced shortcuts.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The repulsive objective might be combined with other regularization methods to further prevent the primary model from developing alternative unintended shortcuts.
  • Testing the framework on longer or more complex video sequences could check whether the causal separation scales to extended temporal reasoning.
  • The bias-model construction step could be adapted for text-only or image-only models facing analogous shortcut problems in reasoning tasks.
  • If the repulsive mechanism generalizes, it might reduce the need for ever-larger training corpora in multimodal systems by focusing optimization on what the model should avoid.

Load-bearing premise

The bias model faithfully encodes the perceptual shortcuts that RL would otherwise induce, and the repulsive objective reliably steers the primary model toward robust reasoning rather than new unintended shortcuts or performance loss.
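This premise is testable in principle. The probe below is an editorial sketch, not the paper's procedure; the function name, thresholds, and chance level are all assumptions. A bias model that truly encodes shortcuts should score well when the shortcut cue is present and collapse toward chance accuracy once the cue is removed.

```python
def encodes_shortcuts(acc_biased, acc_counterfactual, chance=0.25, tol=0.05):
    """Hypothetical fidelity probe (names and thresholds assumed):
    returns True only if the bias model both exploits the shortcut cue
    (accuracy drops sharply on counterfactual items) and has little
    genuine skill left without it (near-chance on those items)."""
    exploits_cue = acc_biased - acc_counterfactual > tol
    near_chance = abs(acc_counterfactual - chance) <= tol
    return exploits_cue and near_chance

# 80% with the cue, 26% without it: consistent with a shortcut encoder.
# 80% with the cue, 75% without it: the model has real features too,
# and repelling from it risks discarding useful signal.
```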

What would settle it

Compare performance on held-out video reasoning benchmarks that contain novel causal structures after standard RL fine-tuning versus after the full bias-model-plus-CDPO process; a large gap favoring the debiasing version would support the claim.
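That comparison reduces to a simple decision rule. The sketch below is illustrative only; the margin is an assumed threshold, not something the paper specifies.

```python
def settles_claim(acc_rl_only, acc_debiased, margin=0.02):
    """Decision-rule sketch for the proposed test (margin assumed):
    evaluate standard RL fine-tuning and the full bias-model-plus-CDPO
    pipeline on held-out benchmarks with novel causal structures, and
    ask whether the gap clearly exceeds the margin."""
    gap = acc_debiased - acc_rl_only
    return gap, gap > margin
```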

Figures

Figures reproduced from arXiv: 2605.01324 by Hongbo Chen, Hongfei Suo, Jingze Wu, Quan Zhang, Zeqiang Cai.

Figure 1: An illustration of observational versus inferential reason…
Figure 2: The difference between our framework and others.
Figure 3: Illustration of Perceptual Shortcuts in Counterfactual Reasoning Tasks. The video depicts a multi-object collision scenario.
Figure 4: An overview of our VideoThinker framework designed to mitigate perceptual shortcuts in reasoning. We first train a Bias Model…
Figure 5: Our modeling. (a) shows our causal assumption in the…
Figure 6: Performance of our method training on other datasets.
Figure 7: Qualitative examples of VideoThinker-R1's reasoning performance.
original abstract

Although reinforcement learning (RL) has significantly advanced reasoning capabilities in large multimodal language models (MLLMs), its efficacy remains limited for lightweight models essential for edge deployments. To address this issue, we leverage causal analysis and experiment to reveal the underlying phenomenon of perceptual bias, demonstrating that RL-based fine-tuning compels lightweight models to preferentially adopt perceptual shortcuts induced by data biases, rather than developing genuine reasoning abilities. Motivated by this insight, we propose VideoThinker, a causal-inspired framework that cultivates robust reasoning in lightweight models through a two-stage debiasing process. First, the Bias Aware Training stage forges a dedicated "bias model" to embody these shortcut behaviors. Then, the Causal Debiasing Policy Optimization (CDPO) algorithm fine-tunes the primary model, employing an innovative repulsive objective to actively push it away from the bias model's flawed logic while simultaneously pulling it toward correct, generalizable solutions. Our model, VideoThinker-R1, establishes a new state-of-the-art in video reasoning efficiency. For same-scale comparison, requiring no Supervised Fine-Tuning (SFT) and using only 1/10 of the training data for RL, it surpasses VideoRFT-3B with a 3.2% average gain on widely-used benchmarks and a 7% lead on VideoMME. For cross-scale comparison, it outperforms the larger Video-UTR-7B model on multiple benchmarks, including a 2.1% gain on MVBench and a 3.8% gain on TempCompass. Code is available at https://github.com/falonss703/VideoThinker.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper claims that reinforcement learning fine-tuning on lightweight multimodal LLMs for video reasoning induces reliance on perceptual shortcuts from data biases rather than genuine reasoning. It introduces VideoThinker, a two-stage causal debiasing framework: (1) Bias Aware Training to create a dedicated bias model capturing shortcut behaviors, and (2) Causal Debiasing Policy Optimization (CDPO) that applies a repulsive objective to push the primary model away from the bias model's logic while attracting it to correct answers. VideoThinker-R1 reportedly achieves SOTA efficiency, outperforming VideoRFT-3B by 3.2% average on benchmarks and 7% on VideoMME using 1/10 the RL data and no SFT, while also beating larger models like Video-UTR-7B on several tasks.

Significance. If the debiasing mechanism holds and generalizes, this could meaningfully advance efficient, robust video reasoning in resource-constrained MLLMs for edge applications. The reported data efficiency (no SFT, 1/10 RL data) and cross-scale gains would be notable contributions if supported by rigorous controls and ablations. The causal framing and explicit bias model are conceptually interesting for addressing shortcut learning in RL-tuned models.

major comments (2)
  1. [Abstract] Abstract and central claim: The repulsive objective in CDPO is presented as driving the primary model toward generalizable reasoning, but the manuscript provides no ablations isolating the repulsive term's effect, no direct probes measuring shortcut usage (e.g., controlled bias tests), and no verification that the bias model encodes only undesired shortcuts rather than useful features. This is load-bearing because the performance gains (3.2% average, 7% on VideoMME) cannot be attributed to the proposed mechanism without such evidence.
  2. [Abstract] Circularity concern in two-stage process: The bias model is trained on the same data distribution as the primary model, yet the paper does not demonstrate that the 'correct' solutions or repulsive targets are identified independently of the bias model itself. Without this, the CDPO objective risks encoding information already present in training rather than truly debiasing, undermining the causal analysis that RL induces shortcuts.
minor comments (1)
  1. [Abstract] The abstract mentions 'causal analysis and experiment' but does not specify the experimental design, baseline implementations, or statistical significance testing for the reported gains.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive feedback and for recognizing the potential significance of our causal debiasing approach for lightweight video reasoning models. We address each major comment point by point below, providing clarifications based on the manuscript while agreeing to strengthen the presentation where evidence was insufficient.

point-by-point responses
  1. Referee: [Abstract] Abstract and central claim: The repulsive objective in CDPO is presented as driving the primary model toward generalizable reasoning, but the manuscript provides no ablations isolating the repulsive term's effect, no direct probes measuring shortcut usage (e.g., controlled bias tests), and no verification that the bias model encodes only undesired shortcuts rather than useful features. This is load-bearing because the performance gains (3.2% average, 7% on VideoMME) cannot be attributed to the proposed mechanism without such evidence.

    Authors: We acknowledge that the abstract and central claims would benefit from more explicit isolation of the repulsive term's contribution. The full manuscript reports overall performance gains through comparisons to baselines like VideoRFT-3B, but does not include dedicated ablations removing only the repulsive component of CDPO or controlled bias probes (e.g., accuracy on bias-augmented test subsets). We will revise by adding a dedicated ablation subsection that compares CDPO with and without the repulsive objective, reports shortcut usage metrics (such as reliance on visual cues vs. temporal reasoning on curated subsets), and provides qualitative/quantitative analysis of bias model outputs to confirm predominant capture of undesired perceptual shortcuts rather than useful features. These additions will allow direct attribution of gains to the debiasing mechanism. revision: yes

  2. Referee: [Abstract] Circularity concern in two-stage process: The bias model is trained on the same data distribution as the primary model, yet the paper does not demonstrate that the 'correct' solutions or repulsive targets are identified independently of the bias model itself. Without this, the CDPO objective risks encoding information already present in training rather than truly debiasing, undermining the causal analysis that RL induces shortcuts.

    Authors: This is a valid concern regarding potential circularity. In the proposed framework, the bias model is trained to capture shortcut behaviors via a dedicated objective on the training distribution, while correct solutions are sourced directly from ground-truth labels in the video reasoning benchmarks (independent of any model outputs). Repulsive targets are then derived by contrasting the primary model's policy against the bias model's predictions on identical inputs. However, the manuscript does not explicitly demonstrate independence through held-out identification of targets or ablations. We will revise the methods and causal analysis sections to clarify the role of ground-truth labels, add an experiment using held-out data for repulsive target generation, and include results showing the debiasing effect holds under this stricter separation. This will better substantiate the claim that RL induces shortcuts addressable via the two-stage process. revision: yes

Circularity Check

0 steps flagged

No significant circularity detected

full rationale

The paper's derivation consists of an empirical observation (RL fine-tuning induces perceptual shortcuts in lightweight MLLMs, shown via experiments) followed by a proposed two-stage algorithmic intervention (Bias Aware Training to create a bias model, then CDPO with repulsive+attractive objectives). These steps are presented as a new training procedure whose validity is assessed through external benchmark comparisons (e.g., gains on VideoMME, MVBench) rather than any mathematical reduction to the inputs by construction. No equations are shown to equate a 'prediction' with a fitted parameter, no self-citation chain bears the central claim, and the bias model is explicitly a separate trained component rather than a definitional tautology. The method is self-contained against the reported benchmarks and does not rely on imported uniqueness theorems or renamed known results.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 2 invented entities

The central claim rests on the unproven premise that perceptual bias is the dominant failure mode of RL in lightweight MLLMs and on the new constructs of the bias model and CDPO objective, neither of which has independent validation outside this work.

axioms (1)
  • domain assumption RL fine-tuning on biased video data causes lightweight MLLMs to adopt perceptual shortcuts instead of genuine reasoning
    Stated as the key insight revealed by causal analysis and experiment in the abstract
invented entities (2)
  • bias model no independent evidence
    purpose: to embody the shortcut behaviors induced by data biases
    New auxiliary model trained in the first stage of the framework
  • Causal Debiasing Policy Optimization (CDPO) no independent evidence
    purpose: to fine-tune the primary model using a repulsive objective away from the bias model
    Novel optimization algorithm introduced in the second stage

pith-pipeline@v0.9.0 · 5611 in / 1464 out tokens · 37423 ms · 2026-05-09T14:28:34.629050+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

51 extracted references · 12 canonical work pages · 6 internal anchors

  1. Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, Y. K. Li, Y. Wu, et al. DeepSeekMath: Pushing the limits of mathematical reasoning in open language models. arXiv:2402.03300, 2024.
  2. Wenyi Xiao and Leilei Gan. Fast-slow thinking GRPO for large vision-language model reasoning. In NeurIPS, 2025.
  3. Ye Wang, Ziheng Wang, Boshen Xu, Yang Du, Kejun Lin, Zihan Xiao, Zihao Yue, Jianzhong Ju, Liang Zhang, Dingyi Yang, Xiangnan Fang, Zewen He, Zhenbo Luo, Wenxuan Wang, Junqi Lin, Jian Luan, and Qin Jin. Time-R1: Post-training large vision language model for temporal video grounding. In NeurIPS, 2025.
  4. Kaituo Feng, Kaixiong Gong, Bohao Li, Zonghao Guo, Yibing Wang, Tianshuo Peng, Junfei Wu, Xiaoying Zhang, Benyou Wang, and Xiangyu Yue. Video-R1: Reinforcing video reasoning in MLLMs. In NeurIPS, 2025.
  5. Xinhao Li, Ziang Yan, Desen Meng, Lu Dong, Xiangyu Zeng, Yinan He, Yali Wang, Yu Qiao, Yi Wang, and Limin Wang. VideoChat-R1: Enhancing spatio-temporal perception via reinforcement fine-tuning. arXiv:2504.06958, 2025.
  6. Jisheng Dang, Jingze Wu, Teng Wang, Xuanhui Lin, Nannan Zhu, Hongbo Chen, Wei-Shi Zheng, Meng Wang, and Tat-Seng Chua. Reinforcing video reasoning with focused thinking. arXiv:2505.24718, 2025.
  7. Xingjian Zhang, Siwei Wen, Wenjun Wu, and Lei Huang. TinyLLaVA-Video-R1: Towards smaller LMMs for video reasoning. arXiv:2504.09641, 2025.
  8. Qi Wang, Yanrui Yu, Ye Yuan, Rui Mao, and Tianfei Zhou. VideoRFT: Incentivizing video reasoning capability in MLLMs via reinforced fine-tuning. In NeurIPS, 2025.
  9. Kexin Yi, Chuang Gan, Yunzhu Li, Pushmeet Kohli, Jiajun Wu, Antonio Torralba, and Joshua B. Tenenbaum. CLEVRER: Collision events for video representation and reasoning. In ICLR, 2020.
  10. Xuweiyi Chen, Ziqiao Ma, Xuejun Zhang, Sihan Xu, Shengyi Qian, Jianing Yang, David Fouhey, and Joyce Chai. Multi-object hallucination in vision language models. In NeurIPS, volume 37, pages 44393–44418, 2024.
  11. Mustafa Shukor, Alexandre Rame, Corentin Dancette, and Matthieu Cord. Beyond task performance: evaluating and reducing the flaws of large multimodal models with in-context-learning. In ICLR, 2024.
  12. Zhongxing Xu, Chengzhi Liu, Qingyue Wei, Juncheng Wu, James Zou, Xin Eric Wang, Yuyin Zhou, and Sheng Liu. More thinking, less seeing? Assessing amplified hallucination in multimodal reasoning models. In NeurIPS, 2025.
  13. Shuai Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Sibo Song, Kai Dang, Peng Wang, Shijie Wang, Jun Tang, et al. Qwen2.5-VL technical report. arXiv:2502.13923, 2025.
  14. Chaoyou Fu, Yuhan Dai, Yongdong Luo, Lei Li, Shuhuai Ren, Renrui Zhang, Zihan Wang, Chenyu Zhou, Yunhang Shen, Mengdan Zhang, et al. Video-MME: The first-ever comprehensive evaluation benchmark of multi-modal LLMs in video analysis. In CVPR, pages 24108–24118, 2025.
  15. Yufei Zhan, Yousong Zhu, Shurong Zheng, Hongyin Zhao, Fan Yang, Ming Tang, and Jinqiao Wang. Vision-R1: Evolving human-free alignment in large vision-language models via vision-guided reinforcement learning. arXiv:2503.18013, 2025.
  16. Antonio Torralba and Alexei A. Efros. Unbiased look at dataset bias. In CVPR, pages 1521–1528, 2011.
  17. Quan Zhang, Jianhuang Lai, Zhanxiang Feng, and Xiaohua Xie. Seeing like a human: Asynchronous learning with dynamic progressive refinement for person re-identification. IEEE TIP, 31:352–365, 2021.
  18. Quan Zhang, Jianhuang Lai, Xiaohua Xie, Xiaofeng Jin, and Sien Huang. Separable spatial-temporal residual graph for cloth-changing group re-identification. IEEE TPAMI, 46(8):5791–5805, 2024.
  19. Zefeng Zhang, Hengzhu Tang, Jiawei Sheng, Zhenyu Zhang, Yiming Ren, Zhenyang Li, Dawei Yin, Duohe Ma, and Tingwen Liu. Debiasing multimodal large language models via noise-aware preference optimization. In CVPR, pages 9423–9433, 2025.
  20. Qingyi Si, Fandong Meng, Mingyu Zheng, Zheng Lin, Yuanxin Liu, Peng Fu, Yanan Cao, Weiping Wang, and Jie Zhou. Language prior is not the only shortcut: A benchmark for shortcut learning in VQA. In EMNLP Findings, pages 3698–3712, 2022.
  21. Chuanqi Zang, Hanqing Wang, Mingtao Pei, and Wei Liang. Discovering the real association: Multimodal causal reasoning in video question answering. In CVPR, pages 19027–19036, 2023.
  22. Guanyu Zhou, Yibo Yan, Xin Zou, Kun Wang, Aiwei Liu, and Xuming Hu. Mitigating modality prior-induced hallucinations in multimodal large language models via deciphering attention causality. In ICLR, 2025.
  23. Jean Park, Kuk Jin Jang, Basam Alasaly, Sriharsha Mopidevi, Andrew Zolensky, Eric Eaton, Insup Lee, and Kevin Johnson. Assessing modality bias in video question answering benchmarks with multimodal large language models. In AAAI, volume 39, pages 19821–19829, 2025.
  24. Yulei Niu, Kaihua Tang, Hanwang Zhang, Zhiwu Lu, Xian-Sheng Hua, and Ji-Rong Wen. Counterfactual VQA: A cause-effect look at language bias. In CVPR, pages 12700–12710.
  25. Meiqi Chen, Yixin Cao, Yan Zhang, and Chaochao Lu. Quantifying and mitigating unimodal biases in multimodal large language models: A causal perspective. In EMNLP Findings, pages 16449–16469, 2024.
  26. Hung-Ting Su, Yulei Niu, Xudong Lin, Winston H. Hsu, and Shih-Fu Chang. Language models are causal knowledge extractors for zero-shot video question answering. In CVPR, pages 4950–4959, 2023.
  27. Yan Tai, Weichen Fan, Zhao Zhang, and Ziwei Liu. Link-context learning for multimodal LLMs. In CVPR, pages 27176–27185, 2024.
  28. Shitian Zhao, Zhuowan Li, Yadong Lu, Alan Yuille, and Yan Wang. Causal-CoG: A causal-effect look at context generation for boosting multi-modal language models. In CVPR, pages 13342–13351, 2024.
  29. Weixing Chen, Yang Liu, Binglin Chen, Jiandong Su, Yongsen Zheng, and Liang Lin. Cross-modal causal relation alignment for video question grounding. In CVPR, pages 24087–24096, 2025.
  30. Wei Qin, Hanwang Zhang, Richang Hong, Ee-Peng Lim, and Qianru Sun. Causal interventional training for image recognition. IEEE TMM, 25:1033–1044, 2021.
  31. Xingxuan Zhang, Peng Cui, Renzhe Xu, Linjun Zhou, Yue He, and Zheyan Shen. Deep stable learning for out-of-distribution generalization. In CVPR, pages 5372–5382, 2021.
  32. Quan Zhang, Jianhuang Lai, and Xiaohua Xie. Learning modal-invariant angular metric by cyclic projection network for VIS-NIR person re-identification. IEEE TIP, 30:8019–8033, 2021.
  33. Quan Zhang, Jianhuang Lai, Junyong Zhu, and Xiaohua Xie. Wavelet-guided promotion-suppression transformer for surface-defect detection. IEEE TIP, 32:4517–4528, 2023.
  34. Quan Zhang, Lei Wang, Vishal M. Patel, Xiaohua Xie, and Jianhaung Lai. View-decoupled transformer for person re-identification under aerial-ground camera network. In CVPR, pages 22000–22009, 2024.
  35. Kairui Hu, Penghao Wu, Fanyi Pu, Wang Xiao, Yuanhan Zhang, Xiang Yue, Bo Li, and Ziwei Liu. Video-MMMU: Evaluating knowledge acquisition from multi-discipline professional videos. arXiv:2501.13826, 2025.
  36. KunChang Li, Yinan He, Yi Wang, Yizhuo Li, Wenhai Wang, Ping Luo, Yali Wang, Limin Wang, and Yu Qiao. VideoChat: Chat-centric video understanding. arXiv:2305.06355, 2023.
  37. Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Peiyi Wang, Qihao Zhu, Runxin Xu, Ruoyu Zhang, Shirong Ma, Xiao Bi, et al. DeepSeek-R1 incentivizes reasoning in LLMs through reinforcement learning. Nature, 645(8081):633–638, 2025.
  38. Qiying Yu, Zheng Zhang, Ruofei Zhu, Yufeng Yuan, Xiaochen Zuo, Yu Yue, Weinan Dai, Tiantian Fan, Gaohong Liu, Juncai Liu, et al. DAPO: An open-source LLM reinforcement learning system at scale.
  39. Judea Pearl. Causal inference in statistics: An overview.
  40. Judea Pearl. [Bayesian analysis in expert systems]: Comment: Graphical models, causality and intervention. Statistical Science, 8(3):266–269, 1993.
  41. Sander Greenland, Judea Pearl, and James M. Robins. Causal diagrams for epidemiologic research. Epidemiology, 10(1):37–48, 1999.
  42. Yanwei Li, Chengyao Wang, and Jiaya Jia. LLaMA-VID: An image is worth 2 tokens in large language models. In ECCV, pages 323–340. Springer, 2024.
  43. Zesen Cheng, Sicong Leng, Hang Zhang, Yifei Xin, Xin Li, Guanzheng Chen, Yongxin Zhu, Wenqi Zhang, Ziyang Luo, Deli Zhao, et al. VideoLLaMA 2: Advancing spatial-temporal modeling and audio understanding in video-LLMs. arXiv:2406.07476, 2024.
  44. Peiyuan Zhang, Kaichen Zhang, Bo Li, Guangtao Zeng, Jingkang Yang, Yuanhan Zhang, Ziyue Wang, Haoran Tan, Chunyuan Li, and Ziwei Liu. Long context transfer from language to vision. arXiv:2406.16852, 2024.
  45. Ji Lin, Hongxu Yin, Wei Ping, Pavlo Molchanov, Mohammad Shoeybi, and Song Han. VILA: On pre-training for visual language models. In CVPR, pages 26689–26699, 2024.
  46. En Yu, Kangheng Lin, Liang Zhao, Yana Wei, Zining Zhu, Haoran Wei, Jianjian Sun, Zheng Ge, Xiangyu Zhang, Jingyu Wang, et al. Unhackable temporal rewarding for scalable video MLLMs. arXiv:2502.12081, 2025.
  47. Yilun Zhao, Haowei Zhang, Lujing Xie, Tongyan Hu, Guo Gan, Yitao Long, Zhiyuan Hu, Weiyuan Chen, Chuhan Li, Zhijian Xu, et al. MMVU: Measuring expert-level multi-discipline video understanding. In CVPR, pages 8475–8489.
  48. Junhao Cheng, Yuying Ge, Teng Wang, Yixiao Ge, Jing Liao, and Ying Shan. Video-Holmes: Can MLLM think like Holmes for complex video reasoning? arXiv:2505.21374, 2025.
  49. Kunchang Li, Yali Wang, Yinan He, Yizhuo Li, Yi Wang, Yi Liu, Zun Wang, Jilan Xu, Guo Chen, Ping Luo, et al. MVBench: A comprehensive multi-modal video understanding benchmark. In CVPR, pages 22195–22206, 2024.
  50. Yuanxin Liu, Shicheng Li, Yi Liu, Yuxiang Wang, Shuhuai Ren, Lei Li, Sishuo Chen, Xu Sun, and Lu Hou. TempCompass: Do video LLMs really understand videos? In ACL Findings, pages 8731–8772, 2024.
  51. Viorica Patraucean, Lucas Smaira, Ankush Gupta, Adria Recasens Continente, Larisa Markeeva, Dylan Sunil Banarse, Skanda Koppula, Joseph Heyward, Mateusz Malinowski, Yi Yang, Carl Doersch, Tatiana Matejovicova, Yury Sulsky, Antoine Miech, Alexandre Fréchette, Hanna Klimczak, Raphael Koster, Junlin Zhang, Stephanie Winkler, Yusuf Aytar, Simon Osindero, et al.