pith. machine review for the scientific record.

arxiv: 2602.17555 · v3 · submitted 2026-02-19 · 💻 cs.CV

Recognition: no theorem link

GraphThinker: Reinforcing Temporally Grounded Video Reasoning with Event Graph Thinking

Authors on Pith: no claims yet

Pith reviewed 2026-05-15 20:44 UTC · model grok-4.3

classification 💻 cs.CV
keywords video reasoning · temporal hallucinations · event scene graphs · reinforcement finetuning · multimodal LLMs · moment localization · visual grounding

The pith

GraphThinker builds event graphs and applies visual rewards in reinforcement finetuning to ground MLLM video reasoning and reduce temporal hallucinations.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper seeks to fix severe temporal hallucinations in multimodal large language models during video reasoning, where models fabricate event sequences instead of grounding answers in actual footage. It introduces a structured Event-based Video Scene Graph to explicitly model relations inside and between events, then uses reinforcement finetuning plus a visual attention reward to force the model to attend to reliable visual evidence rather than unstructured text descriptions. This combination produces concrete gains on moment localization and hallucination benchmarks. A sympathetic reader cares because reliable temporal understanding is essential for any video AI application from search to surveillance, and current models fail precisely on the ordering and causation of events.

Core claim

GraphThinker employs an MLLM to construct an Event-based Video Scene Graph (EVSG) that captures both intra- and inter-event relations and then applies reinforcement finetuning with a novel visual attention reward that encourages active focus on reliable visual cues, jointly reducing reasoning hallucinations while improving grounding.

What carries the argument

Event-based Video Scene Graph (EVSG) that explicitly encodes intra- and inter-event relations to supply causal constraints for structured reasoning, paired with a visual attention reward inside reinforcement finetuning.
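The paper describes the EVSG only at this level of abstraction. A minimal sketch of what such a graph might look like as a data structure, with node, edge, and event fields that are illustrative assumptions rather than the paper's actual schema:

```python
from dataclasses import dataclass, field

# Hypothetical sketch of an Event-based Video Scene Graph (EVSG).
# The node/edge schema here is an illustrative assumption, not the paper's.

@dataclass(frozen=True)
class Node:
    name: str       # entity or action keyword, e.g. "person", "door"
    event_id: int   # index of the event segment the node belongs to

@dataclass(frozen=True)
class Edge:
    src: str
    rel: str                   # intra-event: "opens"; inter-event: "before", "causes"
    dst: str
    inter_event: bool = False  # True when the edge links two different events

@dataclass
class EVSG:
    nodes: list[Node] = field(default_factory=list)
    edges: list[Edge] = field(default_factory=list)

    def relations_between(self, e1: int, e2: int) -> list[Edge]:
        """Inter-event edges whose endpoints lie in events e1 and e2."""
        event_of = {n.name: n.event_id for n in self.nodes}
        return [e for e in self.edges
                if e.inter_event
                and event_of.get(e.src) == e1
                and event_of.get(e.dst) == e2]

g = EVSG(
    nodes=[Node("person", 0), Node("door", 0), Node("room", 1)],
    edges=[Edge("person", "opens", "door"),                      # intra-event relation
           Edge("person", "enters", "room", inter_event=True)],  # inter-event relation
)
print([e.rel for e in g.relations_between(0, 1)])  # ['enters']
```

The point of the inter-event edges is exactly the causal constraint the pith highlights: a reasoning step about event order can be checked against explicit `before`/`causes` edges instead of token correlations.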

If this is right

  • Over 4% improvement in IoU=0.3 for moment localisation on the RexTime dataset.
  • 9.8% improvement in reducing temporal sequence hallucination on the VidHalluc dataset.
  • 7.6% gain in Binary QA performance for reducing action hallucination on VidHalluc.
  • Explicit structured graphs replace reliance on unstructured dense captions, supplying causal constraints that guide reasoning.
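The headline localisation metric can be made concrete: temporal IoU between a predicted and a ground-truth moment, with recall at IoU=0.3 counting the fraction of queries whose prediction clears the threshold. A sketch with made-up segments:

```python
# The RexTime-style metric made concrete: temporal IoU between predicted and
# ground-truth moments, and recall at IoU=0.3. The segments below are made up.

def temporal_iou(pred, gt):
    """IoU of two (start, end) segments in seconds."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - inter
    return inter / union if union > 0 else 0.0

def recall_at_iou(preds, gts, thresh=0.3):
    """Fraction of queries whose prediction clears the IoU threshold."""
    hits = sum(temporal_iou(p, g) >= thresh for p, g in zip(preds, gts))
    return hits / len(gts)

preds = [(2.0, 7.0), (10.0, 12.0), (30.0, 40.0)]
gts   = [(3.0, 8.0), (20.0, 25.0), (31.0, 39.0)]
print(recall_at_iou(preds, gts))  # 2 of 3 predictions clear IoU >= 0.3
```

An "over 4% improvement in IoU=0.3" then means this hit fraction rises by more than four percentage points over the strongest baseline.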

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same graph-construction plus reward mechanism could be tested on audio-video or multi-shot reasoning tasks to check whether the hallucination reduction generalizes beyond single-camera clips.
  • If graph accuracy proves to be the bottleneck, lightweight human-in-the-loop correction of the EVSG during training might further amplify the gains without full retraining.
  • The approach suggests a route for applying explicit relational structures to other hallucination-prone domains such as long-form text generation or image captioning.

Load-bearing premise

An MLLM can reliably construct an accurate Event-based Video Scene Graph that captures true event relations without introducing its own hallucinations or errors.

What would settle it

On a new set of videos with human-annotated event graphs, measure whether the automatically constructed EVSG matches the annotations; if mismatch is high and the reported gains in localization and hallucination reduction disappear, the central claim fails.
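One way to score that comparison is to canonicalise both the predicted EVSG and the human annotation as sets of (subject, relation, object) triples and compute edge-level precision, recall, and F1. The triples below are hypothetical:

```python
# Sketch of the proposed fidelity check: canonicalise both the predicted EVSG
# and the human-annotated graph as sets of (subject, relation, object) triples
# and score edge-level precision / recall / F1. All triples are hypothetical.

def prf(pred: set, gold: set):
    tp = len(pred & gold)
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(gold) if gold else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

pred_edges = {("person", "opens", "door"),
              ("person", "enters", "room"),
              ("door", "before", "room")}     # spurious edge hallucinated by the MLLM
gold_edges = {("person", "opens", "door"),
              ("person", "enters", "room"),
              ("person", "sits", "chair")}    # edge the MLLM missed

p, r, f1 = prf(pred_edges, gold_edges)
print(round(p, 2), round(r, 2), round(f1, 2))  # 0.67 0.67 0.67
```

If this F1 is low on held-out videos while the end-task gains persist, the gains come from somewhere other than the graph; if both collapse together, the premise fails as stated.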

Figures

Figures reproduced from arXiv: 2602.17555 by Da Li, Jian Hu, Shaogang Gong, Wei Li, Yuhang Zang, Ziquan Liu, Zixu Cheng.

Figure 1
Figure 1: Current MLLMs [2] implicitly model event relations in video by token correlations. This often leads to hallucinations in video reasoning. For instance, when determining the temporal order of events, Qwen2.5-VL tends to yield temporal hallucinations due to a lack of explicit event-level evidence validation and reasoning. In contrast, our GraphThinker explicitly models both intra- and inter-event relations … view at source ↗
Figure 2
Figure 2: An overview of the GraphThinker video reasoning model. GraphThinker first employs an MLLM to generate multi-grained dense captions in sequence for a video. It deploys the same MLLM again to select keywords as graph nodes before iteratively optimizing them in constructing an event-based graph (EVSG). This EVSG is then used to serve as a fine-grained representation of structured event relations for reasoning… view at source ↗
Figure 3
Figure 3: An example of the proposed Event-based Video Scene Graph (EVSG) for video reasoning. The EVSG is composed of event … view at source ↗
Figure 4
Figure 4: A visual example showing that our method reduces hallucination during reasoning. Qwen2.5-VL still yields hallucination in … view at source ↗
read the original abstract

Video reasoning requires a fine-grained understanding of the temporal dependencies and event-level relations between objects and events in videos. Current Multimodal Large Language Models (MLLMs) are prone to severe temporal hallucinations in video reasoning. An underlying cause of these hallucinations is weak visual-temporal grounding and the lack of explicit structure for modelling event relations. Models often rely on auxiliary text, such as dense captions, rather than explicitly anchoring their reasoning in actual visual evidence. However, these textual representations are inherently unstructured and fail to provide explicit causal constraints needed to guide the model's reasoning. In this work, we propose GraphThinker, a reinforcement finetuning method that constructs a structured event representation of a video and enforces visual grounding to jointly reduce reasoning hallucinations. Specifically, we employ an MLLM to construct an Event-based Video Scene Graph (EVSG) that captures both intra- and inter-event relations, guiding a structured video reasoning process. Moreover, we address the weak grounding issue by introducing a novel visual attention reward during reinforcement finetuning that encourages the model to actively attend to reliable visual cues. On the RexTime dataset, GraphThinker achieves an over 4% improvement in IoU=0.3 for moment localisation. On the VidHalluc dataset, GraphThinker achieves a 9.8% improvement in reducing temporal sequence hallucination and a 7.6% gain in Binary QA in reducing action hallucination, compared to the state-of-the-art methods.
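The abstract describes the visual attention reward only in prose. One plausible, purely illustrative form, not the paper's actual formulation: reward the share of each answer token's visual attention mass that lands on the ground-truth evidence frames.

```python
# Purely illustrative form of a visual-attention reward; the paper's actual
# formulation is not reproduced here. Assumption: reward the share of each
# answer token's visual attention mass that lands on ground-truth evidence
# frames, averaged over answer tokens.

def visual_attention_reward(attn, visual_idx, evidence_idx):
    """attn: one row per answer token, attention weights over context tokens."""
    ratios = []
    for row in attn:
        on_visual = sum(row[i] for i in visual_idx)      # mass on all visual tokens
        on_evidence = sum(row[i] for i in evidence_idx)  # mass on grounded frames
        ratios.append(on_evidence / max(on_visual, 1e-8))
    return sum(ratios) / len(ratios)

# Two answer tokens over six context positions: 0-3 visual frames, 4-5 text.
attn = [[0.30, 0.30, 0.10, 0.10, 0.10, 0.10],
        [0.05, 0.05, 0.40, 0.40, 0.05, 0.05]]
r = visual_attention_reward(attn, visual_idx=[0, 1, 2, 3], evidence_idx=[2, 3])
print(round(r, 3))  # 0.569: the second token is well grounded, the first is not
```

Any reward of this shape is differentiable-free and fits naturally as an extra term in the RL objective, which is presumably why the authors pair it with reinforcement finetuning rather than supervised loss.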

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces GraphThinker, a reinforcement finetuning method for MLLMs in video reasoning. It first uses an MLLM to construct an Event-based Video Scene Graph (EVSG) encoding intra- and inter-event relations, then applies RL with a novel visual-attention reward to enforce grounding and reduce temporal and action hallucinations. Empirical claims include >4% IoU@0.3 gain on RexTime moment localization and 9.8%/7.6% gains on VidHalluc for temporal-sequence and action hallucination reduction versus SOTA baselines.

Significance. If the EVSG proves reliable and the gains are robust to controls, the work offers a concrete route to inject explicit event structure into video MLLMs, addressing a documented weakness in temporal grounding. The combination of graph construction with visual-reward RL is a clear methodological contribution that could generalize beyond the two evaluated datasets.

major comments (2)
  1. [Abstract and §3] Abstract and §3 (EVSG construction): the central claim that the EVSG supplies reliable structured grounding is load-bearing, yet no quantitative validation of EVSG fidelity (human agreement, graph-edit distance to ground truth, or node/edge error rates) is reported. Because the EVSG is generated by the same MLLM family known to hallucinate temporally, unverified construction errors could propagate unchanged into the RL stage.
  2. [Abstract and §4] Abstract and §4 (experiments): the reported percentage gains are presented without naming the precise SOTA baselines, number of random seeds, statistical significance tests, or ablation removing the EVSG component. Without these controls it is impossible to isolate whether the improvements derive from the claimed graph-based reasoning or from the visual-attention reward alone.
minor comments (2)
  1. [§3.2] Notation for the visual-attention reward should be introduced with an explicit equation rather than prose description only.
  2. [Figure 2] Figure captions for the EVSG examples should include the exact prompt template used to elicit the graph from the MLLM.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed comments. We address each major point below and will revise the manuscript to incorporate the suggested improvements where appropriate.

read point-by-point responses
  1. Referee: [Abstract and §3] Abstract and §3 (EVSG construction): the central claim that the EVSG supplies reliable structured grounding is load-bearing, yet no quantitative validation of EVSG fidelity (human agreement, graph-edit distance to ground truth, or node/edge error rates) is reported. Because the EVSG is generated by the same MLLM family known to hallucinate temporally, unverified construction errors could propagate unchanged into the RL stage.

    Authors: We agree that the absence of quantitative validation for EVSG fidelity is a limitation in the current manuscript. The paper relies on qualitative examples and end-task gains to support the EVSG's utility, without reporting human agreement, graph-edit distance, or error rates. To address this, we will add a human evaluation subsection (and corresponding appendix) assessing EVSG quality on a 100-video subset, reporting node/edge precision-recall against human annotations. We will also explicitly discuss potential error propagation and how the visual-attention reward in RL is intended to reduce its impact by prioritizing visual grounding over potentially noisy graph elements. revision: yes

  2. Referee: [Abstract and §4] Abstract and §4 (experiments): the reported percentage gains are presented without naming the precise SOTA baselines, number of random seeds, statistical significance tests, or ablation removing the EVSG component. Without these controls it is impossible to isolate whether the improvements derive from the claimed graph-based reasoning or from the visual-attention reward alone.

    Authors: We acknowledge that the experimental reporting lacks sufficient controls for reproducibility and isolation of contributions. In the revision we will: (1) explicitly name all SOTA baselines (e.g., VideoChatGPT, LLaVA-Video, and Video-LLaMA variants); (2) report results averaged over 3 random seeds with standard deviations; (3) add paired t-test p-values for statistical significance; and (4) include a new ablation that removes the EVSG construction step while retaining the visual-attention reward, directly comparing against the full model to quantify the graph component's contribution. revision: yes
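The seed-averaging and significance protocol the rebuttal commits to is easy to pin down. A sketch using hypothetical per-seed scores (not numbers from the paper) and the standard two-sided t critical value for df = 2:

```python
import math
from statistics import mean, stdev

# Sketch of the promised protocol. The per-seed scores below are hypothetical
# placeholders, not numbers from the paper.

def paired_t(a, b):
    """Paired t statistic over per-seed scores of two models."""
    diffs = [x - y for x, y in zip(a, b)]
    return mean(diffs) / (stdev(diffs) / math.sqrt(len(diffs)))

graphthinker = [62.1, 61.8, 62.4]  # hypothetical IoU=0.3 recall per seed
baseline     = [57.9, 58.2, 57.6]

t = paired_t(graphthinker, baseline)
# Two-sided critical value for df = 2 at alpha = 0.05 is 4.303.
print(round(t, 2), "significant" if t > 4.303 else "not significant")
```

With only 3 seeds the test has df = 2, so the critical value is large (4.303); gains need to be both sizable and stable across seeds to register as significant.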

Circularity Check

0 steps flagged

No significant circularity; empirical method with external benchmarks

full rationale

The paper proposes GraphThinker as a reinforcement finetuning pipeline that first builds an EVSG via MLLM and then applies a visual-attention reward. All reported gains (IoU on RexTime, hallucination reductions on VidHalluc) are presented as direct empirical comparisons against external SOTA baselines on held-out datasets. No equations, fitted parameters, or self-citations are invoked to derive these numbers from the method's own inputs; the derivation chain consists of standard RL steps whose outputs are measured independently. The central claims therefore remain self-contained against external evaluation.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The central claim rests on the domain assumption that MLLMs can produce reliable event graphs and that the attention reward will successfully enforce visual grounding during RL.

axioms (1)
  • domain assumption MLLMs can construct meaningful Event-based Video Scene Graphs that capture true event relations
    The method directly employs an MLLM to build the EVSG as the core structured representation.
invented entities (1)
  • Event-based Video Scene Graph (EVSG) no independent evidence
    purpose: To provide explicit intra- and inter-event relations that guide structured video reasoning
    New representation introduced to replace unstructured text captions

pith-pipeline@v0.9.0 · 5584 in / 1298 out tokens · 37563 ms · 2026-05-15T20:44:49.440450+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

74 extracted references · 74 canonical work pages · 12 internal anchors

  1. [1]

    The claude 3 model family: Opus, sonnet, haiku.Claude-3 Model Card, 1(1):4, 2024

    AI Anthropic. The claude 3 model family: Opus, sonnet, haiku.Claude-3 Model Card, 1(1):4, 2024. 6

  2. [2]

    Shuai Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Sibo Song, Kai Dang, Peng Wang, Shijie Wang, Jun Tang, et al. Qwen2. 5-vl technical report.arXiv preprint arXiv:2502.13923, 2025. 1, 6, 7

  3. [3]

    Perturbollava: Reducing multimodal hallucinations with per- turbative visual training.arXiv preprint arXiv:2503.06486,

    Cong Chen, Mingyu Liu, Chenchen Jing, Yizhou Zhou, Fengyun Rao, Hao Chen, Bo Zhang, and Chunhua Shen. Perturbollava: Reducing multimodal hallucinations with per- turbative visual training.arXiv preprint arXiv:2503.06486,

  4. [4]

    Rextime: A benchmark suite for reasoning-across-time in videos.Advances in Neural In- formation Processing Systems, 37:28662–28673, 2024

    Jr-Jen Chen, Yu-Chien Liao, Hsi-Che Lin, Yu-Chu Yu, Yen- Chun Chen, and Frank Wang. Rextime: A benchmark suite for reasoning-across-time in videos.Advances in Neural In- formation Processing Systems, 37:28662–28673, 2024. 1, 3, 6

  5. [5]

    Sharegpt4video: Improving video understand- ing and generation with better captions.Advances in Neural Information Processing Systems, 37:19472–19495, 2024

    Lin Chen, Xilin Wei, Jinsong Li, Xiaoyi Dong, Pan Zhang, Yuhang Zang, Zehui Chen, Haodong Duan, Zhenyu Tang, Li Yuan, et al. Sharegpt4video: Improving video understand- ing and generation with better captions.Advances in Neural Information Processing Systems, 37:19472–19495, 2024. 7

  6. [6]

    Scaling rl to long videos.arXiv preprint arXiv:2507.07966, 2025

    Yukang Chen, Wei Huang, Baifeng Shi, Qinghao Hu, Han- rong Ye, Ligeng Zhu, Zhijian Liu, Pavlo Molchanov, Jan Kautz, Xiaojuan Qi, et al. Scaling rl to long videos.arXiv preprint arXiv:2507.07966, 2025. 1, 2

  7. [7]

    VideoLLaMA 2: Advancing Spatial-Temporal Modeling and Audio Understanding in Video-LLMs

    Zesen Cheng, Sicong Leng, Hang Zhang, Yifei Xin, Xin Li, Guanzheng Chen, Yongxin Zhu, Wenqi Zhang, Ziyang Luo, Deli Zhao, et al. Videollama 2: Advancing spatial- temporal modeling and audio understanding in video-llms. arXiv preprint arXiv:2406.07476, 2024. 7

  8. [8]

    V-star: Benchmarking video- llms on video spatio-temporal reasoning.arXiv preprint arXiv:2503.11495, 2025

    Zixu Cheng, Jian Hu, Ziquan Liu, Chenyang Si, Wei Li, and Shaogang Gong. V-star: Benchmarking video- llms on video spatio-temporal reasoning.arXiv preprint arXiv:2503.11495, 2025. 1

  9. [9]

    Spatial-temporal trans- former for dynamic scene graph generation

    Yuren Cong, Wentong Liao, Hanno Ackermann, Bodo Rosenhahn, and Michael Ying Yang. Spatial-temporal trans- former for dynamic scene graph generation. InProceedings of the IEEE/CVF international conference on computer vi- sion, pages 16372–16382, 2021. 2

  10. [10]

    Sophiavl-r1: Reinforcing mllms reason- ing with thinking reward.arXiv preprint arXiv:2505.17018,

    Kaixuan Fan, Kaituo Feng, Haoming Lyu, Dongzhan Zhou, and Xiangyu Yue. Sophiavl-r1: Reinforcing mllms reason- ing with thinking reward.arXiv preprint arXiv:2505.17018,

  11. [11]

    Mmbench-video: A long-form multi-shot benchmark for holistic video under- standing.Advances in Neural Information Processing Sys- tems, 37:89098–89124, 2024

    Xinyu Fang, Kangrui Mao, Haodong Duan, Xiangyu Zhao, Yining Li, Dahua Lin, and Kai Chen. Mmbench-video: A long-form multi-shot benchmark for holistic video under- standing.Advances in Neural Information Processing Sys- tems, 37:89098–89124, 2024. 1

  12. [12]

    Video-of-thought: Step-by-step video reasoning from perception to cognition

    Hao Fei, Shengqiong Wu, Wei Ji, Hanwang Zhang, Meishan Zhang, Mong-Li Lee, and Wynne Hsu. Video-of-thought: Step-by-step video reasoning from perception to cognition. InInternational Conference on Machine Learning, pages 13109–13125. PMLR, 2024. 2

  13. [13]

    Video-R1: Reinforcing Video Reasoning in MLLMs

    Kaituo Feng, Kaixiong Gong, Bohao Li, Zonghao Guo, Yibing Wang, Tianshuo Peng, Junfei Wu, Xiaoying Zhang, Benyou Wang, and Xiangyu Yue. Video-r1: Reinforcing video reasoning in mllms.arXiv preprint arXiv:2503.21776,

  14. [14]

    Video-mme: The first-ever comprehensive evaluation benchmark of multi-modal llms in video analysis

    Chaoyou Fu, Yuhan Dai, Yongdong Luo, Lei Li, Shuhuai Ren, Renrui Zhang, Zihan Wang, Chenyu Zhou, Yunhang Shen, Mengdan Zhang, et al. Video-mme: The first-ever comprehensive evaluation benchmark of multi-modal llms in video analysis. InProceedings of the Computer Vision and Pattern Recognition Conference, pages 24108–24118, 2025. 1

  15. [15]

    Em- bodied ai agents: Modeling the world.arXiv preprint arXiv:2506.22355, 2025

    Pascale Fung, Yoram Bachrach, Asli Celikyilmaz, Kamalika Chaudhuri, Delong Chen, Willy Chung, Emmanuel Dupoux, Hongyu Gong, Herv´e J´egou, Alessandro Lazaric, et al. Em- bodied ai agents: Modeling the world.arXiv preprint arXiv:2506.22355, 2025. 1

  16. [16]

    Chain-of-Frames: Advancing Video Understanding in Multimodal LLMs via Frame-Aware Reasoning

    Sara Ghazanfari, Francesco Croce, Nicolas Flammarion, Prashanth Krishnamurthy, Farshad Khorrami, and Siddharth Garg. Chain-of-frames: Advancing video understanding in multimodal llms via frame-aware reasoning.arXiv preprint arXiv:2506.00318, 2025. 1, 2

  17. [17]

    Real-time scene understanding for blind users: Enhancing vision-language models for accessibility

    Loan Gia. Real-time scene understanding for blind users: Enhancing vision-language models for accessibility. In Workshop on Vision Foundation Models and Generative AI for Accessibility: Challenges and Opportunities, 2025. 1

  18. [18]

    Deepseek-r1 incentivizes reasoning in llms through reinforcement learning.Nature, 645(8081):633– 638, 2025

    Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Peiyi Wang, Qihao Zhu, Runxin Xu, Ruoyu Zhang, Shirong Ma, Xiao Bi, et al. Deepseek-r1 incentivizes reasoning in llms through reinforcement learning.Nature, 645(8081):633– 638, 2025. 3

  19. [19]

    Trace: Temporal grounding video llm via causal event modeling.arXiv preprint arXiv:2410.05643,

    Yongxin Guo, Jingyu Liu, Mingda Li, Qingbin Liu, Xi Chen, and Xiaoying Tang. Trace: Temporal grounding video llm via causal event modeling.arXiv preprint arXiv:2410.05643,

  20. [20]

    Toga: Tempo- rally grounded open-ended video qa with weak supervision

    Ayush Gupta, Anirban Roy, Rama Chellappa, Nathaniel D Bastian, Alvaro Velasquez, and Susmit Jha. Toga: Tempo- rally grounded open-ended video qa with weak supervision. arXiv preprint arXiv:2506.09445, 2025. 6

  21. [21]

    Videoespresso: A large-scale chain-of-thought dataset for fine-grained video reasoning via core frame selection

    Songhao Han, Wei Huang, Hairong Shi, Le Zhuo, Xiu Su, Shifeng Zhang, Xu Zhou, Xiaojuan Qi, Yue Liao, and Si Liu. Videoespresso: A large-scale chain-of-thought dataset for fine-grained video reasoning via core frame selection. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 26181–26191, 2025. 1, 2

  22. [22]

    To- wards open-vocabulary scene graph generation with prompt- based finetuning

    Tao He, Lianli Gao, Jingkuan Song, and Yuan-Fang Li. To- wards open-vocabulary scene graph generation with prompt- based finetuning. InEuropean conference on computer vi- sion, pages 56–73. Springer, 2022. 2 9

  23. [23]

    Glm-4.1 v-thinking: Towards versatile multi- modal reasoning with scalable reinforcement learning.arXiv e-prints, pages arXiv–2507, 2025

    Wenyi Hong, Wenmeng Yu, Xiaotao Gu, Guo Wang, Guob- ing Gan, Haomiao Tang, Jiale Cheng, Ji Qi, Junhui Ji, Li- hang Pan, et al. Glm-4.1 v-thinking: Towards versatile multi- modal reasoning with scalable reinforcement learning.arXiv e-prints, pages arXiv–2507, 2025. 1, 2

  24. [24]

    Uncertainty-quantified roll- out policy adaptation for unlabelled cross-domain temporal grounding.arXiv preprint arXiv:2508.06317, 2025

    Jian Hu, Zixu Cheng, Shaogang Gong, Isabel Guan, Jianye Hao, Jun Wang, and Kun Shao. Uncertainty-quantified roll- out policy adaptation for unlabelled cross-domain temporal grounding.arXiv preprint arXiv:2508.06317, 2025. 3

  25. [25]

    Vtimellm: Empower llm to grasp video moments

    Bin Huang, Xin Wang, Hong Chen, Zihan Song, and Wenwu Zhu. Vtimellm: Empower llm to grasp video moments. In Proceedings of the IEEE/CVF Conference on Computer Vi- sion and Pattern Recognition, pages 14271–14280, 2024. 6

  26. [26]

    Lita: Language instructed temporal-localization assistant

    De-An Huang, Shijia Liao, Subhashree Radhakrishnan, Hongxu Yin, Pavlo Molchanov, Zhiding Yu, and Jan Kautz. Lita: Language instructed temporal-localization assistant. In European Conference on Computer Vision, pages 202–218. Springer, 2024. 1, 2, 3, 6

  27. [27]

    Building a mind palace: Structuring environment-grounded semantic graphs for ef- fective long video analysis with llms

    Zeyi Huang, Yuyang Ji, Xiaofang Wang, Nikhil Mehta, Tong Xiao, Donghyun Lee, Sigmund Vanvalkenburgh, Shengxin Zha, Bolin Lai, Licheng Yu, et al. Building a mind palace: Structuring environment-grounded semantic graphs for ef- fective long video analysis with llms. InProceedings of the Computer Vision and Pattern Recognition Conference, pages 24169–24179, 2025. 2

  28. [28]

    Action genome: Actions as compositions of spatio- temporal scene graphs

    Jingwei Ji, Ranjay Krishna, Li Fei-Fei, and Juan Carlos Niebles. Action genome: Actions as compositions of spatio- temporal scene graphs. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 10236–10247, 2020. 2

  29. [29]

    Look again, think slowly: Enhancing visual reflection in vision-language models.arXiv preprint arXiv:2509.12132, 2025

    Pu Jian, Junhong Wu, Wei Sun, Chen Wang, Shuo Ren, and Jiajun Zhang. Look again, think slowly: Enhancing visual reflection in vision-language models.arXiv preprint arXiv:2509.12132, 2025. 1, 2

  30. [30]

    Chat-univi: Unified visual representation em- powers large language models with image and video un- derstanding

    Peng Jin, Ryuichi Takanobu, Wancai Zhang, Xiaochun Cao, and Li Yuan. Chat-univi: Unified visual representation em- powers large language models with image and video un- derstanding. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 13700– 13710, 2024. 7

  31. [31]

    Do you remember? dense video captioning with cross-modal memory retrieval

    Minkuk Kim, Hyeon Bae Kim, Jinyoung Moon, Jinwoo Choi, and Seong Tae Kim. Do you remember? dense video captioning with cross-modal memory retrieval. InProceed- ings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 13894–13904, 2024. 1, 2

  32. [32]

    Vidhalluc: Evaluating temporal hallucinations in multimodal large lan- guage models for video understanding

    Chaoyu Li, Eun Woo Im, and Pooyan Fazli. Vidhalluc: Evaluating temporal hallucinations in multimodal large lan- guage models for video understanding. InProceedings of the Computer Vision and Pattern Recognition Conference, pages 13723–13733, 2025. 6

  33. [33]

    Reinforcement learning tuning for videollms: Reward design and data efficiency

    Hongyu Li, Songhao Han, Yue Liao, Junfeng Luo, Jialin Gao, Shuicheng Yan, and Si Liu. Reinforcement learning tuning for videollms: Reward design and data efficiency. arXiv preprint arXiv:2506.01908, 2025. 1, 2

  34. [34]

    Inten- tqa: Context-aware video intent reasoning

    Jiapeng Li, Ping Wei, Wenjuan Han, and Lifeng Fan. Inten- tqa: Context-aware video intent reasoning. InProceedings of the IEEE/CVF international conference on computer vision, pages 11963–11974, 2023. 1

  35. [35]

    Tem- poral reasoning transfer from text to video.arXiv preprint arXiv:2410.06166, 2024

    Lei Li, Yuanxin Liu, Linli Yao, Peiyuan Zhang, Chenxin An, Lean Wang, Xu Sun, Lingpeng Kong, and Qi Liu. Tem- poral reasoning transfer from text to video.arXiv preprint arXiv:2410.06166, 2024. 1, 2

  36. [36]

    Embodied agent inter- face: Benchmarking llms for embodied decision making

    Manling Li, Shiyu Zhao, Qineng Wang, Kangrui Wang, Yu Zhou, Sanjana Srivastava, Cem Gokmen, Tony Lee, Er- ran Li Li, Ruohan Zhang, et al. Embodied agent inter- face: Benchmarking llms for embodied decision making. Advances in Neural Information Processing Systems, 37: 100428–100534, 2024. 1

  37. [37]

    From pixels to graphs: Open-vocabulary scene graph generation with vision-language models

    Rongjie Li, Songyang Zhang, Dahua Lin, Kai Chen, and Xuming He. From pixels to graphs: Open-vocabulary scene graph generation with vision-language models. InProceed- ings of the IEEE/CVF conference on computer vision and pattern recognition, pages 28076–28086, 2024. 2

  38. [38]

    VideoChat-R1: Enhancing Spatio-Temporal Perception via Reinforcement Fine-Tuning

    Xinhao Li, Ziang Yan, Desen Meng, Lu Dong, Xiangyu Zeng, Yinan He, Yali Wang, Yu Qiao, Yi Wang, and Limin Wang. Videochat-r1: Enhancing spatio-temporal perception via reinforcement fine-tuning.arXiv preprint arXiv:2504.06958, 2025. 1, 2

  39. [39]

    Factorizable net: an efficient subgraph-based framework for scene graph generation

    Yikang Li, Wanli Ouyang, Bolei Zhou, Jianping Shi, Chao Zhang, and Xiaogang Wang. Factorizable net: an efficient subgraph-based framework for scene graph generation. In Proceedings of the European conference on computer vision (ECCV), pages 335–351, 2018. 2

  40. [40]

    Video-llava: Learning united visual repre- sentation by alignment before projection

    Bin Lin, Yang Ye, Bin Zhu, Jiaxi Cui, Munan Ning, Peng Jin, and Li Yuan. Video-llava: Learning united visual repre- sentation by alignment before projection. InProceedings of the 2024 Conference on Empirical Methods in Natural Lan- guage Processing, pages 5971–5984, 2024. 7

  41. [41]

    Vila: On pre-training for vi- sual language models

    Ji Lin, Hongxu Yin, Wei Ping, Pavlo Molchanov, Moham- mad Shoeybi, and Song Han. Vila: On pre-training for vi- sual language models. InProceedings of the IEEE/CVF con- ference on computer vision and pattern recognition, pages 26689–26699, 2024. 7

  42. [42]

    Univtg: Towards unified video- language temporal grounding

    Kevin Qinghong Lin, Pengchuan Zhang, Joya Chen, Shra- man Pramanick, Difei Gao, Alex Jinpeng Wang, Rui Yan, and Mike Zheng Shou. Univtg: Towards unified video- language temporal grounding. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 2794–2804, 2023. 6

  43. [43]

    More thinking, less seeing? assessing amplified halluci- nation in multimodal reasoning models.arXiv preprint arXiv:2505.21523, 2025

    Chengzhi Liu, Zhongxing Xu, Qingyue Wei, Juncheng Wu, James Zou, Xin Eric Wang, Yuyin Zhou, and Sheng Liu. More thinking, less seeing? assessing amplified halluci- nation in multimodal reasoning models.arXiv preprint arXiv:2505.21523, 2025. 1, 2

  44. [44]

    MUSEG: Reinforcing Video Temporal Understanding via Timestamp-Aware Multi-Segment Grounding

    Fuwen Luo, Shengfeng Lou, Chi Chen, Ziyue Wang, Chen- liang Li, Weizhou Shen, Jiyue Guo, Peng Li, Ming Yan, Ji Zhang, et al. Museg: Reinforcing video temporal un- derstanding via timestamp-aware multi-segment grounding. arXiv preprint arXiv:2505.20715, 2025. 1, 2

  45. [45]

    When thinking drifts: Evidential grounding for robust video reasoning.arXiv preprint arXiv:2510.06077, 2025

    Mi Luo, Zihui Xue, Alex Dimakis, and Kristen Grauman. When thinking drifts: Evidential grounding for robust video reasoning.arXiv preprint arXiv:2510.06077, 2025. 1, 2

  46. [46]

    Video-chatgpt: Towards detailed video un- derstanding via large vision and language models

    Muhammad Maaz, Hanoona Rasheed, Salman Khan, and Fahad Khan. Video-chatgpt: Towards detailed video un- derstanding via large vision and language models. InPro- ceedings of the 62nd Annual Meeting of the Association for 10 Computational Linguistics (Volume 1: Long Papers), pages 12585–12602, 2024. 7

  47. [47]

    Correlation-guided query-dependency calibra- tion for video temporal grounding.arXiv preprint arXiv:2311.08835, 2023

    WonJun Moon, Sangeek Hyun, SuBeen Lee, and Jae- Pil Heo. Correlation-guided query-dependency calibra- tion for video temporal grounding.arXiv preprint arXiv:2311.08835, 2023. 6

  48. [48]

    Unbiased scene graph generation in videos

    Sayak Nag, Kyle Min, Subarna Tripathi, and Amit K Roy- Chowdhury. Unbiased scene graph generation in videos. In Proceedings of the IEEE/CVF Conference on Computer Vi- sion and Pattern Recognition, pages 22803–22813, 2023. 2

  49. [49]

    Hig: Hierarchical interlacement graph approach to scene graph generation in video understanding

    Trong-Thuan Nguyen, Pha Nguyen, and Khoa Luu. Hig: Hierarchical interlacement graph approach to scene graph generation in video understanding. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 18384–18394, 2024. 2

  50. [50]

    Hyperglm: Hypergraph for video scene graph generation and anticipation

    Trong-Thuan Nguyen, Pha Nguyen, Jackson Cothren, Alper Yilmaz, and Khoa Luu. Hyperglm: Hypergraph for video scene graph generation and anticipation. InProceedings of the Computer Vision and Pattern Recognition Conference, pages 29150–29160, 2025. 2

  51. [51]

    Gpt-4 technical report

    OpenAI. Gpt-4 technical report. Technical report, OpenAI,

  52. [52]

    Gpt-4o system card

    OpenAI. Gpt-4o system card. Technical report, OpenAI,

  53. [53]

    Timesearch: Hierarchical video search with spotlight and reflection for human-like long video under- standing.arXiv preprint arXiv:2504.01407, 2025

    Junwen Pan, Rui Zhang, Xin Wan, Yuan Zhang, Ming Lu, and Qi She. Timesearch: Hierarchical video search with spotlight and reflection for human-like long video under- standing.arXiv preprint arXiv:2504.01407, 2025. 6, 7

  54. [54]

    Question- answering dense video events

    Hangyu Qin, Junbin Xiao, and Angela Yao. Question- answering dense video events. InProceedings of the 48th International ACM SIGIR Conference on Research and De- velopment in Information Retrieval, pages 884–894, 2025. 1, 2

  [55] Haiyi Qiu, Minghe Gao, Long Qian, Kaihang Pan, Qifan Yu, Juncheng Li, Wenjie Wang, Siliang Tang, Yueting Zhuang, and Tat-Seng Chua. STEP: Enhancing video-LLMs’ compositional reasoning by spatio-temporal graph-guided self-training. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 3284–3294, 2025. 1, 2

  [56] Shuhuai Ren, Linli Yao, Shicheng Li, Xu Sun, and Lu Hou. TimeChat: A time-sensitive multimodal large language model for long video understanding. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 14313–14323, 2024. 6

  [57] Haozhan Shen, Peng Liu, Jingcheng Li, Chunxin Fang, Yibo Ma, Jiajia Liao, Qiaoli Shen, Zilun Zhang, Kangjia Zhao, Qianqian Zhang, et al. VLM-R1: A stable and generalizable R1-style large vision-language model. arXiv preprint arXiv:2504.07615, 2025. 1, 2

  [58] Gemini Team, Rohan Anil, Sebastian Borgeaud, Jean-Baptiste Alayrac, Jiahui Yu, Radu Soricut, Johan Schalkwyk, Andrew M Dai, Anja Hauth, Katie Millican, et al. Gemini: A family of highly capable multimodal models. arXiv preprint arXiv:2312.11805, 2023. 6, 7

  [59] Reka Team, Aitor Ormazabal, Che Zheng, Cyprien de Masson d’Autume, Dani Yogatama, Deyu Fu, Donovan Ong, Eric Chen, Eugenie Lamprecht, Hai Pham, et al. Reka core, flash, and edge: A series of powerful multimodal language models. arXiv preprint arXiv:2404.12387, 2024. 6

  [60] Vishal Verma, Sawal Acharya, Samuel Simko, Devansh Bhardwaj, Anahita Haghighat, Dominik Janzing, Mrinmaya Sachan, Zhijing Jin, and Yongjin Yang. Causal AI scientist: Facilitating causal data science with large language models. In NeurIPS 2025 AI for Science Workshop, 2025. 1

  [61] Weiyun Wang, Zhangwei Gao, Lixin Gu, Hengjun Pu, Long Cui, Xingguang Wei, Zhaoyang Liu, Linglin Jing, Shenglong Ye, Jie Shao, et al. InternVL3.5: Advancing open-source multimodal models in versatility, reasoning, and efficiency. arXiv preprint arXiv:2508.18265, 2025. 1

  [62] Ye Wang, Ziheng Wang, Boshen Xu, Yang Du, Kejun Lin, Zihan Xiao, Zihao Yue, Jianzhong Ju, Liang Zhang, Dingyi Yang, et al. Time-R1: Post-training large vision language model for temporal video grounding. arXiv preprint arXiv:2503.13377, 2025. 1, 2, 3

  [63] Tao Wu, Runyu He, Gangshan Wu, and Limin Wang. SportsHHI: A dataset for human-human interaction detection in sports videos. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 18537–18546, 2024. 2

  [64] Jiaer Xia, Yuhang Zang, Peng Gao, Yixuan Li, and Kaiyang Zhou. Visionary-R1: Mitigating shortcuts in visual reasoning with reinforcement learning. arXiv preprint arXiv:2505.14677, 2025. 1, 2

  [65] Junbin Xiao, Angela Yao, Yicong Li, and Tat-Seng Chua. Can I trust your answer? Visually grounded video question answering. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 13204–13214, 2024. 3

  [66] Lin Xu, Yilin Zhao, Daquan Zhou, Zhijie Lin, See Kiong Ng, and Jiashi Feng. PLLaVA: Parameter-free LLaVA extension from images to videos for video dense captioning. arXiv preprint arXiv:2404.16994, 2024. 7

  [67] Jingkang Yang, Wenxuan Peng, Xiangtai Li, Zujin Guo, Liangyu Chen, Bo Li, Zheng Ma, Kaiyang Zhou, Wayne Zhang, Chen Change Loy, et al. Panoptic video scene graph generation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 18675–18685, 2023. 2

  [68] Jihan Yang, Shusheng Yang, Anjali W Gupta, Rilyn Han, Li Fei-Fei, and Saining Xie. Thinking in space: How multimodal large language models see, remember, and recall spaces. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 10632–10643, 2025. 1

  [69] Boqiang Zhang, Kehan Li, Zesen Cheng, Zhiqiang Hu, Yuqian Yuan, Guanzheng Chen, Sicong Leng, Yuming Jiang, Hang Zhang, Xin Li, et al. VideoLLaMA 3: Frontier multimodal foundation models for image and video understanding. arXiv preprint arXiv:2501.13106, 2025. 1

  [70] Haoji Zhang, Xin Gu, Jiawen Li, Chixiang Ma, Sule Bai, Chubin Zhang, Bowen Zhang, Zhichao Zhou, Dongliang He, and Yansong Tang. Thinking with videos: Multimodal tool-augmented reinforcement learning for long video reasoning. arXiv preprint arXiv:2508.04416, 2025. 6, 7

  [71] Jinglei Zhang, Yuanfan Guo, Rolandos Alexandros Potamias, Jiankang Deng, Hang Xu, and Chao Ma. VTimeCoT: Thinking by drawing for video temporal grounding and reasoning. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 24203–24213, 2025. 1, 2

  [72] Shuyi Zhang, Xiaoshuai Hao, Yingbo Tang, Lingfeng Zhang, Pengwei Wang, Zhongyuan Wang, Hongxuan Ma, and Shanghang Zhang. Video-CoT: A comprehensive dataset for spatiotemporal understanding of videos based on chain-of-thought. arXiv preprint arXiv:2506.08817, 2025. 1, 2

  [73] Yuanhan Zhang, Jinming Wu, Wei Li, Bo Li, Zejun Ma, Ziwei Liu, and Chunyuan Li. Video instruction tuning with synthetic data. arXiv preprint arXiv:2410.02713, 2024. 7

  [74] Yuanhan Zhang, Yunice Chew, Yuhao Dong, Aria Leo, Bo Hu, and Ziwei Liu. Towards video thinking test: A holistic benchmark for advanced video reasoning and understanding. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 20626–20636, 2025. 1