pith. machine review for the scientific record.

arxiv: 2602.17555 · v3 · submitted 2026-02-19 · 💻 cs.CV

Recognition: no theorem link

GraphThinker: Reinforcing Temporally Grounded Video Reasoning with Event Graph Thinking

Authors on Pith: no claims yet

Pith reviewed 2026-05-15 20:44 UTC · model grok-4.3

classification 💻 cs.CV
keywords video reasoning · temporal hallucinations · event scene graphs · reinforcement finetuning · multimodal LLMs · moment localization · visual grounding

The pith

GraphThinker builds event graphs and applies visual rewards in reinforcement finetuning to ground MLLM video reasoning and reduce temporal hallucinations.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper seeks to fix severe temporal hallucinations in multimodal large language models during video reasoning, where models fabricate event sequences instead of grounding answers in actual footage. It introduces a structured Event-based Video Scene Graph to explicitly model relations inside and between events, then uses reinforcement finetuning plus a visual attention reward to force the model to attend to reliable visual evidence rather than unstructured text descriptions. This combination produces concrete gains on moment localization and hallucination benchmarks. A sympathetic reader cares because reliable temporal understanding is essential for any video AI application from search to surveillance, and current models fail precisely on the ordering and causation of events.

Core claim

GraphThinker employs an MLLM to construct an Event-based Video Scene Graph (EVSG) that captures both intra- and inter-event relations and then applies reinforcement finetuning with a novel visual attention reward that encourages active focus on reliable visual cues, jointly reducing reasoning hallucinations while improving grounding.

What carries the argument

Event-based Video Scene Graph (EVSG) that explicitly encodes intra- and inter-event relations to supply causal constraints for structured reasoning, paired with a visual attention reward inside reinforcement finetuning.
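The paper describes the EVSG only at this level of abstraction. A minimal sketch of what such a graph might look like as a data structure, with node, edge, and event fields that are illustrative assumptions rather than the paper's actual schema:

```python
from dataclasses import dataclass, field

# Hypothetical sketch of an Event-based Video Scene Graph (EVSG).
# The node/edge schema here is an illustrative assumption, not the paper's.

@dataclass(frozen=True)
class Node:
    name: str       # entity or action keyword, e.g. "person", "door"
    event_id: int   # index of the event segment the node belongs to

@dataclass(frozen=True)
class Edge:
    src: str
    rel: str                   # intra-event: "opens"; inter-event: "before", "causes"
    dst: str
    inter_event: bool = False  # True when the edge links two different events

@dataclass
class EVSG:
    nodes: list[Node] = field(default_factory=list)
    edges: list[Edge] = field(default_factory=list)

    def relations_between(self, e1: int, e2: int) -> list[Edge]:
        """Inter-event edges whose endpoints lie in events e1 and e2."""
        event_of = {n.name: n.event_id for n in self.nodes}
        return [e for e in self.edges
                if e.inter_event
                and event_of.get(e.src) == e1
                and event_of.get(e.dst) == e2]

g = EVSG(
    nodes=[Node("person", 0), Node("door", 0), Node("room", 1)],
    edges=[Edge("person", "opens", "door"),                      # intra-event relation
           Edge("person", "enters", "room", inter_event=True)],  # inter-event relation
)
print([e.rel for e in g.relations_between(0, 1)])  # ['enters']
```

The point of the inter-event edges is exactly the causal constraint the pith highlights: a reasoning step about event order can be checked against explicit `before`/`causes` edges instead of token correlations.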

If this is right

  • Over 4% improvement in IoU=0.3 for moment localisation on the RexTime dataset.
  • 9.8% improvement in reducing temporal sequence hallucination on the VidHalluc dataset.
  • 7.6% gain in Binary QA performance for reducing action hallucination on VidHalluc.
  • Explicit structured graphs replace reliance on unstructured dense captions, supplying causal constraints that guide reasoning.
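The headline localisation metric can be made concrete: temporal IoU between a predicted and a ground-truth moment, with recall at IoU=0.3 counting the fraction of queries whose prediction clears the threshold. A sketch with made-up segments:

```python
# The RexTime-style metric made concrete: temporal IoU between predicted and
# ground-truth moments, and recall at IoU=0.3. The segments below are made up.

def temporal_iou(pred, gt):
    """IoU of two (start, end) segments in seconds."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - inter
    return inter / union if union > 0 else 0.0

def recall_at_iou(preds, gts, thresh=0.3):
    """Fraction of queries whose prediction clears the IoU threshold."""
    hits = sum(temporal_iou(p, g) >= thresh for p, g in zip(preds, gts))
    return hits / len(gts)

preds = [(2.0, 7.0), (10.0, 12.0), (30.0, 40.0)]
gts   = [(3.0, 8.0), (20.0, 25.0), (31.0, 39.0)]
print(recall_at_iou(preds, gts))  # 2 of 3 predictions clear IoU >= 0.3
```

An "over 4% improvement in IoU=0.3" then means this hit fraction rises by more than four percentage points over the strongest baseline.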

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same graph-construction plus reward mechanism could be tested on audio-video or multi-shot reasoning tasks to check whether the hallucination reduction generalizes beyond single-camera clips.
  • If graph accuracy proves to be the bottleneck, lightweight human-in-the-loop correction of the EVSG during training might further amplify the gains without full retraining.
  • The approach suggests a route for applying explicit relational structures to other hallucination-prone domains such as long-form text generation or image captioning.

Load-bearing premise

An MLLM can reliably construct an accurate Event-based Video Scene Graph that captures true event relations without introducing its own hallucinations or errors.

What would settle it

On a new set of videos with human-annotated event graphs, measure whether the automatically constructed EVSG matches the annotations; if mismatch is high and the reported gains in localization and hallucination reduction disappear, the central claim fails.
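One way to score that comparison is to canonicalise both the predicted EVSG and the human annotation as sets of (subject, relation, object) triples and compute edge-level precision, recall, and F1. The triples below are hypothetical:

```python
# Sketch of the proposed fidelity check: canonicalise both the predicted EVSG
# and the human-annotated graph as sets of (subject, relation, object) triples
# and score edge-level precision / recall / F1. All triples are hypothetical.

def prf(pred: set, gold: set):
    tp = len(pred & gold)
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(gold) if gold else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

pred_edges = {("person", "opens", "door"),
              ("person", "enters", "room"),
              ("door", "before", "room")}     # spurious edge hallucinated by the MLLM
gold_edges = {("person", "opens", "door"),
              ("person", "enters", "room"),
              ("person", "sits", "chair")}    # edge the MLLM missed

p, r, f1 = prf(pred_edges, gold_edges)
print(round(p, 2), round(r, 2), round(f1, 2))  # 0.67 0.67 0.67
```

If this F1 is low on held-out videos while the end-task gains persist, the gains come from somewhere other than the graph; if both collapse together, the premise fails as stated.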

Figures

Figures reproduced from arXiv: 2602.17555 by Da Li, Jian Hu, Shaogang Gong, Wei Li, Yuhang Zang, Ziquan Liu, Zixu Cheng.

Figure 1
Figure 1: Current MLLMs [2] implicitly model event relations in video by token correlations. This often leads to hallucinations in video reasoning. For instance, when determining the temporal order of events, Qwen2.5-VL tends to yield temporal hallucinations due to a lack of explicit event-level evidence validation and reasoning. In contrast, our GraphThinker explicitly models both intra- and inter-event relations … view at source ↗
Figure 2
Figure 2: An overview of the GraphThinker video reasoning model. GraphThinker first employs an MLLM to generate multi-grained dense captions in sequence for a video. It deploys the same MLLM again to select keywords as graph nodes before iteratively optimizing them in constructing an event-based graph (EVSG). This EVSG is then used to serve as a fine-grained representation of structured event relations for reasoning… view at source ↗
Figure 3
Figure 3: An example of the proposed Event-based Video Scene Graph (EVSG) for video reasoning. The EVSG is composed of event … view at source ↗
Figure 4
Figure 4: A visual example showing that our method reduces hallucination during reasoning. Qwen2.5-VL still yields hallucination in … view at source ↗
read the original abstract

Video reasoning requires a fine-grained understanding of the temporal dependencies and event-level relations between objects and events in videos. Current Multimodal Large Language Models (MLLMs) are prone to severe temporal hallucinations in video reasoning. An underlying cause of these hallucinations is weak visual-temporal grounding and the lack of explicit structure for modelling event relations. Models often rely on auxiliary text, such as dense captions, rather than explicitly anchoring their reasoning in actual visual evidence. However, these textual representations are inherently unstructured and fail to provide explicit causal constraints needed to guide the model's reasoning. In this work, we propose GraphThinker, a reinforcement finetuning method that constructs a structured event representation of a video and enforces visual grounding to jointly reduce reasoning hallucinations. Specifically, we employ an MLLM to construct an Event-based Video Scene Graph (EVSG) that captures both intra- and inter-event relations, guiding a structured video reasoning process. Moreover, we address the weak grounding issue by introducing a novel visual attention reward during reinforcement finetuning that encourages the model to actively attend to reliable visual cues. On the RexTime dataset, GraphThinker achieves an over 4% improvement in IoU=0.3 for moment localisation. On the VidHalluc dataset, GraphThinker achieves a 9.8% improvement in reducing temporal sequence hallucination and a 7.6% gain in Binary QA in reducing action hallucination, compared to the state-of-the-art methods.
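The abstract describes the visual attention reward only in prose. One plausible, purely illustrative form, not the paper's actual formulation: reward the share of each answer token's visual attention mass that lands on the ground-truth evidence frames.

```python
# Purely illustrative form of a visual-attention reward; the paper's actual
# formulation is not reproduced here. Assumption: reward the share of each
# answer token's visual attention mass that lands on ground-truth evidence
# frames, averaged over answer tokens.

def visual_attention_reward(attn, visual_idx, evidence_idx):
    """attn: one row per answer token, attention weights over context tokens."""
    ratios = []
    for row in attn:
        on_visual = sum(row[i] for i in visual_idx)      # mass on all visual tokens
        on_evidence = sum(row[i] for i in evidence_idx)  # mass on grounded frames
        ratios.append(on_evidence / max(on_visual, 1e-8))
    return sum(ratios) / len(ratios)

# Two answer tokens over six context positions: 0-3 visual frames, 4-5 text.
attn = [[0.30, 0.30, 0.10, 0.10, 0.10, 0.10],
        [0.05, 0.05, 0.40, 0.40, 0.05, 0.05]]
r = visual_attention_reward(attn, visual_idx=[0, 1, 2, 3], evidence_idx=[2, 3])
print(round(r, 3))  # 0.569: the second token is well grounded, the first is not
```

Any reward of this shape is differentiable-free and fits naturally as an extra term in the RL objective, which is presumably why the authors pair it with reinforcement finetuning rather than supervised loss.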

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces GraphThinker, a reinforcement finetuning method for MLLMs in video reasoning. It first uses an MLLM to construct an Event-based Video Scene Graph (EVSG) encoding intra- and inter-event relations, then applies RL with a novel visual-attention reward to enforce grounding and reduce temporal and action hallucinations. Empirical claims include >4% IoU@0.3 gain on RexTime moment localization and 9.8%/7.6% gains on VidHalluc for temporal-sequence and action hallucination reduction versus SOTA baselines.

Significance. If the EVSG proves reliable and the gains are robust to controls, the work offers a concrete route to inject explicit event structure into video MLLMs, addressing a documented weakness in temporal grounding. The combination of graph construction with visual-reward RL is a clear methodological contribution that could generalize beyond the two evaluated datasets.

major comments (2)
  1. [Abstract and §3] Abstract and §3 (EVSG construction): the central claim that the EVSG supplies reliable structured grounding is load-bearing, yet no quantitative validation of EVSG fidelity (human agreement, graph-edit distance to ground truth, or node/edge error rates) is reported. Because the EVSG is generated by the same MLLM family known to hallucinate temporally, unverified construction errors could propagate unchanged into the RL stage.
  2. [Abstract and §4] Abstract and §4 (experiments): the reported percentage gains are presented without naming the precise SOTA baselines, number of random seeds, statistical significance tests, or ablation removing the EVSG component. Without these controls it is impossible to isolate whether the improvements derive from the claimed graph-based reasoning or from the visual-attention reward alone.
minor comments (2)
  1. [§3.2] Notation for the visual-attention reward should be introduced with an explicit equation rather than prose description only.
  2. [Figure 2] Figure captions for the EVSG examples should include the exact prompt template used to elicit the graph from the MLLM.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed comments. We address each major point below and will revise the manuscript to incorporate the suggested improvements where appropriate.

read point-by-point responses
  1. Referee: [Abstract and §3] Abstract and §3 (EVSG construction): the central claim that the EVSG supplies reliable structured grounding is load-bearing, yet no quantitative validation of EVSG fidelity (human agreement, graph-edit distance to ground truth, or node/edge error rates) is reported. Because the EVSG is generated by the same MLLM family known to hallucinate temporally, unverified construction errors could propagate unchanged into the RL stage.

    Authors: We agree that the absence of quantitative validation for EVSG fidelity is a limitation in the current manuscript. The paper relies on qualitative examples and end-task gains to support the EVSG's utility, without reporting human agreement, graph-edit distance, or error rates. To address this, we will add a human evaluation subsection (and corresponding appendix) assessing EVSG quality on a 100-video subset, reporting node/edge precision-recall against human annotations. We will also explicitly discuss potential error propagation and how the visual-attention reward in RL is intended to reduce its impact by prioritizing visual grounding over potentially noisy graph elements. revision: yes

  2. Referee: [Abstract and §4] Abstract and §4 (experiments): the reported percentage gains are presented without naming the precise SOTA baselines, number of random seeds, statistical significance tests, or ablation removing the EVSG component. Without these controls it is impossible to isolate whether the improvements derive from the claimed graph-based reasoning or from the visual-attention reward alone.

    Authors: We acknowledge that the experimental reporting lacks sufficient controls for reproducibility and isolation of contributions. In the revision we will: (1) explicitly name all SOTA baselines (e.g., VideoChatGPT, LLaVA-Video, and Video-LLaMA variants); (2) report results averaged over 3 random seeds with standard deviations; (3) add paired t-test p-values for statistical significance; and (4) include a new ablation that removes the EVSG construction step while retaining the visual-attention reward, directly comparing against the full model to quantify the graph component's contribution. revision: yes
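The seed-averaging and significance protocol the rebuttal commits to is easy to pin down. A sketch using hypothetical per-seed scores (not numbers from the paper) and the standard two-sided t critical value for df = 2:

```python
import math
from statistics import mean, stdev

# Sketch of the promised protocol. The per-seed scores below are hypothetical
# placeholders, not numbers from the paper.

def paired_t(a, b):
    """Paired t statistic over per-seed scores of two models."""
    diffs = [x - y for x, y in zip(a, b)]
    return mean(diffs) / (stdev(diffs) / math.sqrt(len(diffs)))

graphthinker = [62.1, 61.8, 62.4]  # hypothetical IoU=0.3 recall per seed
baseline     = [57.9, 58.2, 57.6]

t = paired_t(graphthinker, baseline)
# Two-sided critical value for df = 2 at alpha = 0.05 is 4.303.
print(round(t, 2), "significant" if t > 4.303 else "not significant")
```

With only 3 seeds the test has df = 2, so the critical value is large (4.303); gains need to be both sizable and stable across seeds to register as significant.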

Circularity Check

0 steps flagged

No significant circularity; empirical method with external benchmarks

full rationale

The paper proposes GraphThinker as a reinforcement finetuning pipeline that first builds an EVSG via MLLM and then applies a visual-attention reward. All reported gains (IoU on RexTime, hallucination reductions on VidHalluc) are presented as direct empirical comparisons against external SOTA baselines on held-out datasets. No equations, fitted parameters, or self-citations are invoked to derive these numbers from the method's own inputs; the derivation chain consists of standard RL steps whose outputs are measured independently. The central claims therefore remain self-contained against external evaluation.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The central claim rests on the domain assumption that MLLMs can produce reliable event graphs and that the attention reward will successfully enforce visual grounding during RL.

axioms (1)
  • domain assumption MLLMs can construct meaningful Event-based Video Scene Graphs that capture true event relations
    The method directly employs an MLLM to build the EVSG as the core structured representation.
invented entities (1)
  • Event-based Video Scene Graph (EVSG) no independent evidence
    purpose: To provide explicit intra- and inter-event relations that guide structured video reasoning
    New representation introduced to replace unstructured text captions

pith-pipeline@v0.9.0 · 5584 in / 1298 out tokens · 37563 ms · 2026-05-15T20:44:49.440450+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

74 extracted references · 74 canonical work pages · 12 internal anchors

  1. [1]

    The claude 3 model family: Opus, sonnet, haiku.Claude-3 Model Card, 1(1):4, 2024

    AI Anthropic. The claude 3 model family: Opus, sonnet, haiku.Claude-3 Model Card, 1(1):4, 2024. 6

  2. [2]

    Shuai Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Sibo Song, Kai Dang, Peng Wang, Shijie Wang, Jun Tang, et al. Qwen2. 5-vl technical report.arXiv preprint arXiv:2502.13923, 2025. 1, 6, 7

  3. [3]

    Perturbollava: Reducing multimodal hallucinations with per- turbative visual training.arXiv preprint arXiv:2503.06486,

    Cong Chen, Mingyu Liu, Chenchen Jing, Yizhou Zhou, Fengyun Rao, Hao Chen, Bo Zhang, and Chunhua Shen. Perturbollava: Reducing multimodal hallucinations with per- turbative visual training.arXiv preprint arXiv:2503.06486,

  4. [4]

    Rextime: A benchmark suite for reasoning-across-time in videos.Advances in Neural In- formation Processing Systems, 37:28662–28673, 2024

    Jr-Jen Chen, Yu-Chien Liao, Hsi-Che Lin, Yu-Chu Yu, Yen- Chun Chen, and Frank Wang. Rextime: A benchmark suite for reasoning-across-time in videos.Advances in Neural In- formation Processing Systems, 37:28662–28673, 2024. 1, 3, 6

  5. [5]

    Sharegpt4video: Improving video understand- ing and generation with better captions.Advances in Neural Information Processing Systems, 37:19472–19495, 2024

    Lin Chen, Xilin Wei, Jinsong Li, Xiaoyi Dong, Pan Zhang, Yuhang Zang, Zehui Chen, Haodong Duan, Zhenyu Tang, Li Yuan, et al. Sharegpt4video: Improving video understand- ing and generation with better captions.Advances in Neural Information Processing Systems, 37:19472–19495, 2024. 7

  6. [6]

    Scaling rl to long videos.arXiv preprint arXiv:2507.07966, 2025

    Yukang Chen, Wei Huang, Baifeng Shi, Qinghao Hu, Han- rong Ye, Ligeng Zhu, Zhijian Liu, Pavlo Molchanov, Jan Kautz, Xiaojuan Qi, et al. Scaling rl to long videos.arXiv preprint arXiv:2507.07966, 2025. 1, 2

  7. [7]

    VideoLLaMA 2: Advancing Spatial-Temporal Modeling and Audio Understanding in Video-LLMs

    Zesen Cheng, Sicong Leng, Hang Zhang, Yifei Xin, Xin Li, Guanzheng Chen, Yongxin Zhu, Wenqi Zhang, Ziyang Luo, Deli Zhao, et al. Videollama 2: Advancing spatial- temporal modeling and audio understanding in video-llms. arXiv preprint arXiv:2406.07476, 2024. 7

  8. [8]

    V-star: Benchmarking video- llms on video spatio-temporal reasoning.arXiv preprint arXiv:2503.11495, 2025

    Zixu Cheng, Jian Hu, Ziquan Liu, Chenyang Si, Wei Li, and Shaogang Gong. V-star: Benchmarking video- llms on video spatio-temporal reasoning.arXiv preprint arXiv:2503.11495, 2025. 1

  9. [9]

    Spatial-temporal trans- former for dynamic scene graph generation

    Yuren Cong, Wentong Liao, Hanno Ackermann, Bodo Rosenhahn, and Michael Ying Yang. Spatial-temporal trans- former for dynamic scene graph generation. InProceedings of the IEEE/CVF international conference on computer vi- sion, pages 16372–16382, 2021. 2

  10. [10]

    Sophiavl-r1: Reinforcing mllms reason- ing with thinking reward.arXiv preprint arXiv:2505.17018,

    Kaixuan Fan, Kaituo Feng, Haoming Lyu, Dongzhan Zhou, and Xiangyu Yue. Sophiavl-r1: Reinforcing mllms reason- ing with thinking reward.arXiv preprint arXiv:2505.17018,

  11. [11]

    Mmbench-video: A long-form multi-shot benchmark for holistic video under- standing.Advances in Neural Information Processing Sys- tems, 37:89098–89124, 2024

    Xinyu Fang, Kangrui Mao, Haodong Duan, Xiangyu Zhao, Yining Li, Dahua Lin, and Kai Chen. Mmbench-video: A long-form multi-shot benchmark for holistic video under- standing.Advances in Neural Information Processing Sys- tems, 37:89098–89124, 2024. 1

  12. [12]

    Video-of-thought: Step-by-step video reasoning from perception to cognition

    Hao Fei, Shengqiong Wu, Wei Ji, Hanwang Zhang, Meishan Zhang, Mong-Li Lee, and Wynne Hsu. Video-of-thought: Step-by-step video reasoning from perception to cognition. InInternational Conference on Machine Learning, pages 13109–13125. PMLR, 2024. 2

  13. [13]

    Video-R1: Reinforcing Video Reasoning in MLLMs

    Kaituo Feng, Kaixiong Gong, Bohao Li, Zonghao Guo, Yibing Wang, Tianshuo Peng, Junfei Wu, Xiaoying Zhang, Benyou Wang, and Xiangyu Yue. Video-r1: Reinforcing video reasoning in mllms.arXiv preprint arXiv:2503.21776,

  14. [14]

    Video-mme: The first-ever comprehensive evaluation benchmark of multi-modal llms in video analysis

    Chaoyou Fu, Yuhan Dai, Yongdong Luo, Lei Li, Shuhuai Ren, Renrui Zhang, Zihan Wang, Chenyu Zhou, Yunhang Shen, Mengdan Zhang, et al. Video-mme: The first-ever comprehensive evaluation benchmark of multi-modal llms in video analysis. InProceedings of the Computer Vision and Pattern Recognition Conference, pages 24108–24118, 2025. 1

  15. [15]

    Em- bodied ai agents: Modeling the world.arXiv preprint arXiv:2506.22355, 2025

    Pascale Fung, Yoram Bachrach, Asli Celikyilmaz, Kamalika Chaudhuri, Delong Chen, Willy Chung, Emmanuel Dupoux, Hongyu Gong, Herv´e J´egou, Alessandro Lazaric, et al. Em- bodied ai agents: Modeling the world.arXiv preprint arXiv:2506.22355, 2025. 1

  16. [16]

    Chain-of-Frames: Advancing Video Understanding in Multimodal LLMs via Frame-Aware Reasoning

    Sara Ghazanfari, Francesco Croce, Nicolas Flammarion, Prashanth Krishnamurthy, Farshad Khorrami, and Siddharth Garg. Chain-of-frames: Advancing video understanding in multimodal llms via frame-aware reasoning.arXiv preprint arXiv:2506.00318, 2025. 1, 2

  17. [17]

    Real-time scene understanding for blind users: Enhancing vision-language models for accessibility

    Loan Gia. Real-time scene understanding for blind users: Enhancing vision-language models for accessibility. In Workshop on Vision Foundation Models and Generative AI for Accessibility: Challenges and Opportunities, 2025. 1

  18. [18]

    Deepseek-r1 incentivizes reasoning in llms through reinforcement learning.Nature, 645(8081):633– 638, 2025

    Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Peiyi Wang, Qihao Zhu, Runxin Xu, Ruoyu Zhang, Shirong Ma, Xiao Bi, et al. Deepseek-r1 incentivizes reasoning in llms through reinforcement learning.Nature, 645(8081):633– 638, 2025. 3

  19. [19]

    Trace: Temporal grounding video llm via causal event modeling.arXiv preprint arXiv:2410.05643,

    Yongxin Guo, Jingyu Liu, Mingda Li, Qingbin Liu, Xi Chen, and Xiaoying Tang. Trace: Temporal grounding video llm via causal event modeling.arXiv preprint arXiv:2410.05643,

  20. [20]

    Toga: Tempo- rally grounded open-ended video qa with weak supervision

    Ayush Gupta, Anirban Roy, Rama Chellappa, Nathaniel D Bastian, Alvaro Velasquez, and Susmit Jha. Toga: Tempo- rally grounded open-ended video qa with weak supervision. arXiv preprint arXiv:2506.09445, 2025. 6

  21. [21]

    Videoespresso: A large-scale chain-of-thought dataset for fine-grained video reasoning via core frame selection

    Songhao Han, Wei Huang, Hairong Shi, Le Zhuo, Xiu Su, Shifeng Zhang, Xu Zhou, Xiaojuan Qi, Yue Liao, and Si Liu. Videoespresso: A large-scale chain-of-thought dataset for fine-grained video reasoning via core frame selection. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 26181–26191, 2025. 1, 2

  22. [22]

    To- wards open-vocabulary scene graph generation with prompt- based finetuning

    Tao He, Lianli Gao, Jingkuan Song, and Yuan-Fang Li. To- wards open-vocabulary scene graph generation with prompt- based finetuning. InEuropean conference on computer vi- sion, pages 56–73. Springer, 2022. 2 9

  23. [23]

    Glm-4.1 v-thinking: Towards versatile multi- modal reasoning with scalable reinforcement learning.arXiv e-prints, pages arXiv–2507, 2025

    Wenyi Hong, Wenmeng Yu, Xiaotao Gu, Guo Wang, Guob- ing Gan, Haomiao Tang, Jiale Cheng, Ji Qi, Junhui Ji, Li- hang Pan, et al. Glm-4.1 v-thinking: Towards versatile multi- modal reasoning with scalable reinforcement learning.arXiv e-prints, pages arXiv–2507, 2025. 1, 2

  24. [24]

    Uncertainty-quantified roll- out policy adaptation for unlabelled cross-domain temporal grounding.arXiv preprint arXiv:2508.06317, 2025

    Jian Hu, Zixu Cheng, Shaogang Gong, Isabel Guan, Jianye Hao, Jun Wang, and Kun Shao. Uncertainty-quantified roll- out policy adaptation for unlabelled cross-domain temporal grounding.arXiv preprint arXiv:2508.06317, 2025. 3

  25. [25]

    Vtimellm: Empower llm to grasp video moments

    Bin Huang, Xin Wang, Hong Chen, Zihan Song, and Wenwu Zhu. Vtimellm: Empower llm to grasp video moments. In Proceedings of the IEEE/CVF Conference on Computer Vi- sion and Pattern Recognition, pages 14271–14280, 2024. 6

  26. [26]

    Lita: Language instructed temporal-localization assistant

    De-An Huang, Shijia Liao, Subhashree Radhakrishnan, Hongxu Yin, Pavlo Molchanov, Zhiding Yu, and Jan Kautz. Lita: Language instructed temporal-localization assistant. In European Conference on Computer Vision, pages 202–218. Springer, 2024. 1, 2, 3, 6

  27. [27]

    Building a mind palace: Structuring environment-grounded semantic graphs for ef- fective long video analysis with llms

    Zeyi Huang, Yuyang Ji, Xiaofang Wang, Nikhil Mehta, Tong Xiao, Donghyun Lee, Sigmund Vanvalkenburgh, Shengxin Zha, Bolin Lai, Licheng Yu, et al. Building a mind palace: Structuring environment-grounded semantic graphs for ef- fective long video analysis with llms. InProceedings of the Computer Vision and Pattern Recognition Conference, pages 24169–24179, 2025. 2

  28. [28]

    Action genome: Actions as compositions of spatio- temporal scene graphs

    Jingwei Ji, Ranjay Krishna, Li Fei-Fei, and Juan Carlos Niebles. Action genome: Actions as compositions of spatio- temporal scene graphs. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 10236–10247, 2020. 2

  29. [29]

    Look again, think slowly: Enhancing visual reflection in vision-language models.arXiv preprint arXiv:2509.12132, 2025

    Pu Jian, Junhong Wu, Wei Sun, Chen Wang, Shuo Ren, and Jiajun Zhang. Look again, think slowly: Enhancing visual reflection in vision-language models.arXiv preprint arXiv:2509.12132, 2025. 1, 2

  30. [30]

    Chat-univi: Unified visual representation em- powers large language models with image and video un- derstanding

    Peng Jin, Ryuichi Takanobu, Wancai Zhang, Xiaochun Cao, and Li Yuan. Chat-univi: Unified visual representation em- powers large language models with image and video un- derstanding. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 13700– 13710, 2024. 7

  31. [31]

    Do you remember? dense video captioning with cross-modal memory retrieval

    Minkuk Kim, Hyeon Bae Kim, Jinyoung Moon, Jinwoo Choi, and Seong Tae Kim. Do you remember? dense video captioning with cross-modal memory retrieval. InProceed- ings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 13894–13904, 2024. 1, 2

  32. [32]

    Vidhalluc: Evaluating temporal hallucinations in multimodal large lan- guage models for video understanding

    Chaoyu Li, Eun Woo Im, and Pooyan Fazli. Vidhalluc: Evaluating temporal hallucinations in multimodal large lan- guage models for video understanding. InProceedings of the Computer Vision and Pattern Recognition Conference, pages 13723–13733, 2025. 6

  33. [33]

    Reinforcement learning tuning for videollms: Reward design and data efficiency

    Hongyu Li, Songhao Han, Yue Liao, Junfeng Luo, Jialin Gao, Shuicheng Yan, and Si Liu. Reinforcement learning tuning for videollms: Reward design and data efficiency. arXiv preprint arXiv:2506.01908, 2025. 1, 2

  34. [34]

    Inten- tqa: Context-aware video intent reasoning

    Jiapeng Li, Ping Wei, Wenjuan Han, and Lifeng Fan. Inten- tqa: Context-aware video intent reasoning. InProceedings of the IEEE/CVF international conference on computer vision, pages 11963–11974, 2023. 1

  35. [35]

    Tem- poral reasoning transfer from text to video.arXiv preprint arXiv:2410.06166, 2024

    Lei Li, Yuanxin Liu, Linli Yao, Peiyuan Zhang, Chenxin An, Lean Wang, Xu Sun, Lingpeng Kong, and Qi Liu. Tem- poral reasoning transfer from text to video.arXiv preprint arXiv:2410.06166, 2024. 1, 2

  36. [36]

    Embodied agent inter- face: Benchmarking llms for embodied decision making

    Manling Li, Shiyu Zhao, Qineng Wang, Kangrui Wang, Yu Zhou, Sanjana Srivastava, Cem Gokmen, Tony Lee, Er- ran Li Li, Ruohan Zhang, et al. Embodied agent inter- face: Benchmarking llms for embodied decision making. Advances in Neural Information Processing Systems, 37: 100428–100534, 2024. 1

  37. [37]

    From pixels to graphs: Open-vocabulary scene graph generation with vision-language models

    Rongjie Li, Songyang Zhang, Dahua Lin, Kai Chen, and Xuming He. From pixels to graphs: Open-vocabulary scene graph generation with vision-language models. InProceed- ings of the IEEE/CVF conference on computer vision and pattern recognition, pages 28076–28086, 2024. 2

  38. [38]

    VideoChat-R1: Enhancing Spatio-Temporal Perception via Reinforcement Fine-Tuning

    Xinhao Li, Ziang Yan, Desen Meng, Lu Dong, Xiangyu Zeng, Yinan He, Yali Wang, Yu Qiao, Yi Wang, and Limin Wang. Videochat-r1: Enhancing spatio-temporal perception via reinforcement fine-tuning.arXiv preprint arXiv:2504.06958, 2025. 1, 2

  39. [39]

    Factorizable net: an efficient subgraph-based framework for scene graph generation

    Yikang Li, Wanli Ouyang, Bolei Zhou, Jianping Shi, Chao Zhang, and Xiaogang Wang. Factorizable net: an efficient subgraph-based framework for scene graph generation. In Proceedings of the European conference on computer vision (ECCV), pages 335–351, 2018. 2

  40. [40]

    Video-llava: Learning united visual repre- sentation by alignment before projection

    Bin Lin, Yang Ye, Bin Zhu, Jiaxi Cui, Munan Ning, Peng Jin, and Li Yuan. Video-llava: Learning united visual repre- sentation by alignment before projection. InProceedings of the 2024 Conference on Empirical Methods in Natural Lan- guage Processing, pages 5971–5984, 2024. 7

  41. [41]

    Vila: On pre-training for vi- sual language models

    Ji Lin, Hongxu Yin, Wei Ping, Pavlo Molchanov, Moham- mad Shoeybi, and Song Han. Vila: On pre-training for vi- sual language models. InProceedings of the IEEE/CVF con- ference on computer vision and pattern recognition, pages 26689–26699, 2024. 7

  42. [42]

    Univtg: Towards unified video- language temporal grounding

    Kevin Qinghong Lin, Pengchuan Zhang, Joya Chen, Shra- man Pramanick, Difei Gao, Alex Jinpeng Wang, Rui Yan, and Mike Zheng Shou. Univtg: Towards unified video- language temporal grounding. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 2794–2804, 2023. 6

  43. [43]

    More thinking, less seeing? assessing amplified halluci- nation in multimodal reasoning models.arXiv preprint arXiv:2505.21523, 2025

    Chengzhi Liu, Zhongxing Xu, Qingyue Wei, Juncheng Wu, James Zou, Xin Eric Wang, Yuyin Zhou, and Sheng Liu. More thinking, less seeing? assessing amplified halluci- nation in multimodal reasoning models.arXiv preprint arXiv:2505.21523, 2025. 1, 2

  44. [44]

    MUSEG: Reinforcing Video Temporal Understanding via Timestamp-Aware Multi-Segment Grounding

    Fuwen Luo, Shengfeng Lou, Chi Chen, Ziyue Wang, Chen- liang Li, Weizhou Shen, Jiyue Guo, Peng Li, Ming Yan, Ji Zhang, et al. Museg: Reinforcing video temporal un- derstanding via timestamp-aware multi-segment grounding. arXiv preprint arXiv:2505.20715, 2025. 1, 2

  45. [45]

    When thinking drifts: Evidential grounding for robust video reasoning.arXiv preprint arXiv:2510.06077, 2025

    Mi Luo, Zihui Xue, Alex Dimakis, and Kristen Grauman. When thinking drifts: Evidential grounding for robust video reasoning.arXiv preprint arXiv:2510.06077, 2025. 1, 2

  46. [46]

    Video-chatgpt: Towards detailed video un- derstanding via large vision and language models

    Muhammad Maaz, Hanoona Rasheed, Salman Khan, and Fahad Khan. Video-chatgpt: Towards detailed video un- derstanding via large vision and language models. InPro- ceedings of the 62nd Annual Meeting of the Association for 10 Computational Linguistics (Volume 1: Long Papers), pages 12585–12602, 2024. 7

  47. [47]

    Correlation-guided query-dependency calibra- tion for video temporal grounding.arXiv preprint arXiv:2311.08835, 2023

    WonJun Moon, Sangeek Hyun, SuBeen Lee, and Jae- Pil Heo. Correlation-guided query-dependency calibra- tion for video temporal grounding.arXiv preprint arXiv:2311.08835, 2023. 6

  48. [48]

    Unbiased scene graph generation in videos

    Sayak Nag, Kyle Min, Subarna Tripathi, and Amit K Roy- Chowdhury. Unbiased scene graph generation in videos. In Proceedings of the IEEE/CVF Conference on Computer Vi- sion and Pattern Recognition, pages 22803–22813, 2023. 2

  49. [49]

    Hig: Hierarchical interlacement graph approach to scene graph generation in video understanding

    Trong-Thuan Nguyen, Pha Nguyen, and Khoa Luu. Hig: Hierarchical interlacement graph approach to scene graph generation in video understanding. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 18384–18394, 2024. 2

  50. [50]

    Hyperglm: Hypergraph for video scene graph generation and anticipation

    Trong-Thuan Nguyen, Pha Nguyen, Jackson Cothren, Alper Yilmaz, and Khoa Luu. Hyperglm: Hypergraph for video scene graph generation and anticipation. InProceedings of the Computer Vision and Pattern Recognition Conference, pages 29150–29160, 2025. 2

  51. [51]

    Gpt-4 technical report

    OpenAI. Gpt-4 technical report. Technical report, OpenAI,

  52. [52]

    Gpt-4o system card

    OpenAI. Gpt-4o system card. Technical report, OpenAI,

  53. [53]

    Timesearch: Hierarchical video search with spotlight and reflection for human-like long video under- standing.arXiv preprint arXiv:2504.01407, 2025

    Junwen Pan, Rui Zhang, Xin Wan, Yuan Zhang, Ming Lu, and Qi She. Timesearch: Hierarchical video search with spotlight and reflection for human-like long video under- standing.arXiv preprint arXiv:2504.01407, 2025. 6, 7

  54. [54]

    Question- answering dense video events

    Hangyu Qin, Junbin Xiao, and Angela Yao. Question- answering dense video events. InProceedings of the 48th International ACM SIGIR Conference on Research and De- velopment in Information Retrieval, pages 884–894, 2025. 1, 2

  [55] Haiyi Qiu, Minghe Gao, Long Qian, Kaihang Pan, Qifan Yu, Juncheng Li, Wenjie Wang, Siliang Tang, Yueting Zhuang, and Tat-Seng Chua. STEP: Enhancing video-LLMs’ compositional reasoning by spatio-temporal graph-guided self-training. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 3284–3294, 2025. 1, 2

  [56] Shuhuai Ren, Linli Yao, Shicheng Li, Xu Sun, and Lu Hou. TimeChat: A time-sensitive multimodal large language model for long video understanding. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 14313–14323, 2024. 6

  [57] Haozhan Shen, Peng Liu, Jingcheng Li, Chunxin Fang, Yibo Ma, Jiajia Liao, Qiaoli Shen, Zilun Zhang, Kangjia Zhao, Qianqian Zhang, et al. VLM-R1: A stable and generalizable R1-style large vision-language model. arXiv preprint arXiv:2504.07615, 2025. 1, 2

  [58] Gemini Team, Rohan Anil, Sebastian Borgeaud, Jean-Baptiste Alayrac, Jiahui Yu, Radu Soricut, Johan Schalkwyk, Andrew M Dai, Anja Hauth, Katie Millican, et al. Gemini: A family of highly capable multimodal models. arXiv preprint arXiv:2312.11805, 2023. 6, 7

  [59] Reka Team, Aitor Ormazabal, Che Zheng, Cyprien de Masson d’Autume, Dani Yogatama, Deyu Fu, Donovan Ong, Eric Chen, Eugenie Lamprecht, Hai Pham, et al. Reka core, flash, and edge: A series of powerful multimodal language models. arXiv preprint arXiv:2404.12387, 2024. 6

  [60] Vishal Verma, Sawal Acharya, Samuel Simko, Devansh Bhardwaj, Anahita Haghighat, Dominik Janzing, Mrinmaya Sachan, Zhijing Jin, and Yongjin Yang. Causal AI scientist: Facilitating causal data science with large language models. In NeurIPS 2025 AI for Science Workshop, 2025. 1

  [61] Weiyun Wang, Zhangwei Gao, Lixin Gu, Hengjun Pu, Long Cui, Xingguang Wei, Zhaoyang Liu, Linglin Jing, Shenglong Ye, Jie Shao, et al. InternVL3.5: Advancing open-source multimodal models in versatility, reasoning, and efficiency. arXiv preprint arXiv:2508.18265, 2025. 1

  [62] Ye Wang, Ziheng Wang, Boshen Xu, Yang Du, Kejun Lin, Zihan Xiao, Zihao Yue, Jianzhong Ju, Liang Zhang, Dingyi Yang, et al. Time-R1: Post-training large vision language model for temporal video grounding. arXiv preprint arXiv:2503.13377, 2025. 1, 2, 3

  [63] Tao Wu, Runyu He, Gangshan Wu, and Limin Wang. SportsHHI: A dataset for human-human interaction detection in sports videos. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 18537–18546, 2024. 2

  [64] Jiaer Xia, Yuhang Zang, Peng Gao, Yixuan Li, and Kaiyang Zhou. Visionary-R1: Mitigating shortcuts in visual reasoning with reinforcement learning. arXiv preprint arXiv:2505.14677, 2025. 1, 2

  [65] Junbin Xiao, Angela Yao, Yicong Li, and Tat-Seng Chua. Can I trust your answer? Visually grounded video question answering. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 13204–13214, 2024. 3

  [66] Lin Xu, Yilin Zhao, Daquan Zhou, Zhijie Lin, See Kiong Ng, and Jiashi Feng. PLLaVA: Parameter-free LLaVA extension from images to videos for video dense captioning. arXiv preprint arXiv:2404.16994, 2024. 7

  [67] Jingkang Yang, Wenxuan Peng, Xiangtai Li, Zujin Guo, Liangyu Chen, Bo Li, Zheng Ma, Kaiyang Zhou, Wayne Zhang, Chen Change Loy, et al. Panoptic video scene graph generation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 18675–18685, 2023. 2

  [68] Jihan Yang, Shusheng Yang, Anjali W Gupta, Rilyn Han, Li Fei-Fei, and Saining Xie. Thinking in space: How multimodal large language models see, remember, and recall spaces. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 10632–10643, 2025. 1

  [69] Boqiang Zhang, Kehan Li, Zesen Cheng, Zhiqiang Hu, Yuqian Yuan, Guanzheng Chen, Sicong Leng, Yuming Jiang, Hang Zhang, Xin Li, et al. VideoLLaMA 3: Frontier multimodal foundation models for image and video understanding. arXiv preprint arXiv:2501.13106, 2025. 1

  [70] Haoji Zhang, Xin Gu, Jiawen Li, Chixiang Ma, Sule Bai, Chubin Zhang, Bowen Zhang, Zhichao Zhou, Dongliang He, and Yansong Tang. Thinking with videos: Multimodal tool-augmented reinforcement learning for long video reasoning. arXiv preprint arXiv:2508.04416, 2025. 6, 7

  [71] Jinglei Zhang, Yuanfan Guo, Rolandos Alexandros Potamias, Jiankang Deng, Hang Xu, and Chao Ma. VTimeCoT: Thinking by drawing for video temporal grounding and reasoning. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 24203–24213, 2025. 1, 2

  [72] Shuyi Zhang, Xiaoshuai Hao, Yingbo Tang, Lingfeng Zhang, Pengwei Wang, Zhongyuan Wang, Hongxuan Ma, and Shanghang Zhang. Video-CoT: A comprehensive dataset for spatiotemporal understanding of videos based on chain-of-thought. arXiv preprint arXiv:2506.08817, 2025. 1, 2

  [73] Yuanhan Zhang, Jinming Wu, Wei Li, Bo Li, Zejun Ma, Ziwei Liu, and Chunyuan Li. Video instruction tuning with synthetic data. arXiv preprint arXiv:2410.02713, 2024. 7

  [74] Yuanhan Zhang, Yunice Chew, Yuhao Dong, Aria Leo, Bo Hu, and Ziwei Liu. Towards video thinking test: A holistic benchmark for advanced video reasoning and understanding. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 20626–20636, 2025. 1