GraphThinker: Reinforcing Temporally Grounded Video Reasoning with Event Graph Thinking
Pith reviewed 2026-05-15 20:44 UTC · model grok-4.3
The pith
GraphThinker builds event graphs and applies a visual attention reward during reinforcement finetuning to ground MLLM video reasoning and reduce temporal hallucinations.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
GraphThinker employs an MLLM to construct an Event-based Video Scene Graph (EVSG) that captures both intra- and inter-event relations and then applies reinforcement finetuning with a novel visual attention reward that encourages active focus on reliable visual cues, jointly reducing reasoning hallucinations while improving grounding.
What carries the argument
Event-based Video Scene Graph (EVSG) that explicitly encodes intra- and inter-event relations to supply causal constraints for structured reasoning, paired with a visual attention reward inside reinforcement finetuning.
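The review does not spell out a schema for the EVSG, but a minimal sketch of what such a graph could look like is given below; every field and relation name here is a hypothetical illustration, not taken from the paper.

```python
from dataclasses import dataclass, field

@dataclass
class Event:
    """One event node: a time span plus intra-event (entity-level) relations."""
    event_id: str
    description: str                       # short natural-language label
    start: float                           # start time in seconds
    end: float                             # end time in seconds
    entities: list[str] = field(default_factory=list)
    # intra-event relations as (subject, predicate, object) triples,
    # e.g. ("person", "holds", "cup")
    relations: list[tuple[str, str, str]] = field(default_factory=list)

@dataclass
class EventGraph:
    """Event-based Video Scene Graph: event nodes plus inter-event edges."""
    events: dict[str, Event] = field(default_factory=dict)
    # inter-event relations as (source_event_id, relation, target_event_id),
    # e.g. ("e1", "before", "e2") or ("e1", "causes", "e3")
    edges: list[tuple[str, str, str]] = field(default_factory=list)
```

Serialised to JSON, a structure like this could be handed back to the model as explicit context, which is the role the review assigns to the EVSG.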
If this is right
- Over 4% improvement at IoU=0.3 for moment localisation on the RexTime dataset (the metric is sketched after this list).
- 9.8% improvement in reducing temporal sequence hallucination on the VidHalluc dataset.
- 7.6% gain in Binary QA performance for reducing action hallucination on VidHalluc.
- Explicit structured graphs replace reliance on unstructured dense captions, supplying causal constraints that guide reasoning.
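For context on the first bullet above: IoU=0.3 means a predicted moment counts as correct when its temporal intersection-over-union with the ground-truth span is at least 0.3. A minimal sketch of that metric, as an illustration rather than the authors' evaluation code:

```python
def temporal_iou(pred: tuple[float, float], gt: tuple[float, float]) -> float:
    """IoU of two time intervals given as (start, end) in seconds."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = max(pred[1], gt[1]) - min(pred[0], gt[0])
    return inter / union if union > 0 else 0.0

def recall_at_iou(preds, gts, threshold=0.3):
    """Fraction of queries whose predicted moment reaches the IoU threshold.
    Assumes one aligned prediction per ground-truth moment."""
    hits = sum(temporal_iou(p, g) >= threshold for p, g in zip(preds, gts))
    return hits / len(gts)
```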
Where Pith is reading between the lines
- The same graph-construction plus reward mechanism could be tested on audio-video or multi-shot reasoning tasks to check whether the hallucination reduction generalizes beyond single-camera clips.
- If graph accuracy proves to be the bottleneck, lightweight human-in-the-loop correction of the EVSG during training might further amplify the gains without full retraining.
- The approach suggests a route for applying explicit relational structures to other hallucination-prone domains such as long-form text generation or image captioning.
Load-bearing premise
An MLLM can reliably construct an accurate Event-based Video Scene Graph that captures true event relations without introducing its own hallucinations or errors.
What would settle it
On a new set of videos with human-annotated event graphs, measure whether the automatically constructed EVSG matches the annotations; if mismatch is high and the reported gains in localization and hallucination reduction disappear, the central claim fails.
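One concrete way to run that check, sketched under the assumption that both the predicted and the human-annotated graphs can be flattened into sets of (subject, relation, object) triples; this is an illustration, not the paper's protocol.

```python
def triple_prf(predicted: set, gold: set) -> tuple[float, float, float]:
    """Precision, recall, and F1 of predicted relation triples against
    human-annotated ones (exact-match version)."""
    if not predicted or not gold:
        return 0.0, 0.0, 0.0
    tp = len(predicted & gold)
    precision = tp / len(predicted)
    recall = tp / len(gold)
    f1 = 2 * precision * recall / (precision + recall) if tp else 0.0
    return precision, recall, f1

# Hypothetical example: two inter-event triples agree, one is spurious.
pred = {("e1", "before", "e2"), ("e2", "causes", "e3"), ("e1", "causes", "e3")}
gold = {("e1", "before", "e2"), ("e2", "causes", "e3")}
print(triple_prf(pred, gold))  # (0.67, 1.0, 0.8)
```

Exact string matching would be too strict in practice, so triples would need normalisation (synonym mapping, time-window alignment) before comparison.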
Original abstract
Video reasoning requires a fine-grained understanding of the temporal dependencies and event-level relations between objects and events in videos. Current Multimodal Large Language Models (MLLMs) are prone to severe temporal hallucinations in video reasoning. An underlying cause of these hallucinations is weak visual-temporal grounding and the lack of explicit structure for modelling event relations. Models often rely on auxiliary text, such as dense captions, rather than explicitly anchoring their reasoning in actual visual evidence. However, these textual representations are inherently unstructured and fail to provide explicit causal constraints needed to guide the model's reasoning. In this work, we propose GraphThinker, a reinforcement finetuning method that constructs a structured event representation of a video and enforces visual grounding to jointly reduce reasoning hallucinations. Specifically, we employ an MLLM to construct an Event-based Video Scene Graph (EVSG) that captures both intra- and inter-event relations, guiding a structured video reasoning process. Moreover, we address the weak grounding issue by introducing a novel visual attention reward during reinforcement finetuning that encourages the model to actively attend to reliable visual cues. On the RexTime dataset, GraphThinker achieves an over 4% improvement in IoU=0.3 for moment localisation. On the VidHalluc dataset, GraphThinker achieves a 9.8% improvement in reducing temporal sequence hallucination and a 7.6% gain in Binary QA in reducing action hallucination, compared to the state-of-the-art methods.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces GraphThinker, a reinforcement finetuning method for MLLMs in video reasoning. It first uses an MLLM to construct an Event-based Video Scene Graph (EVSG) encoding intra- and inter-event relations, then applies RL with a novel visual-attention reward to enforce grounding and reduce temporal and action hallucinations. Empirical claims include >4% IoU@0.3 gain on RexTime moment localization and 9.8%/7.6% gains on VidHalluc for temporal-sequence and action hallucination reduction versus SOTA baselines.
Significance. If the EVSG proves reliable and the gains are robust to controls, the work offers a concrete route to inject explicit event structure into video MLLMs, addressing a documented weakness in temporal grounding. The combination of graph construction with visual-reward RL is a clear methodological contribution that could generalize beyond the two evaluated datasets.
major comments (2)
- [Abstract, §3 (EVSG construction)] The central claim that the EVSG supplies reliable structured grounding is load-bearing, yet no quantitative validation of EVSG fidelity (human agreement, graph-edit distance to ground truth, or node/edge error rates) is reported. Because the EVSG is generated by the same MLLM family known to hallucinate temporally, unverified construction errors could propagate unchanged into the RL stage.
- [Abstract, §4 (experiments)] The reported percentage gains are presented without naming the precise SOTA baselines, the number of random seeds, statistical significance tests, or an ablation removing the EVSG component. Without these controls it is impossible to isolate whether the improvements derive from the claimed graph-based reasoning or from the visual-attention reward alone.
minor comments (2)
- [§3.2] Notation for the visual-attention reward should be introduced with an explicit equation rather than a prose description only (one illustrative form is sketched after this list).
- [Figure 2] Figure captions for the EVSG examples should include the exact prompt template used to elicit the graph from the MLLM.
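One plausible shape such an equation could take, offered purely to illustrate the notation being requested rather than the paper's actual definition: an attention term that rewards placing probability mass on the frames carrying the relevant evidence, added to the task reward with a weighting coefficient.

```latex
% Illustrative only; the paper's definition may differ.
% a_t: the model's normalised attention mass on frame t
% S:   the set of frames containing the relevant visual evidence
% \lambda: weight of the attention term in the total reward
R_{\mathrm{att}} = \sum_{t \in S} a_t,
\qquad
R_{\mathrm{total}} = R_{\mathrm{task}} + \lambda\, R_{\mathrm{att}}
```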
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed comments. We address each major point below and will revise the manuscript to incorporate the suggested improvements where appropriate.
Point-by-point responses
- Referee: [Abstract, §3 (EVSG construction)] The central claim that the EVSG supplies reliable structured grounding is load-bearing, yet no quantitative validation of EVSG fidelity (human agreement, graph-edit distance to ground truth, or node/edge error rates) is reported. Because the EVSG is generated by the same MLLM family known to hallucinate temporally, unverified construction errors could propagate unchanged into the RL stage.
Authors: We agree that the absence of quantitative validation for EVSG fidelity is a limitation in the current manuscript. The paper relies on qualitative examples and end-task gains to support the EVSG's utility, without reporting human agreement, graph-edit distance, or error rates. To address this, we will add a human evaluation subsection (and corresponding appendix) assessing EVSG quality on a 100-video subset, reporting node/edge precision-recall against human annotations. We will also explicitly discuss potential error propagation and how the visual-attention reward in RL is intended to reduce its impact by prioritizing visual grounding over potentially noisy graph elements. Revision: yes.
- Referee: [Abstract, §4 (experiments)] The reported percentage gains are presented without naming the precise SOTA baselines, the number of random seeds, statistical significance tests, or an ablation removing the EVSG component. Without these controls it is impossible to isolate whether the improvements derive from the claimed graph-based reasoning or from the visual-attention reward alone.
Authors: We acknowledge that the experimental reporting lacks sufficient controls for reproducibility and isolation of contributions. In the revision we will: (1) explicitly name all SOTA baselines (e.g., VideoChatGPT, LLaVA-Video, and Video-LLaMA variants); (2) report results averaged over 3 random seeds with standard deviations; (3) add paired t-test p-values for statistical significance; and (4) include a new ablation that removes the EVSG construction step while retaining the visual-attention reward, directly comparing against the full model to quantify the graph component's contribution. Revision: yes.
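As an illustration of item (3), a paired significance test over matched per-video scores could be run as below; the arrays are placeholders, not results from the paper.

```python
import numpy as np
from scipy import stats

# Placeholder per-video IoU scores for the full model and the EVSG-ablated
# variant; in the promised revision these would come from real runs.
full_model = np.array([0.41, 0.55, 0.38, 0.62, 0.47])
ablation = np.array([0.36, 0.52, 0.31, 0.58, 0.44])

t_stat, p_value = stats.ttest_rel(full_model, ablation)
print(f"paired t-test: t = {t_stat:.2f}, p = {p_value:.4f}")
```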
Circularity Check
No significant circularity; empirical method with external benchmarks
Full rationale
The paper proposes GraphThinker as a reinforcement finetuning pipeline that first builds an EVSG via MLLM and then applies a visual-attention reward. All reported gains (IoU on RexTime, hallucination reductions on VidHalluc) are presented as direct empirical comparisons against external SOTA baselines on held-out datasets. No equations, fitted parameters, or self-citations are invoked to derive these numbers from the method's own inputs; the derivation chain consists of standard RL steps whose outputs are measured independently. The central claims therefore remain self-contained against external evaluation.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: MLLMs can construct meaningful Event-based Video Scene Graphs that capture true event relations.
invented entities (1)
- Event-based Video Scene Graph (EVSG): no independent evidence.