pith. machine review for the scientific record. sign in

arxiv: 2604.14692 · v1 · submitted 2026-04-16 · 💻 cs.CV

Recognition: unknown

Chain-of-Glimpse: Search-Guided Progressive Object-Grounded Reasoning for Video Understanding

Bo Cheng, Genbao Xu, Nan Ma, Quanxing Zha, Soujanya Poria, Teng Wang, Wei Rao, Wenyuan Gu, Zhixuan Wu

Authors on Pith no claims yet

Pith reviewed 2026-05-10 11:59 UTC · model grok-4.3

classification 💻 cs.CV
keywords video understandingobject groundingprogressive reasoningreinforcement learningvisual evidence regionsmulti-step decisionssearch-guided controllerreasoning trajectories
0
0 comments X

The pith

Chain-of-Glimpse uses a search-guided controller to iteratively ground visual objects and build reliable reasoning trajectories for video understanding.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Video understanding requires tracking how specific objects change across frames, yet many existing approaches rely on broad saliency cues without anchoring reasoning to particular visual regions. The paper presents Chain-of-Glimpse to treat reasoning as an incremental process that builds spatially grounded traces around task-relevant objects. A search-guided controller, trained through reinforcement learning with a format reward, drives each step by locating and focusing on relevant evidence regions. This setup produces multi-step decisions that remain accurate and interpretable across both familiar and new video benchmarks.

Core claim

Chain-of-Glimpse formulates video reasoning as a step-by-step process that incrementally builds spatially grounded traces around task-relevant visual objects. It features a search-guided controller, optimized via reinforcement learning with a format reward that incentivizes grounding capability, to iteratively ground visual evidence regions and form reliable reasoning trajectories, yielding accurate and interpretable multi-step decisions.

What carries the argument

search-guided controller optimized via reinforcement learning with a format reward that incentivizes grounding capability, to iteratively ground visual evidence regions and form reliable reasoning trajectories

If this is right

  • Consistent performance gains on in-domain benchmarks such as NExTQA.
  • Robustness and generalization across out-of-domain benchmarks including Video-Holmes, CG-Bench Reasoning, and VRBench.
  • More accurate and interpretable multi-step decisions through explicit object grounding.
  • Reduced over-reliance on saliency-driven cues by anchoring each step to specific visual evidence regions.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The progressive grounding mechanism could be tested on longer videos with frequent object reappearances to check if trajectory reliability holds.
  • Similar controllers with format rewards might improve step-by-step interpretability when applied to other sequential visual tasks such as action localization.
  • The separation of search guidance from final decision output suggests a route for combining this approach with existing chain-of-thought methods in multimodal models.

Load-bearing premise

Reinforcement learning with a format reward will produce genuine object-grounding capability and reliable reasoning trajectories rather than superficial format compliance or benchmark overfitting.

What would settle it

Evaluating the model on a held-out video reasoning task with novel object variations and checking whether the generated grounding regions match the actual reasoning steps used in the output trajectories.

Figures

Figures reproduced from arXiv: 2604.14692 by Bo Cheng, Genbao Xu, Nan Ma, Quanxing Zha, Soujanya Poria, Teng Wang, Wei Rao, Wenyuan Gu, Zhixuan Wu.

Figure 1
Figure 1. Figure 1: Inconsistent reasoning in prior models and improved with Chain-of-Glimpse. (a) Vanilla RL-based and (b) CoT-based models both insufficient evidence integration and global context oversight, as they tend to rely on superficial, visually prominent cues. Consequently, they fail to capture complex dependencies, leading to inconsistent reasoning (D and C). In contrast, (c) our Chain-of-Glimpse performs progress… view at source ↗
Figure 2
Figure 2. Figure 2: Overview of Chain-of-Glimpse. Chain-of-Glimpse formulates video reasoning as a search-guided, multi-turn object-grounded decision process. Given a video and a query, the model searches over object-grounded reasoning trajectories and optimizes them via reinforcement learning with task-level rewards, enabling accurate reasoning beyond visually salient cues. intermediate reasoning states help bridge low-level… view at source ↗
Figure 3
Figure 3. Figure 3: Effect of MCTS rollout numbers on Video-Holmes. [PITH_FULL_IMAGE:figures/full_fig_p008_3.png] view at source ↗
Figure 4
Figure 4. Figure 4: Ablation studies on NExTQA and Video-Holmes. [PITH_FULL_IMAGE:figures/full_fig_p008_4.png] view at source ↗
read the original abstract

Video understanding requires identifying and reasoning over semantically discriminative visual objects across frames, yet existing object-agnostic solutions struggle to effectively handle substantial object variations over time. To address this, we introduce Chain-of-Glimpse, a search-guided progressive object-grounded reasoning framework that explicitly anchors each reasoning step to specific visual evidence regions, enabling compositional and multi-step decision-making. Formally, Chain-of-Glimpse formulates video reasoning as a step-by-step process that incrementally builds spatially grounded traces around task-relevant visual objects, thereby mitigating over-reliance on saliency-driven cues. Specifically, Chain-of-Glimpse features a search-guided controller, optimized via reinforcement learning with a format reward that significantly incentivizes grounding capability, to iteratively ground visual evidence regions and form reliable reasoning trajectories, yielding accurate and interpretable multi-step decisions. Extensive evaluations on both in domain NExTQA and out-of-domain Video-Holmes, CG-Bench Reasoning, and VRBench benchmarks demonstrate consistent performance gains, robustness and generalization of Chain-of-Glimpse across diverse video reasoning tasks.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 0 minor

Summary. The manuscript introduces Chain-of-Glimpse, a search-guided progressive object-grounded reasoning framework for video understanding. It formulates video reasoning as an incremental process that builds spatially grounded traces around task-relevant objects using a search-guided controller optimized via reinforcement learning with a format reward, claiming this yields accurate, interpretable multi-step decisions and consistent gains on in-domain NExTQA plus out-of-domain benchmarks (Video-Holmes, CG-Bench Reasoning, VRBench).

Significance. If the RL-trained controller with the described format reward produces verified object grounding and reliable trajectories (rather than format compliance alone), the framework could advance compositional video reasoning by explicitly anchoring steps to visual evidence and mitigating saliency-driven shortcuts.

major comments (2)
  1. [Abstract] Abstract: the assertion that the format reward 'significantly incentivizes grounding capability' is load-bearing for the central claim yet unsupported by any reward formulation details, ablation results, or grounding-quality metrics (e.g., IoU against pseudo-labels or trajectory validity scores). Without an auxiliary term beyond syntactic formatting, the policy can satisfy the reward via correctly structured but semantically arbitrary regions, leaving open whether reported gains reflect improved grounding or benchmark-specific prompt adherence.
  2. [Abstract] Abstract: no quantitative results, error bars, ablation studies, or baseline comparisons are supplied to substantiate 'consistent performance gains, robustness and generalization' across the listed benchmarks. This absence prevents verification of whether the data support the robustness claims.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. We address each major comment point by point below, clarifying the support present in the full paper and outlining targeted revisions to the abstract and related sections.

read point-by-point responses
  1. Referee: [Abstract] Abstract: the assertion that the format reward 'significantly incentivizes grounding capability' is load-bearing for the central claim yet unsupported by any reward formulation details, ablation results, or grounding-quality metrics (e.g., IoU against pseudo-labels or trajectory validity scores). Without an auxiliary term beyond syntactic formatting, the policy can satisfy the reward via correctly structured but semantically arbitrary regions, leaving open whether reported gains reflect improved grounding or benchmark-specific prompt adherence.

    Authors: We agree that the abstract would benefit from more direct support for this claim. The full manuscript (Section 3.2) specifies the format reward as requiring not only syntactic structure but also explicit output of object coordinates and bounding regions anchored to task-relevant visual evidence extracted via the search-guided controller; this goes beyond pure formatting by tying the reward to spatial grounding steps. Ablation results in Section 4.3 isolate the reward's contribution to trajectory reliability, and we will add references to these in the abstract. To further address the concern, we will include grounding-quality metrics (e.g., trajectory validity scores and IoU against pseudo-labels where available) in the revision and reference them from the abstract, allowing readers to assess whether gains derive from improved grounding rather than prompt adherence alone. revision: yes

  2. Referee: [Abstract] Abstract: no quantitative results, error bars, ablation studies, or baseline comparisons are supplied to substantiate 'consistent performance gains, robustness and generalization' across the listed benchmarks. This absence prevents verification of whether the data support the robustness claims.

    Authors: The abstract summarizes the evaluation outcomes, while the full manuscript supplies the supporting quantitative evidence: performance tables in Section 4 compare Chain-of-Glimpse against baselines on NExTQA (in-domain) and the out-of-domain sets (Video-Holmes, CG-Bench Reasoning, VRBench), with ablations in Section 4.3 and error bars reported for multi-run experiments. To make this substantiation visible at the abstract level, we will insert concise quantitative highlights (e.g., average gains and robustness indicators) along with brief mentions of the ablations and cross-benchmark consistency in the revised abstract. revision: partial

Circularity Check

0 steps flagged

No circularity in derivation chain

full rationale

The paper presents Chain-of-Glimpse as a novel framework that formulates video reasoning as an incremental process using a search-guided controller trained via reinforcement learning with a format reward. No equations, parameter fittings, or formal derivations are shown in the provided text that reduce any claimed prediction or result to the inputs by construction. The method introduces new components without self-definitional loops, fitted inputs renamed as predictions, or load-bearing self-citations that would make the central claims equivalent to prior outputs. The description remains self-contained as a proposed architecture evaluated on benchmarks, with no evidence of the enumerated circularity patterns.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review provides no mathematical formulation, so no free parameters, axioms, or invented entities can be extracted or audited.

pith-pipeline@v0.9.0 · 5515 in / 1077 out tokens · 39523 ms · 2026-05-10T11:59:15.413018+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

56 extracted references · 19 canonical work pages · 9 internal anchors

  1. [1]

    A simple llm framework for long-range video question- answering,

    C. Zhang, T. Lu, M. M. Islam, Z. Wang, S. Yu, M. Bansal, and G. Bertasius, “A simple llm framework for long-range video question- answering,” inProceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP), 2024, pp. 21 715–21 737

  2. [2]

    Understanding long videos in one multimodal language model pass,

    K. Ranasinghe, X. Li, K. Kahatapitiya, and M. S. Ryoo, “Understanding long videos in one multimodal language model pass,”arXiv preprint arXiv:2403.16998, vol. 3, no. 4, p. 12, 2024

  3. [3]

    Stimuvar: Spatiotemporal stimuli-aware video affective reasoning with multimodal large language models,

    Y . Guo, F. Siddiqui, Y . Zhao, R. Chellappa, and S.-Y . Lo, “Stimuvar: Spatiotemporal stimuli-aware video affective reasoning with multimodal large language models,”International Journal of Computer Vision (IJCV), pp. 1–17, 2025

  4. [4]

    Dycoke: Dynamic com- pression of tokens for fast video large language models,

    K. Tao, C. Qin, H. You, Y . Sui, and H. Wang, “Dycoke: Dynamic com- pression of tokens for fast video large language models,” inProceedings of the Computer Vision and Pattern Recognition (CVPR), 2025, pp. 18 992–19 001

  5. [5]

    Vtimellm: Empower llm to grasp video moments,

    B. Huang, X. Wang, H. Chen, Z. Song, and W. Zhu, “Vtimellm: Empower llm to grasp video moments,” inProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2024, pp. 14 271–14 280

  6. [6]

    Chapter-llama: Efficient chaptering in hour-long videos with llms,

    L. Ventura, A. Yang, C. Schmid, and G. Varol, “Chapter-llama: Efficient chaptering in hour-long videos with llms,” inProceedings of the Com- puter Vision and Pattern Recognition (CVPR), 2025, pp. 18 947–18 958

  7. [7]

    Automated multi-level preference for mllms,

    M. Zhang, W. Wu, Y . Lu, Y . Song, K. Rong, H. Yao, J. Zhao, F. Liu, H. Feng, J. Wanget al., “Automated multi-level preference for mllms,” Advances in Neural Information Processing Systems (NeurIPS), pp. 26 171–26 194, 2024

  8. [8]

    Omnialign-v: Towards enhanced alignment of mllms with human preference,

    X. Zhao, S. Ding, Z. Zhang, H. Huang, M. Maosongcao, J. Wang, W. Wang, X. Fang, W. Wang, G. Zhaiet al., “Omnialign-v: Towards enhanced alignment of mllms with human preference,” inProceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (ACL), 2025, pp. 18 490–18 515

  9. [9]

    Grounded reinforcement learning for visual reasoning.arXiv preprint arXiv:2505.23678, 2025

    G. Sarch, S. Saha, N. Khandelwal, A. Jain, M. J. Tarr, A. Kumar, and K. Fragkiadaki, “Grounded reinforcement learning for visual reasoning,” arXiv preprint arXiv:2505.23678, 2025

  10. [10]

    Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution

    P. Wang, S. Bai, S. Tan, S. Wang, Z. Fan, J. Bai, K. Chen, X. Liu, J. Wang, W. Geet al., “Qwen2-vl: Enhancing vision-language model’s perception of the world at any resolution,”arXiv preprint arXiv:2409.12191, 2024

  11. [11]

    Timechat: A time-sensitive multimodal large language model for long video understanding,

    S. Ren, L. Yao, S. Li, X. Sun, and L. Hou, “Timechat: A time-sensitive multimodal large language model for long video understanding,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2024, pp. 14 313–14 323

  12. [12]

    Videoespresso: A large-scale chain-of-thought dataset for fine-grained video reasoning via core frame selection,

    S. Han, W. Huang, H. Shi, L. Zhuo, X. Su, S. Zhang, X. Zhou, X. Qi, Y . Liao, and S. Liu, “Videoespresso: A large-scale chain-of-thought dataset for fine-grained video reasoning via core frame selection,” in Proceedings of the Computer Vision and Pattern Recognition (CVPR), 2025, pp. 26 181–26 191

  13. [13]

    Qwen2.5-VL Technical Report

    S. Bai, K. Chen, X. Liu, J. Wang, W. Ge, S. Song, K. Dang, P. Wang, S. Wang, J. Tanget al., “Qwen2.5-vl technical report,”arXiv preprint arXiv:2502.13923, 2025

  14. [14]

    OneThinker: All-in-one Reasoning Model for Image and Video

    K. Feng, M. Zhang, H. Li, K. Fan, S. Chen, Y . Jiang, D. Zheng, P. Sun, Y . Zhang, H. Sunet al., “Onethinker: All-in-one reasoning model for image and video,”arXiv preprint arXiv:2512.03043, 2025

  15. [15]

    Video-R1: Reinforcing Video Reasoning in MLLMs

    K. Feng, K. Gong, B. Li, Z. Guo, Y . Wang, T. Peng, J. Wu, X. Zhang, B. Wang, and X. Yue, “Video-r1: Reinforcing video reasoning in mllms,” arXiv preprint arXiv:2503.21776, 2025

  16. [16]

    Videochat-r1: Enhancing spatio-temporal perception via reinforcement fine-tuning,

    X. Li, Z. Yan, D. Meng, L. Dong, X. Zeng, Y . He, Y . Wang, Y . Qiao, Y . Wang, and L. Wang, “Videochat-r1: Enhancing spatio-temporal per- ception via reinforcement fine-tuning,”arXiv preprint arXiv:2504.06958, 2025

  17. [17]

    Ddcot: Duty-distinct chain-of-thought prompting for multimodal reasoning in language models,

    G. Zheng, B. Yang, J. Tang, H.-Y . Zhou, and S. Yang, “Ddcot: Duty-distinct chain-of-thought prompting for multimodal reasoning in language models,”Advances in Neural Information Processing Systems (NeurIPS), vol. 36, pp. 5168–5191, 2023

  18. [18]

    Imagine while reasoning in space: Multimodal visualization-of- thought,

    C. Li, W. Wu, H. Zhang, Y . Xia, S. Mao, L. Dong, I. Vuli ´c, and F. Wei, “Imagine while reasoning in space: Multimodal visualization-of- thought,” inForty-second International Conference on Machine Learn- ing (ICML), 2025

  19. [19]

    Rethinking chain-of-thought reasoning for videos,

    Y . Zhong, Z.-Y . Hu, Y . Li, and L. Wang, “Rethinking chain-of-thought reasoning for videos,”arXiv preprint arXiv:2512.09616, 2025

  20. [20]

    Mmtom-qa: Multimodal theory of mind question answering,

    C. Jin, Y . Wu, J. Cao, J. Xiang, Y .-L. Kuo, Z. Hu, T. Ullman, A. Torralba, J. Tenenbaum, and T. Shu, “Mmtom-qa: Multimodal theory of mind question answering,” inProceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (ACL), 2024, pp. 16 077– 16 102. JOURNAL OF LATEX CLASS FILES, APIRL 2026 10

  21. [21]

    Morevqa: Exploring modular reasoning models for video question answering,

    J. Min, S. Buch, A. Nagrani, M. Cho, and C. Schmid, “Morevqa: Exploring modular reasoning models for video question answering,” inProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2024, pp. 13 235–13 245

  22. [22]

    End-to-end generative pretraining for multimodal video captioning,

    P. H. Seo, A. Nagrani, A. Arnab, and C. Schmid, “End-to-end generative pretraining for multimodal video captioning,” inProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2022, pp. 17 959–17 968

  23. [23]

    Video-xl: Extra-long vision language model for hour- scale video understanding,

    Y . Shu, Z. Liu, P. Zhang, M. Qin, J. Zhou, Z. Liang, T. Huang, and B. Zhao, “Video-xl: Extra-long vision language model for hour- scale video understanding,” inProceedings of the Computer Vision and Pattern Recognition (CVPR), 2025, pp. 26 160–26 169

  24. [24]

    Revisiting tem- poral modeling for clip-based image-to-video knowledge transferring,

    R. Liu, J. Huang, G. Li, J. Feng, X. Wu, and T. H. Li, “Revisiting tem- poral modeling for clip-based image-to-video knowledge transferring,” inProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2023, pp. 6555–6564

  25. [25]

    Streaming long video understanding with large language models,

    R. Qian, X. Dong, P. Zhang, Y . Zang, S. Ding, D. Lin, and J. Wang, “Streaming long video understanding with large language models,” Advances in Neural Information Processing Systems (NeurIPS), vol. 37, pp. 119 336–119 360, 2024

  26. [26]

    Moviechat+: Question-aware sparse memory for long video question answering,

    E. Song, W. Chai, T. Ye, J.-N. Hwang, X. Li, and G. Wang, “Moviechat+: Question-aware sparse memory for long video question answering,” IEEE Transactions on Pattern Analysis and Machine Intelligence, 2025

  27. [27]

    Learning high-quality dynamic memory for video object segmentation,

    Y . Liu, R. Yu, F. Yin, X. Zhao, W. Zhao, W. Xia, J. Wang, Y . Wang, Y . Tang, and Y . Yang, “Learning high-quality dynamic memory for video object segmentation,”IEEE Transactions on Pattern Analysis and Machine Intelligence, 2025

  28. [28]

    Cotdet: Affordance knowledge prompting for task driven object detection,

    J. Tang, G. Zheng, J. Yu, and S. Yang, “Cotdet: Affordance knowledge prompting for task driven object detection,” inProceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), 2023, pp. 3068–3078

  29. [29]

    Sf2t: Self-supervised fragment finetuning of video-llms for fine-grained understanding,

    Y . Hu, Z. Song, N. Feng, Y . Luo, J. Yu, Y .-P. P. Chen, and W. Yang, “Sf2t: Self-supervised fragment finetuning of video-llms for fine-grained understanding,” inProceedings of the Computer Vision and Pattern Recognition (CVPR), 2025, pp. 29 108–29 117

  30. [30]

    Ma-lmm: Memory-augmented large multimodal model for long-term video understanding,

    B. He, H. Li, Y . K. Jang, M. Jia, X. Cao, A. Shah, A. Shrivastava, and S.-N. Lim, “Ma-lmm: Memory-augmented large multimodal model for long-term video understanding,” inProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2024, pp. 13 504–13 514

  31. [31]

    Compositional chain- of-thought prompting for large multimodal models,

    C. Mitra, B. Huang, T. Darrell, and R. Herzig, “Compositional chain- of-thought prompting for large multimodal models,” inProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2024, pp. 14 420–14 431

  32. [32]

    Videorefer suite: Advancing spatial- temporal object understanding with video llm,

    Y . Yuan, H. Zhang, W. Li, Z. Cheng, B. Zhang, L. Li, X. Li, D. Zhao, W. Zhang, Y . Zhuanget al., “Videorefer suite: Advancing spatial- temporal object understanding with video llm,” inProceedings of the Computer Vision and Pattern Recognition (CVPR), 2025, pp. 18 970– 18 980

  33. [33]

    Pixel-level reasoning segmentation via multi-turn conversations,

    D. Cai, X. Yang, Y . Liu, D. Wang, S. Feng, Y . Zhang, and S. Poria, “Pixel-level reasoning segmentation via multi-turn conversations,” in Proceedings of the 62nd Annual Meeting of the Association for Com- putational Linguistics (ACL), 2025, pp. 17 660–17 679

  34. [34]

    Re-thinking temporal search for long-form video understanding,

    J. Ye, Z. Wang, H. Sun, K. Chandrasegaran, Z. Durante, C. Eyzaguirre, Y . Bisk, J. C. Niebles, E. Adeli, L. Fei-Feiet al., “Re-thinking temporal search for long-form video understanding,” inProceedings of the Com- puter Vision and Pattern Recognition (CVPR), 2025, pp. 8579–8591

  35. [35]

    Video- of-thought: Step-by-step video reasoning from perception to cognition.arXiv preprint arXiv:2501.03230, 2024

    H. Fei, S. Wu, W. Ji, H. Zhang, M. Zhang, M.-L. Lee, and W. Hsu, “Video-of-thought: Step-by-step video reasoning from perception to cognition,”arXiv preprint arXiv:2501.03230, 2024

  36. [36]

    arXiv preprint arXiv:2501.04001 (2025)

    H. Yuan, X. Li, T. Zhang, Y . Sun, Z. Huang, S. Xu, S. Ji, Y . Tong, L. Qi, J. Fenget al., “Sa2va: Marrying sam2 with llava for dense grounded understanding of images and videos,”arXiv preprint arXiv:2501.04001, 2025

  37. [37]

    Vistadpo: Video hierarchical spatial-temporal direct preference optimization for large video models,

    H. Huang, H. Chen, S. Wu, M. Luo, J. Fu, X. Du, H. Zhang, and H. Fei, “Vistadpo: Video hierarchical spatial-temporal direct preference optimization for large video models,” inForty-second International Conference on Machine Learning (ICML), 2025

  38. [38]

    DeepEyes: Incentivizing "Thinking with Images" via Reinforcement Learning

    Z. Zheng, M. Yang, J. Hong, C. Zhao, G. Xu, L. Yang, C. Shen, and X. Yu, “Deepeyes: Incentivizing” thinking with images” via reinforce- ment learning,”arXiv preprint arXiv:2505.14362, 2025

  39. [39]

    Cot-vla: Visual chain-of-thought reasoning for vision-language-action models,

    Q. Zhao, Y . Lu, M. J. Kim, Z. Fu, Z. Zhang, Y . Wu, Z. Li, Q. Ma, S. Han, C. Finnet al., “Cot-vla: Visual chain-of-thought reasoning for vision-language-action models,” inProceedings of the Computer Vision and Pattern Recognition (CVPR), 2025, pp. 1702–1713

  40. [40]

    Visual chain-of-thought prompting for knowledge-based visual reason- ing,

    Z. Chen, Q. Zhou, Y . Shen, Y . Hong, Z. Sun, D. Gutfreund, and C. Gan, “Visual chain-of-thought prompting for knowledge-based visual reason- ing,” inProceedings of the AAAI Conference on Artificial Intelligence (AAAI), 2024, pp. 1254–1262

  41. [41]

    Visual cot: Advancing multi-modal language models with a comprehensive dataset and benchmark for chain-of-thought reasoning,

    H. Shao, S. Qian, H. Xiao, G. Song, Z. Zong, L. Wang, Y . Liu, and H. Li, “Visual cot: Advancing multi-modal language models with a comprehensive dataset and benchmark for chain-of-thought reasoning,” Advances in Neural Information Processing Systems (NeurIPS), pp. 8612–8642, 2024

  42. [42]

    An analysis of monte carlo tree search,

    S. James, G. Konidaris, and B. Rosman, “An analysis of monte carlo tree search,” inProceedings of the AAAI Conference on Artificial Intelligence (AAAI), 2017

  43. [43]

    Videoagent: Long- form video understanding with large language model as agent,

    X. Wang, Y . Zhang, O. Zohar, and S. Yeung-Levy, “Videoagent: Long- form video understanding with large language model as agent,” in European Conference on Computer Vision (ECCV). Springer, 2024, pp. 58–76

  44. [44]

    Language repository for long video understanding,

    K. Kahatapitiya, K. Ranasinghe, J. Park, and M. S. Ryoo, “Language repository for long video understanding,” inFindings of the Association for Computational Linguistics (ACL), 2025, pp. 5627–5646

  45. [45]

    Video-llava: Learning united visual representation by alignment before projection,

    B. Lin, Y . Ye, B. Zhu, J. Cui, M. Ning, P. Jin, and L. Yuan, “Video-llava: Learning united visual representation by alignment before projection,” inProceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP), 2024, pp. 5971–5984

  46. [46]

    GPT-4o System Card

    A. Hurst, A. Lerer, A. P. Goucher, A. Perelman, A. Ramesh, A. Clark, A. Ostrow, A. Welihinda, A. Hayes, A. Radfordet al., “Gpt-4o system card,”arXiv preprint arXiv:2410.21276, 2024

  47. [47]

    Gemini 2.5: Pushing the Frontier with Advanced Reasoning, Multimodality, Long Context, and Next Generation Agentic Capabilities

    G. Comanici, E. Bieber, M. Schaekermann, I. Pasupat, N. Sachdeva, I. Dhillon, M. Blistein, O. Ram, D. Zhang, E. Rosenet al., “Gem- ini 2.5: Pushing the frontier with advanced reasoning, multimodality, long context, and next generation agentic capabilities,”arXiv preprint arXiv:2507.06261, 2025

  48. [48]

    Expanding Performance Boundaries of Open-Source Multimodal Models with Model, Data, and Test-Time Scaling

    Z. Chen, W. Wang, Y . Cao, Y . Liu, Z. Gao, E. Cui, J. Zhu, S. Ye, H. Tian, Z. Liuet al., “Expanding performance boundaries of open- source multimodal models with model, data, and test-time scaling,” arXiv preprint arXiv:2412.05271, 2024

  49. [49]

    Exploring the Effect of Reinforcement Learning on Video Understanding: Insights from SEED-Bench-R1,

    Y . Chen, Y . Ge, R. Wang, Y . Ge, L. Qiu, Y . Shan, and X. Liu, “Exploring the effect of reinforcement learning on video understanding: Insights from seed-bench-r1,”arXiv preprint arXiv:2503.24376, 2025

  50. [50]

    Next-qa: Next phase of question-answering to explaining temporal actions,

    J. Xiao, X. Shang, A. Yao, and T.-S. Chua, “Next-qa: Next phase of question-answering to explaining temporal actions,” inProceedings of the IEEE/CVF Computer Vision and Pattern Recognition (CVPR), 2021, pp. 9777–9786

  51. [51]

    Video-holmes: Can mllm think like holmes for complex video reasoning?, 2025

    J. Cheng, Y . Ge, T. Wang, Y . Ge, J. Liao, and Y . Shan, “Video- holmes: Can mllm think like holmes for complex video reasoning?” arXiv preprint arXiv:2505.21374, 2025

  52. [52]

    Cg-bench: Clue-grounded question answering benchmark for long video understanding,

    G. Chen, Y . Liu, Y . Huang, B. Pei, J. Xu, Y . He, T. Lu, Y . Wang, and L. Wang, “Cg-bench: Clue-grounded question answering benchmark for long video understanding,” inThe Thirteenth International Conference on Learning Representations (ICLR), 2025

  53. [53]

    Vrbench: A benchmark for multi-step reasoning in long narrative videos,

    J. Yu, Y . Wu, M. Chu, Z. Ren, Z. Huang, P. Chu, R. Zhang, Y . He, Q. Li, S. Liet al., “Vrbench: A benchmark for multi-step reasoning in long narrative videos,”arXiv preprint arXiv:2506.10857, 2025

  54. [54]

    Egoschema: A diagnostic benchmark for very long-form video language understanding,

    K. Mangalam, R. Akshulakov, and J. Malik, “Egoschema: A diagnostic benchmark for very long-form video language understanding,”Advances in Neural Information Processing Systems (NeurIPS), vol. 36, pp. 46 212–46 244, 2023

  55. [55]

    STAR: A Benchmark for Situ- ated Reasoning in Real-World Videos.arXiv e-prints, art

    B. Wu, S. Yu, Z. Chen, J. B. Tenenbaum, and C. Gan, “Star: A benchmark for situated reasoning in real-world videos,”arXiv preprint arXiv:2405.09711, 2024

  56. [56]

    LlamaFactory: Unified Efficient Fine-Tuning of 100+ Language Models

    Y . Zheng, R. Zhang, J. Zhang, Y . Ye, Z. Luo, Z. Feng, and Y . Ma, “Llamafactory: Unified efficient fine-tuning of 100+ language models,” arXiv preprint arXiv:2403.13372, 2024