pith. machine review for the scientific record.

arxiv: 2605.13228 · v1 · submitted 2026-05-13 · 💻 cs.CV · cs.AI


ReTool-Video: Recursive Tool-Using Video Agents with Meta-Augmented Tool Grounding


Pith reviewed 2026-05-14 20:04 UTC · model grok-4.3

classification 💻 cs.CV cs.AI
keywords video understanding · tool-augmented agents · recursive reasoning · meta tools · multimodal agents · temporal reasoning · video question answering

The pith

ReTool-Video recursively grounds abstract video intents into executable tool chains using a library of 134 meta-augmented tools.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Video agents need to seek evidence across time and modalities for complex reasoning, but prior systems are limited by coarse tools and flat mappings that force every intent into primitive calls. ReTool-Video introduces a MetaAug-Video Tool Library with 26 base tools for multimodal processing and 108 meta tools for filtering, aggregation, reranking, and formatting, enabling dual access to structured and raw evidence. A recursive resolver matches direct intents to tools or delegates unmatched ones for parameter repair, substitution, or decomposition, translating abstract steps like temporal merging into concrete operations at runtime. Experiments on MVBench, MLVU, and Video-MME show consistent outperformance over baselines, with analysis linking gains to the recursive process and fine-grained meta tools. A sympathetic reader cares because the approach supports open-ended video queries without requiring exhaustive pre-definition of every possible action.

Core claim

ReTool-Video proposes a recursive tool-using method that grounds high-level video intents into executable tool chains, where matched actions execute directly and unmatched intents are delegated to a resolver for parameter repair, tool substitution, or decomposition. This is enabled by the MetaAug-Video Tool Library (MVTL) containing 134 registered tools—26 base tools for general multimodal signal processing and 108 meta tools for intermediate operations—supporting diverse scenarios through dual-level access to structured video information and raw modal evidence.
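The match-or-delegate loop described in this claim can be sketched as a small recursion. The tool names, the decomposition rule, and the depth limit below are illustrative assumptions, not the paper's actual interfaces; the real resolver also performs parameter repair and tool substitution, which this sketch models only as decomposition.

```python
# Hypothetical sketch of recursive tool grounding in the spirit of ReTool-Video.
# REGISTRY stands in for the MVTL's registered tools; decompose() stands in for
# the resolver. All names and rules here are assumptions for illustration.
from dataclasses import dataclass, field

@dataclass
class Intent:
    name: str
    args: dict = field(default_factory=dict)

# Toy registry standing in for the 134 registered tools.
REGISTRY = {
    "sample_frames": lambda args: f"frames({args.get('fps', 1)})",
    "filter_by_score": lambda args: f"filtered({args.get('threshold', 0.5)})",
}

def ground(intent: Intent, depth: int = 0, max_depth: int = 3) -> list[str]:
    """Recursively translate a high-level intent into an executable tool chain."""
    if depth > max_depth:
        return []  # give up once the recursion budget is exhausted
    tool = REGISTRY.get(intent.name)
    if tool is not None:
        return [tool(intent.args)]  # matched action: execute directly
    # Unmatched intent: delegate to the resolver, modeled here as
    # decomposition into sub-intents and recursive grounding of each.
    chain: list[str] = []
    for sub in decompose(intent):
        chain.extend(ground(sub, depth + 1, max_depth))
    return chain

def decompose(intent: Intent) -> list[Intent]:
    # Toy rule: an abstract "temporal_merge" becomes sampling + filtering.
    if intent.name == "temporal_merge":
        return [Intent("sample_frames", {"fps": 2}),
                Intent("filter_by_score", {"threshold": 0.7})]
    return []

print(ground(Intent("temporal_merge")))  # ['frames(2)', 'filtered(0.7)']
```

The key property the sketch preserves is that matched actions short-circuit to direct execution, while only unmatched intents pay the cost of recursive resolution.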

What carries the argument

The recursive grounding mechanism in ReTool-Video, which delegates unmatched high-level intents to a resolver for progressive translation into concrete multimodal tool operations, supported by the MVTL library of 134 tools.

If this is right

  • Abstract actions such as temporal merging, cross-modal verification, and repeated-event aggregation can be translated into concrete multimodal operations at runtime.
  • Recursive decomposition reduces the need for rigid pre-mapping of intents and improves handling of compositional video reasoning.
  • Fine-grained meta tools for filtering, aggregation, and reranking increase stability and effectiveness on tasks requiring temporal and multimodal evidence.
  • Performance gains hold across MVBench, MLVU, and Video-MME without subtitles when both recursion and meta tools are used.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The resolver-based decomposition could be tested on queries engineered to trigger repeated failures, revealing the practical depth limit of recursion.
  • Similar recursive tool libraries might transfer to other sequential modalities such as audio streams or long document navigation.
  • Extending the 134-tool set with domain-specific additions could support specialized video domains without retraining the core agent.
  • Measuring resolver error rates directly on held-out intents would quantify how much of the reported gains depend on reliable decomposition.

Load-bearing premise

High-level video intents can be reliably matched or decomposed by the resolver into the 134 registered tools without introducing errors, excessive recursion, or loss of reasoning fidelity.

What would settle it

A controlled test set of video queries with deliberately ambiguous or complex intents where the resolver produces frequent mismatches or excessive recursion, resulting in accuracy on MVBench no higher than non-recursive baselines.

Figures

Figures reproduced from arXiv: 2605.13228 by Changjian Wang, Guohui Xiang, Jiang Zhong, Junnan Zhu, KaiWen Wei, Nayu Liu, Rongzhen Li, Ruirui Chen, Xiao Liu.

Figure 1. Motivation of ReTool-Video: conventional video tool-using agents force abstract actions (e.g., temporal merging or cross-modal verification) into primitive calls, while ReTool-Video recursively grounds them into executable multimodal tool chains.
Figure 2. Functional categories of base tools and meta tools in MVTL.
Figure 3. Overall framework of ReTool-Video, where the planner selects primitive or abstract actions, the MetaAug-Video tool library provides base and meta tools, and the resolver recursively grounds abstract actions into executable tool chains.
Figure 4. An example of recursive tool grounding in ReTool-Video, where abstract actions are recursively resolved into executable tool chains.
Figure 6. Relationship between final accuracy and tool-use behavior, including tool-call count.
Figure 5. Distribution of runtime tool calls in ReTool-Video, showing the relative invocation frequency of each base/meta tool category during model inference.
Figure 7. Case study experiment on the MLVU benchmark.
Figure 8. Dual-level evidence access in ReTool-Video: the planner can use both preprocessed structured evidence and raw visual or video evidence returned by tools.
original abstract

Video understanding requires active evidence seeking, motivating tool-augmented video agents for temporal reasoning, cross-modal understanding, and complex question answering. Existing video agents have improved video reasoning with retrieval, memory, frame inspection, and verifier tools, but they still face two limitations: (1) a coarse tool space that lacks fine-grained operations for compositional reasoning; and (2) a flat action space that forces high-level video intents into primitive executable tool calls. In this paper, we address these challenges with two complementary designs. First, we construct a MetaAug-Video Tool Library (MVTL), an extensible tool library with 134 registered tools, including 26 base tools for general multimodal signal processing and 108 meta tools for filtering, aggregation, reranking, formatting, and other intermediate-result operations. MVTL supports dual-level access to both structured video information and raw modal evidence, enabling diverse video reasoning scenarios. Second, we propose ReTool-Video, a recursive tool-using method that grounds high-level video intents into executable tool chains. In ReTool-Video, matched actions are executed directly, while unmatched intents are delegated to a resolver for parameter repair, tool substitution, or decomposition. This allows abstract actions such as temporal merging, cross-modal verification, or repeated-event aggregation to be progressively translated into concrete multimodal operations at runtime. Experiments on MVBench, MLVU, and Video-MME w/o sub. show that ReTool-Video consistently outperforms strong baselines. Further analysis demonstrates that recursive grounding and fine-grained meta tools improve the stability and effectiveness of complex video understanding.
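The base/meta split and dual-level access the abstract describes can be sketched as a minimal tool registry. The decorator, category names, and example tools below are assumptions for illustration; the paper does not specify MVTL's API.

```python
# Hypothetical sketch of a base/meta tool registry in the spirit of MVTL.
# "base" tools return raw modal evidence; "meta" tools operate on
# intermediate results (filtering, aggregation, reranking, formatting).
from typing import Callable

BASE, META = "base", "meta"
TOOLS: dict[str, tuple[str, Callable]] = {}

def register(name: str, kind: str):
    """Register a tool under a category so the agent can look it up by name."""
    def wrap(fn):
        TOOLS[name] = (kind, fn)
        return fn
    return wrap

@register("extract_audio", BASE)  # base tool: raw modal evidence
def extract_audio(video: str) -> str:
    return f"audio_track({video})"

@register("rerank_by_relevance", META)  # meta tool: intermediate-result op
def rerank(items: list, scores: list) -> list:
    # Highest-scoring items first.
    return [x for _, x in sorted(zip(scores, items), reverse=True)]

# Dual-level access: the same registry serves both raw evidence requests
# and structured post-processing, discoverable by category.
base_tools = [n for n, (k, _) in TOOLS.items() if k == BASE]
meta_tools = [n for n, (k, _) in TOOLS.items() if k == META]
```

A registry like this makes the library extensible in the way the abstract claims: adding a tool is one registration, with no change to the planner or resolver.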

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper proposes ReTool-Video, a recursive tool-using agent for video understanding that grounds high-level intents into executable chains via a resolver. It introduces the MetaAug-Video Tool Library (MVTL) with 134 tools (26 base multimodal processing tools plus 108 meta tools for filtering, aggregation, reranking, and formatting). The central claim is that this dual-level tool access plus recursive delegation for unmatched intents (repair, substitution, or decomposition) yields consistent gains over baselines on MVBench, MLVU, and Video-MME (without subtitles), while improving stability for complex temporal and cross-modal reasoning.

Significance. If the empirical results and ablation analyses hold, the work offers a concrete mechanism for scaling tool-augmented video agents beyond flat action spaces. The extensible MVTL and resolver-based recursion could provide a reusable substrate for compositional video reasoning, with potential impact on downstream tasks requiring evidence-seeking over long videos.

major comments (3)
  1. [Abstract / Experiments] The claim that ReTool-Video 'consistently outperforms strong baselines' is stated without reported numerical scores, baseline names, or statistical significance tests. This leaves the magnitude and reliability of the gains unassessable and makes it impossible to determine whether improvements exceed what would be expected from simply increasing the tool count.
  2. [Abstract / §4] The resolver is described as handling unmatched intents via repair, substitution, or decomposition, yet no resolver accuracy metric, recursion-depth statistics, or error-propagation analysis is referenced. Without these, gains cannot be confidently attributed to recursive grounding rather than the sheer size of the 134-tool library.
  3. [Further analysis] The statement that 'recursive grounding and fine-grained meta tools improve stability' requires supporting ablations (e.g., ReTool-Video vs. a non-recursive baseline with an identical MVTL, or a meta-tool ablation). Absent such controls, the causal contribution of the two proposed designs remains unverified.
minor comments (2)
  1. [Abstract] The abstract mentions 'Video-MME w/o sub.' without defining the abbreviation or the exact evaluation protocol on first use.
  2. [§3 / §4] Notation for the resolver and MVTL access modes is introduced without a compact diagram or pseudocode in the method overview, making the dual-level access hard to visualize.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the thoughtful and constructive comments on our work. We address each major comment below and will revise the manuscript to incorporate the suggested improvements for clarity and rigor.

point-by-point responses
  1. Referee: [Abstract / Experiments] The claim that ReTool-Video 'consistently outperforms strong baselines' is stated without reported numerical scores, baseline names, or statistical significance tests, leaving the magnitude and reliability of the gains unassessable and making it impossible to determine whether improvements exceed what would be expected from simply increasing the tool count.

    Authors: We acknowledge that the abstract provides a high-level summary without specific numerical values or baseline names. The experiments section does include detailed comparisons, but to make the claims more assessable, we will revise the abstract to include key performance metrics (e.g., accuracy improvements on each benchmark) and explicitly name the baselines. Additionally, we will add statistical significance tests in the experiments section and reference them in the abstract to confirm that the gains are meaningful beyond the increased tool count. revision: yes

  2. Referee: [Abstract / §4] The resolver is described as handling unmatched intents via repair, substitution, or decomposition, yet no resolver accuracy metric, recursion-depth statistics, or error-propagation analysis is referenced. Without these, gains cannot be confidently attributed to recursive grounding rather than the sheer size of the 134-tool library.

    Authors: We agree that quantitative evaluation of the resolver would help isolate the contribution of recursive grounding. In the revised manuscript, we will add metrics for resolver accuracy (success rates for repair, substitution, and decomposition), statistics on recursion depth across queries, and an analysis of error propagation. We will also include a comparison to a non-recursive variant using the full MVTL to demonstrate that the recursive mechanism provides benefits beyond the tool library size. revision: yes

  3. Referee: [Further analysis] The statement that 'recursive grounding and fine-grained meta tools improve stability' requires supporting ablations (e.g., ReTool-Video vs. a non-recursive baseline with an identical MVTL, or a meta-tool ablation); absent such controls, the causal contribution of the two proposed designs remains unverified.

    Authors: We recognize the need for explicit ablations to support the claims about stability improvements. We will expand the further analysis section with new ablation studies: one comparing ReTool-Video to a non-recursive baseline with the same 134-tool MVTL, and another ablating the meta-tools to measure their specific impact. These will provide direct evidence for the causal contributions of recursive grounding and the meta-augmented tools to stability and performance. revision: yes

Circularity Check

0 steps flagged

No circularity: design choices evaluated on external benchmarks

full rationale

The paper constructs MVTL (134 tools) and ReTool-Video's recursive resolver by design, then measures performance on independent external benchmarks (MVBench, MLVU, Video-MME). No equations, fitted parameters, or predictions reduce to quantities derived from the same data used to define the tools or resolver. No self-citation chains or uniqueness theorems are invoked to justify core claims. The derivation chain is self-contained against external evaluation.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 2 invented entities

The central claim rests on the newly introduced MVTL library and recursive resolver being effective for video reasoning; these components are defined within the paper and evaluated only on the reported benchmarks.

axioms (1)
  • domain assumption Standard video understanding benchmarks (MVBench, MLVU, Video-MME) provide valid measures of complex temporal and cross-modal reasoning.
    All reported gains are measured against these datasets.
invented entities (2)
  • MetaAug-Video Tool Library (MVTL) no independent evidence
    purpose: Extensible collection of 26 base and 108 meta tools for fine-grained multimodal video operations
    Newly constructed library introduced in this work.
  • ReTool-Video recursive resolver no independent evidence
    purpose: Mechanism that matches, repairs, substitutes, or decomposes high-level intents into executable tool chains
    Core algorithmic contribution proposed here.

pith-pipeline@v0.9.0 · 5614 in / 1442 out tokens · 58900 ms · 2026-05-14T20:04:07.533132+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

89 extracted references · 80 canonical work pages · 11 internal anchors

  1. [1]

    Model System Cards

    Anthropic. Claude Sonnet 4.5 System Card. https://www.anthropic.com/claude-sonnet-4-5-system-card, September 2025. Listed on Anthropic “Model System Cards” page

  2. [2]

    Sharegpt4video: Improving video understanding and generation with better captions.Advances in Neural Information Processing Systems, 37:19472–19495, 2024

    Lin Chen, Xilin Wei, Jinsong Li, Xiaoyi Dong, Pan Zhang, Yuhang Zang, Zehui Chen, Haodong Duan, Bin Lin, Zhenyu Tang, et al. Sharegpt4video: Improving video understanding and generation with better captions.Advances in Neural Information Processing Systems, 37:19472–19495, 2024

  3. [3]

    VideoLLaMA 2: Advancing Spatial-Temporal Modeling and Audio Understanding in Video-LLMs

    Zesen Cheng, Sicong Leng, Hang Zhang, Yifei Xin, Xin Li, Guanzheng Chen, Yongxin Zhu, Wenqi Zhang, Ziyang Luo, Deli Zhao, et al. Videollama 2: Advancing spatial-temporal modeling and audio understanding in video-llms.arXiv preprint arXiv:2406.07476, 2024

  4. [4]

    Video question answering with procedural programs

    Rohan Choudhury, Koichiro Niinuma, Kris M Kitani, and László A Jeni. Video question answering with procedural programs. InEuropean Conference on Computer Vision, pages 315–332. Springer, 2024

  5. [5]

    Gemini 2.5: Pushing the Frontier with Advanced Reasoning, Multimodality, Long Context, and Next Generation Agentic Capabilities

    Gheorghe Comanici, Eric Bieber, Mike Schaekermann, Ice Pasupat, Noveen Sachdeva, Inderjit Dhillon, Marcel Blistein, Ori Ram, Dan Zhang, Evan Rosen, et al. Gemini 2.5: Pushing the frontier with advanced reasoning, multimodality, long context, and next generation agentic capabilities. arXiv preprint arXiv:2507.06261, 2025

  6. [6]

    Tool-augmented spatiotemporal reasoning for streamlining video question answering task.arXiv preprint arXiv:2512.10359, 2025

    Sunqi Fan, Jiashuo Cui, Meng-Hao Guo, and Shuojin Yang. Tool-augmented spatiotemporal reasoning for streamlining video question answering task.arXiv preprint arXiv:2512.10359, 2025

  7. [7]

    Videoagent: A memory- augmented multimodal agent for video understanding

    Yue Fan, Xiaojian Ma, Rujie Wu, Yuntao Du, Jiaqi Li, Zhi Gao, and Qing Li. Videoagent: A memory- augmented multimodal agent for video understanding. InEuropean Conference on Computer Vision, pages 75–92. Springer, 2024

  8. [8]

    Video-mme: The first-ever comprehensive evaluation benchmark of multi-modal llms in video analysis

    Chaoyou Fu, Yuhan Dai, Yongdong Luo, Lei Li, Shuhuai Ren, Renrui Zhang, Zihan Wang, Chenyu Zhou, Yunhang Shen, Mengdan Zhang, et al. Video-mme: The first-ever comprehensive evaluation benchmark of multi-modal llms in video analysis. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 24108–24118, 2025

  9. [9]

    Framemind: Frame-interleaved video reasoning via reinforcement learning.arXiv preprint arXiv:2509.24008, 2025

    Haonan Ge, Yiwei Wang, Kai-Wei Chang, Hang Wu, and Yujun Cai. Framemind: Frame-interleaved video reasoning via reinforcement learning.arXiv preprint arXiv:2509.24008, 2025

  10. [10]

    GPT-4o System Card

    Aaron Hurst, Adam Lerer, Adam P Goucher, Adam Perelman, Aditya Ramesh, Aidan Clark, AJ Ostrow, Akila Welihinda, Alan Hayes, Alec Radford, et al. Gpt-4o system card.arXiv preprint arXiv:2410.21276, 2024

  11. [11]

    Dense-captioning events in videos

    Ranjay Krishna, Kenji Hata, Frederic Ren, Li Fei-Fei, and Juan Carlos Niebles. Dense-captioning events in videos. InProceedings of the IEEE international conference on computer vision, pages 706–715, 2017

  12. [12]

    Tvqa: Localized, compositional video question answering

    Jie Lei, Licheng Yu, Mohit Bansal, and Tamara Berg. Tvqa: Localized, compositional video question answering. InProceedings of the 2018 conference on empirical methods in natural language processing, pages 1369–1379, 2018

  13. [13]

    Agent-oriented planning in multi-agent systems

    Ao Li, Yuexiang Xie, Songze Li, Fugee Tsung, Bolin Ding, and Yaliang Li. Agent-oriented planning in multi-agent systems.arXiv preprint arXiv:2410.02189, 2024

  14. [14]

    VideoThinker: Building Agentic VideoLLMs with LLM-Guided Tool Reasoning

    Chenglin Li, Qianglong Chen, Feng Han, Yikun Wang, Xingxi Yin, Yan Gong, Ruilin Li, Yin Zhang, and Jiaqi Wang. Videothinker: Building agentic videollms with llm-guided tool reasoning. arXiv preprint arXiv:2601.15724, 2026

  15. [15]

    Mvbench: A comprehensive multi-modal video understanding benchmark

    Kunchang Li, Yali Wang, Yinan He, Yizhuo Li, Yi Wang, Yi Liu, Zun Wang, Jilan Xu, Guo Chen, Ping Luo, et al. Mvbench: A comprehensive multi-modal video understanding benchmark. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 22195–22206, 2024

  16. [16]

    Qwen3-VL-Embedding and Qwen3-VL-Reranker: A Unified Framework for State-of-the-Art Multimodal Retrieval and Ranking

    Mingxin Li, Yanzhao Zhang, Dingkun Long, Chen Keqin, Sibo Song, Shuai Bai, Zhibo Yang, Pengjun Xie, An Yang, Dayiheng Liu, Jingren Zhou, and Junyang Lin. Qwen3-vl-embedding and qwen3-vl-reranker: A unified framework for state-of-the-art multimodal retrieval and ranking.arXiv preprint arXiv:2601.04720, 2026

  17. [17]

    Llama-vid: An image is worth 2 tokens in large language models

    Yanwei Li, Chengyao Wang, and Jiaya Jia. Llama-vid: An image is worth 2 tokens in large language models. InEuropean Conference on Computer Vision, pages 323–340. Springer, 2024

  18. [18]

    Video-llava: Learning united visual representation by alignment before projection

    Bin Lin, Yang Ye, Bin Zhu, Jiaxi Cui, Munan Ning, Peng Jin, and Li Yuan. Video-llava: Learning united visual representation by alignment before projection. InProceedings of the 2024 conference on empirical methods in natural language processing, pages 5971–5984, 2024

  19. [19]

    Videoseek: Long-horizon video agent with tool-guided seeking.arXiv preprint arXiv:2603.20185, 2026

    Jingyang Lin, Jialian Wu, Jiang Liu, Ximeng Sun, Ze Wang, Xiaodong Yu, Jiebo Luo, Zicheng Liu, and Emad Barsoum. Videoseek: Long-horizon video agent with tool-guided seeking.arXiv preprint arXiv:2603.20185, 2026

  20. [20]

    Pc-agent: A hierarchical multi-agent collaboration framework for complex task automation on pc.arXiv preprint arXiv:2502.14282, 2025

    Haowei Liu, Xi Zhang, Haiyang Xu, Yuyang Wanyan, Junyang Wang, Ming Yan, Ji Zhang, Chunfeng Yuan, Changsheng Xu, Weiming Hu, et al. Pc-agent: A hierarchical multi-agent collaboration framework for complex task automation on pc.arXiv preprint arXiv:2502.14282, 2025

  21. [21]

    Kangaroo: A powerful video-language model supporting long-context video input

    Jiajun Liu, Yibing Wang, Hanghang Ma, Xiaoping Wu, Xiaoqi Ma, Xiaoming Wei, Jianbin Jiao, Enhua Wu, and Jie Hu. Kangaroo: A powerful video-language model supporting long-context video input. International Journal of Computer Vision, 134(3):114, 2026

  22. [22]

    Videomind: A chain-of-lora agent for long video reasoning.arXiv e-prints, pages arXiv–2503, 2025

    Ye Liu, Kevin Qinghong Lin, Chang Wen Chen, and Mike Zheng Shou. Videomind: A chain-of-lora agent for long video reasoning.arXiv e-prints, pages arXiv–2503, 2025

  23. [23]

    Nvila: Efficient frontier visual language models

    Zhijian Liu, Ligeng Zhu, Baifeng Shi, Zhuoyang Zhang, Yuming Lou, Shang Yang, Haocheng Xi, Shiyi Cao, Yuxian Gu, Dacheng Li, et al. Nvila: Efficient frontier visual language models. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 4122–4134, 2025

  24. [24]

    Drvideo: Document retrieval based long video understanding

    Ziyu Ma, Chenhui Gou, Hengcan Shi, Bin Sun, Shutao Li, Hamid Rezatofighi, and Jianfei Cai. Drvideo: Document retrieval based long video understanding. InProceedings of the Computer Vision and Pattern Recognition Conference, pages 18936–18946, 2025

  25. [25]

    Video-chatgpt: Towards detailed video understanding via large vision and language models

    Muhammad Maaz, Hanoona Rasheed, Salman Khan, and Fahad Khan. Video-chatgpt: Towards detailed video understanding via large vision and language models. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 12585–12602, 2024

  26. [26]

    Video-bench: A comprehensive benchmark and toolkit for evaluating video-based large language models.Computational Visual Media, 2025

    Munan Ning, Bin Zhu, Yujia Xie, Bin Lin, Jiaxi Cui, Lu Yuan, Dongdong Chen, and Li Yuan. Video-bench: A comprehensive benchmark and toolkit for evaluating video-based large language models.Computational Visual Media, 2025

  27. [27]

    GPT-4V(ision) System Card. https://openai.com/index/gpt-4v-system-card/, September 2023

    OpenAI. GPT-4V(ision) System Card. https://openai.com/index/gpt-4v-system-card/, September 2023. Accessed: 2025-09-19

  28. [28]

    Qwen3.5: Towards native multimodal agents, February 2026

    Qwen Team. Qwen3.5: Towards native multimodal agents, February 2026. URL https://qwen.ai/ blog?id=qwen3.5

  29. [29]

    Agentic very long video understanding.arXiv preprint arXiv:2601.18157, 2026

    Aniket Rege, Arka Sadhu, Yuliang Li, Kejie Li, Ramya Korlakai Vinayak, Yuning Chai, Yong Jae Lee, and Hyo Jin Kim. Agentic very long video understanding.arXiv preprint arXiv:2601.18157, 2026

  30. [30]

    Toolformer: Language models can teach themselves to use tools.Advances in neural information processing systems, 36:68539–68551, 2023

    Timo Schick, Jane Dwivedi-Yu, Roberto Dessì, Roberta Raileanu, Maria Lomeli, Eric Hambro, Luke Zettlemoyer, Nicola Cancedda, and Thomas Scialom. Toolformer: Language models can teach themselves to use tools.Advances in neural information processing systems, 36:68539–68551, 2023

  31. [31]

    DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models

    Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, YK Li, Yang Wu, et al. Deepseekmath: Pushing the limits of mathematical reasoning in open language models.arXiv preprint arXiv:2402.03300, 2024

  32. [32]

    Hugginggpt: Solving ai tasks with chatgpt and its friends in hugging face.Advances in Neural Information Processing Systems, 36:38154–38180, 2023

    Yongliang Shen, Kaitao Song, Xu Tan, Dongsheng Li, Weiming Lu, and Yueting Zhuang. Hugginggpt: Solving ai tasks with chatgpt and its friends in hugging face. Advances in Neural Information Processing Systems, 36:38154–38180, 2023

  33. [33]

    OpenAI GPT-5 System Card

    Aaditya Singh, Adam Fry, Adam Perelman, Adam Tart, Adi Ganesh, Ahmed El-Kishky, Aidan McLaughlin, Aiden Low, AJ Ostrow, Akhila Ananthram, et al. Openai gpt-5 system card. arXiv preprint arXiv:2601.03267, 2025

  34. [34]

    Moviechat: From dense token to sparse memory for long video understanding

    Enxin Song, Wenhao Chai, Guanhong Wang, Yucheng Zhang, Haoyang Zhou, Feiyang Wu, Haozhe Chi, Xun Guo, Tian Ye, Yanting Zhang, et al. Moviechat: From dense token to sparse memory for long video understanding. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 18221–18232, 2024

  35. [35]

    Vipergpt: Visual inference via python execution for reasoning

    Dídac Surís, Sachit Menon, and Carl Vondrick. Vipergpt: Visual inference via python execution for reasoning. In Proceedings of the IEEE/CVF international conference on computer vision, pages 11888–11898, 2023

  36. [36]

    Ego-r1: Chain-of-tool-thought for ultra-long egocentric video reasoning.arXiv preprint arXiv:2506.13654, 2025

    Shulin Tian, Ruiqi Wang, Hongming Guo, Penghao Wu, Yuhao Dong, Xiuying Wang, Jingkang Yang, Hao Zhang, Hongyuan Zhu, and Ziwei Liu. Ego-r1: Chain-of-tool-thought for ultra-long egocentric video reasoning.arXiv preprint arXiv:2506.13654, 2025

  37. [37]

    InternVL3.5: Advancing Open-Source Multimodal Models in Versatility, Reasoning, and Efficiency

    Weiyun Wang, Zhangwei Gao, Lixin Gu, Hengjun Pu, Long Cui, Xingguang Wei, Zhaoyang Liu, Linglin Jing, Shenglong Ye, Jie Shao, et al. Internvl3.5: Advancing open-source multimodal models in versatility, reasoning, and efficiency.arXiv preprint arXiv:2508.18265, 2025

  38. [38]

    Cfvbench: A comprehensive video benchmark for fine-grained multimodal retrieval-augmented generation

    Kaiwen Wei, Xiao Liu, Jie Zhang, Zijian Wang, Ruida Liu, Yuming Yang, Xin Xiao, Xiao Sun, Haoyang Zeng, Changzai Pan, et al. Cfvbench: A comprehensive video benchmark for fine-grained multimodal retrieval-augmented generation. InProceedings of the ACM Web Conference 2026, pages 2501–2512, 2026

  39. [39]

    Gap: Graph- based agent planning with parallel tool use and reinforcement learning.arXiv preprint arXiv:2510.25320, 2025

    Jiaqi Wu, Qinlao Zhao, Zefeng Chen, Kai Qin, Yifei Zhao, Xueqian Wang, and Yuhang Yao. Gap: Graph- based agent planning with parallel tool use and reinforcement learning.arXiv preprint arXiv:2510.25320, 2025

  40. [40]

    Next-qa: Next phase of question-answering to explaining temporal actions

    Junbin Xiao, Xindi Shang, Angela Yao, and Tat-Seng Chua. Next-qa: Next phase of question-answering to explaining temporal actions. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 9777–9786, 2021

  41. [41]

    MM-REACT: Prompting ChatGPT for Multimodal Reasoning and Action

    Zhengyuan Yang, Linjie Li, Jianfeng Wang, Kevin Lin, Ehsan Azarnasab, Faisal Ahmed, Zicheng Liu, Ce Liu, Michael Zeng, and Lijuan Wang. Mm-react: Prompting chatgpt for multimodal reasoning and action.arXiv preprint arXiv:2303.11381, 2023

  42. [42]

    Doraemongpt: Toward understanding dynamic scenes with large language models (exemplified as a video agent). arXiv preprint arXiv:2401.08392, 2024

    Zongxin Yang, Guikun Chen, Xiaodi Li, Wenguan Wang, and Yi Yang. Doraemongpt: Toward understanding dynamic scenes with large language models (exemplified as a video agent). arXiv preprint arXiv:2401.08392, 2024

  43. [43]

    React: Synergizing reasoning and acting in language models

    Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran, Karthik R Narasimhan, and Yuan Cao. React: Synergizing reasoning and acting in language models. InThe eleventh international conference on learning representations, 2022

  44. [44]

    Videoarm: Agentic reasoning over hierarchical memory for long-form video understanding.arXiv preprint arXiv:2512.12360, 2025

    Yufei Yin, Qianke Meng, Minghao Chen, Jiajun Ding, Zhenwei Shao, and Zhou Yu. Videoarm: Agentic reasoning over hierarchical memory for long-form video understanding.arXiv preprint arXiv:2512.12360, 2025

  45. [45]

    Recode: Unify plan and action for universal granularity control.arXiv preprint arXiv:2510.23564, 2025

    Zhaoyang Yu, Jiayi Zhang, Huixue Su, Yufan Zhao, Yifan Wu, Mingyi Deng, Jinyu Xiang, Yizhang Lin, Lingxiao Tang, Yuyu Luo, et al. Recode: Unify plan and action for universal granularity control.arXiv preprint arXiv:2510.23564, 2025

  46. [46]

    Videoexplorer: Think with videos for agentic long-video understanding

    Huaying Yuan, Zheng Liu, Junjie Zhou, Hongjin Qian, Yan Shu, Nicu Sebe, Ji-Rong Wen, and Zhicheng Dou. Videoexplorer: Think with videos for agentic long-video understanding. arXiv preprint arXiv:2506.10821, 2025

  47. [47]

    Thinking with videos: Multimodal tool-augmented reinforcement learning for long video reasoning

    Haoji Zhang, Xin Gu, Jiawen Li, Chixiang Ma, Sule Bai, Chubin Zhang, Bowen Zhang, Zhichao Zhou, Dongliang He, and Yansong Tang. Thinking with videos: Multimodal tool-augmented reinforcement learning for long video reasoning. arXiv preprint arXiv:2508.04416, 2025

  48. [48]

    Flash-vstream: Efficient real-time understanding for long video streams

    Haoji Zhang, Yiqin Wang, Yansong Tang, Yong Liu, Jiashi Feng, and Xiaojie Jin. Flash-vstream: Efficient real-time understanding for long video streams. In Proceedings of the IEEE/CVF international conference on computer vision, pages 21059–21069, 2025

  49. [49]

    Long Context Transfer from Language to Vision

    Peiyuan Zhang, Kaichen Zhang, Bo Li, Guangtao Zeng, Jingkang Yang, Yuanhan Zhang, Ziyue Wang, Haoran Tan, Chunyuan Li, and Ziwei Liu. Long context transfer from language to vision. arXiv preprint arXiv:2406.16852, 2024

  50. [50]

    Deep video discovery: Agentic search with tool use for long-form video understanding

    Xiaoyi Zhang, Zhaoyang Jia, Zongyu Guo, Jiahao Li, Bin Li, Houqiang Li, and Yan Lu. Deep video discovery: Agentic search with tool use for long-form video understanding. arXiv preprint arXiv:2505.18079, 2025

  51. [51]

    Qwen3 Embedding: Advancing Text Embedding and Reranking Through Foundation Models

    Yanzhao Zhang, Mingxin Li, Dingkun Long, Xin Zhang, Huan Lin, Baosong Yang, Pengjun Xie, An Yang, Dayiheng Liu, Junyang Lin, Fei Huang, and Jingren Zhou. Qwen3 embedding: Advancing text embedding and reranking through foundation models. arXiv preprint arXiv:2506.05176, 2025

  52. [52]

    Mlvu: Benchmarking multi-task long video understanding

    Junjie Zhou, Yan Shu, Bo Zhao, Boya Wu, Zhengyang Liang, Shitao Xiao, Minghao Qin, Xi Yang, Yongping Xiong, Bo Zhang, et al. Mlvu: Benchmarking multi-task long video understanding. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 13691–13701, 2025

A Additional Details of MVTL

A.1 Tool Counting and Exposure Prot...

Planner prompt guidelines (appendix excerpt):

- **Thinking Process**: You may use `<think>...</think>` tags for internal reasoning
- **Final JSON**: Your response MUST conclude with a single, valid JSON object
- **Variable Referencing**: Use "$" only for runtime result pointers from prior tool outputs (e.g., "$last_inspect_frame_result", "$frame_evidence_from_Inspect_Frame")
- **Tool Names and Capabilities**: You may assume arbitrary specialized tools exist when planning, name a plausible specialized tool, and describe the capability. Do not rely on a hard-coded planner tool inventory in the prompt.

# Required JSON Schema

{ "Thought": "Concise reasoning (max 2 short sentences)", "Plan": "Short step plan", "Evidence": [ { "label...

Continue with tools:

<think> ... </think> { "Thought": "...", "Plan": "...", "Actions": [{"tool": "ACTUAL_TOOL_NAME_HERE", "params": {"arg": "value"}}] }

Finish the task:

<think> ... </think> { "Thought": "...", "Plan": "...", "Actions": [], "Finish": { "chain_complete": true, "completion_basis": "...", "answer": "..." } }

Never output non-empty `Actions` together with a valid top-level `Finish`.

# Data Access Capabilities

Tools can access two tiers of video information. Specify your needs in the tool descr...

- **Cross-tool verification**: By default, validate each piece of evidence and the final conclusion with results from at least two different tools, especially for visual evidence, ambiguous evidence, fine-grained actions, future-event prediction, or when candidate answers are easily confused. If one tool already provides clear, direct, and decisive evidence with little ambiguity, you may finish without extra verification
- **Web_Search is not local-video evidence**: The answer must come primarily from the local video; do not arbitrarily use Web_Search to replace retrieval or frame checking, etc.
- **Exact-value tasks need Evidence tracking**: When a query asks for exact values, totals, counts, prices, or scenario comparisons from the video, keep an `Evidence` array updated with one item per required fact/scenario
- **No fabricated tool history**: Never claim that search results were empty, a tool already failed, or prior attempts happened unless that information explicitly appears in the actual Observation history of this run
- **Image-timestamp alignment is explicit**: For both initial grounding collages and tool-returned collages, sub-frames are arranged in chronological order clockwise from the top-left: top-left, top-right, bottom-right, bottom-left. The timestamps provided alongside that collage correspond to those sub-frames in the same order
- **Attached images are already direct evidence**: If an Observation includes attached images from a visual tool, those images are already visible evidence for you. Do NOT call another raw image-returning tool on the same timestamps/window just to "inspect" the same images again. Re-probe only if you materially change the time window or apply `Crop`/`Zoom_...
- **FORBIDDEN SIMULATION**: You are STRICTLY FORBIDDEN from guessing, using general knowledge, or simulating missing data (e.g., assuming a 2% payment rate)
- **MISSING INFO = PROBE TRIGGER**: If a value is missing from the initial summary, it is a MANDATE to search the raw video. Switch to retrieval, verification, or direct-evidence probing instead of guessing.
- **EVIDENCE-ONLY PARAMETERS**: Every parameter passed to a logic tool MUST be explicitly found first. Never call a tool with "assumed" numbers.

# Video-Centric Probe Hierarchy

- **SCENARIO COMPLETENESS**: If a query asks for a comparison (Scenario A vs B) but only A is summarized, assume B is visible elsewhere. Use tools to probe the remainder of the timeline
- **VISUAL IS TRUTH**: Visual frames contain data that summaries often miss. If textual info is vague, your first priority is to define a visual analysis tool to "see" the exact data
- **LOGICAL RESTRAINT**: Logic tools (like `Python_Executor`) are ONLY for processing data that has already been retrieved from the evidence
- **TEMPORAL CONTINUITY**: After you find one promising segment, continue probing nearby windows before restarting a broad search
- **APPROXIMATE TEXT IS NOT FINAL**: If transcript/OCR says "around", "about", "approximately", "~", or similar, treat it as insufficient for final answering and re-check the exact screen region
- **CROSS-MODAL CHAIN IS PLANNER-DECLARED**: For exact-value video tasks, do not answer from a single modality. Use a chain like text->visual->answer or visual->text->answer. Only set `Finish.chain_complete=true` after you judge that the required chain is complete
- **PYTHON IS NOT A VISION FALLBACK**: Never use `Python_Executor` to infer what local video/images show from filenames or paths. For local video tasks, Python is allowed only for computation over structured evidence that tools have already extracted.

# Termination Protocol

- **MANDATORY CITATION**: Your final "answer" must explicitly cite the timestamp for every value used. If question contains options, ...
- **Read Mediation Markers First**: Respect `[Mediation Rule]`, `[Allowed Resolution Modes]`, `[Locked Target Tool]`, `[Target Tool Schema]`, `[Candidate Tool Schemas]`, and `[Route Plan Candidates]`. If the runtime provides `[Composite Intent Guidance]` with `recommended_resolution="parallel_decomposition"`, treat that as a strong hint to prefer L4 ov...
- If mediation is L2, keep the target tool fixed by default. Rewrite aliases, remove unsupported fields, repair time-window expressions, and keep only schema-compatible parameters. Treat tool-search schemas as authoritative parameter requirements. If the schema says `t_start`, `t_end`, and `density`, do not emit old aliases like `time_range_start`, `time_r...
- **Escalate Only When Necessary**: In locked-target mode, you may switch to a replacement tool (L3) or parallel decomposition (L4) only if you judge the target tool is semantically incompatible or cannot complete the task alone
- If mediation is `L3 or L4`, use route metadata and candidate tools to decide between: one executable replacement tool, or multiple child tools returned together in `Actions`. When the request simultaneously asks about multiple evidence dimensions such as actions, style/category, subtitles or audio cues, and scene distinctions/comparisons, prefer multi...
- **Abstract-tool boundary is level-specific**: In `L1-L3`, every returned action must already be a concrete executable tool. Do not output placeholder or abstract tool names. Only in `L4` may you selectively return abstract child tools in `Actions`, when one tool is insufficient and the task must be decomposed for resolution
- **L4 Semantics**: For each child tool produced at `L4`, the runtime will run Tool_Search plus parameter validation independently. Child tools that become valid concrete tools will execute directly. Child tools that still cannot be executed (may be abstract tools or invalid) may be decomposed again then executed. Requests may contain nested tools, where one...
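The prompt's output contract is mechanically checkable: responses may open with a `<think>...</think>` preamble but must end in a single JSON object, `$`-prefixed strings are reserved for runtime pointers to prior tool outputs, and a non-empty `Actions` array must never co-occur with a valid top-level `Finish`. A minimal validator sketch of those three rules; the function names are illustrative, not from the paper:

```python
import json
import re

def parse_planner_reply(raw: str) -> dict:
    """Strip an optional <think>...</think> preamble and parse the JSON body."""
    body = re.sub(r"<think>.*?</think>", "", raw, flags=re.DOTALL).strip()
    return json.loads(body)

def validate_planner_reply(reply: dict) -> list[str]:
    """Return violations of the planner output contract (empty list = valid)."""
    errors = []
    if "Thought" not in reply or "Plan" not in reply:
        errors.append("missing required 'Thought'/'Plan' fields")
    actions = reply.get("Actions", [])
    finish = reply.get("Finish")
    # Non-empty Actions and a valid top-level Finish are mutually exclusive.
    if actions and finish and finish.get("chain_complete"):
        errors.append("non-empty 'Actions' together with a valid 'Finish'")
    for act in actions:
        # Every emitted action needs a concrete (non-placeholder) tool name.
        if not isinstance(act.get("tool"), str) or not act["tool"]:
            errors.append("action without a concrete tool name")
        for val in act.get("params", {}).values():
            # "$"-prefixed values must actually name a prior tool result.
            if isinstance(val, str) and val == "$":
                errors.append("bare '$' is not a valid runtime result pointer")
    return errors

raw = ('<think>need one probe</think> '
      '{"Thought": "Check t=12s.", "Plan": "inspect frame", '
      '"Actions": [{"tool": "Inspect_Frame", '
      '"params": {"timestamp": "$last_inspect_frame_result"}}]}')
assert validate_planner_reply(parse_planner_reply(raw)) == []
```

A "finish" reply would pass the same check only with `"Actions": []`, matching the rule that execution and termination never mix in one step.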
