pith. machine review for the scientific record.

arxiv: 2605.13228 · v1 · submitted 2026-05-13 · 💻 cs.CV · cs.AI


ReTool-Video: Recursive Tool-Using Video Agents with Meta-Augmented Tool Grounding


Pith reviewed 2026-05-14 20:04 UTC · model grok-4.3

classification 💻 cs.CV cs.AI
keywords video understanding · tool-augmented agents · recursive reasoning · meta tools · multimodal agents · temporal reasoning · video question answering

The pith

ReTool-Video recursively grounds abstract video intents into executable tool chains using a library of 134 meta-augmented tools.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Video agents need to seek evidence across time and modalities for complex reasoning, but prior systems are limited by coarse tools and flat mappings that force every intent into primitive calls. ReTool-Video introduces a MetaAug-Video Tool Library with 26 base tools for multimodal processing and 108 meta tools for filtering, aggregation, reranking, and formatting, enabling dual access to structured and raw evidence. A recursive resolver matches direct intents to tools or delegates unmatched ones for parameter repair, substitution, or decomposition, translating abstract steps like temporal merging into concrete operations at runtime. Experiments on MVBench, MLVU, and Video-MME show consistent outperformance over baselines, with analysis linking gains to the recursive process and fine-grained meta tools. A sympathetic reader cares because the approach supports open-ended video queries without requiring exhaustive pre-definition of every possible action.

Core claim

ReTool-Video proposes a recursive tool-using method that grounds high-level video intents into executable tool chains, where matched actions execute directly and unmatched intents are delegated to a resolver for parameter repair, tool substitution, or decomposition. This is enabled by the MetaAug-Video Tool Library (MVTL) containing 134 registered tools—26 base tools for general multimodal signal processing and 108 meta tools for intermediate operations—supporting diverse scenarios through dual-level access to structured video information and raw modal evidence.
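The match-or-delegate loop described in this claim can be sketched as a small recursion. The tool names, the decomposition rule, and the depth limit below are illustrative assumptions, not the paper's actual interfaces; the real resolver also performs parameter repair and tool substitution, which this sketch models only as decomposition.

```python
# Hypothetical sketch of recursive tool grounding in the spirit of ReTool-Video.
# REGISTRY stands in for the MVTL's registered tools; decompose() stands in for
# the resolver. All names and rules here are assumptions for illustration.
from dataclasses import dataclass, field

@dataclass
class Intent:
    name: str
    args: dict = field(default_factory=dict)

# Toy registry standing in for the 134 registered tools.
REGISTRY = {
    "sample_frames": lambda args: f"frames({args.get('fps', 1)})",
    "filter_by_score": lambda args: f"filtered({args.get('threshold', 0.5)})",
}

def ground(intent: Intent, depth: int = 0, max_depth: int = 3) -> list[str]:
    """Recursively translate a high-level intent into an executable tool chain."""
    if depth > max_depth:
        return []  # give up once the recursion budget is exhausted
    tool = REGISTRY.get(intent.name)
    if tool is not None:
        return [tool(intent.args)]  # matched action: execute directly
    # Unmatched intent: delegate to the resolver, modeled here as
    # decomposition into sub-intents and recursive grounding of each.
    chain: list[str] = []
    for sub in decompose(intent):
        chain.extend(ground(sub, depth + 1, max_depth))
    return chain

def decompose(intent: Intent) -> list[Intent]:
    # Toy rule: an abstract "temporal_merge" becomes sampling + filtering.
    if intent.name == "temporal_merge":
        return [Intent("sample_frames", {"fps": 2}),
                Intent("filter_by_score", {"threshold": 0.7})]
    return []

print(ground(Intent("temporal_merge")))  # ['frames(2)', 'filtered(0.7)']
```

The key property the sketch preserves is that matched actions short-circuit to direct execution, while only unmatched intents pay the cost of recursive resolution.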

What carries the argument

The recursive grounding mechanism in ReTool-Video, which delegates unmatched high-level intents to a resolver for progressive translation into concrete multimodal tool operations, supported by the MVTL library of 134 tools.

If this is right

  • Abstract actions such as temporal merging, cross-modal verification, and repeated-event aggregation can be translated into concrete multimodal operations at runtime.
  • Recursive decomposition reduces the need for rigid pre-mapping of intents and improves handling of compositional video reasoning.
  • Fine-grained meta tools for filtering, aggregation, and reranking increase stability and effectiveness on tasks requiring temporal and multimodal evidence.
  • Performance gains hold across MVBench, MLVU, and Video-MME without subtitles when both recursion and meta tools are used.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The resolver-based decomposition could be tested on queries engineered to trigger repeated failures, revealing the practical depth limit of recursion.
  • Similar recursive tool libraries might transfer to other sequential modalities such as audio streams or long document navigation.
  • Extending the 134-tool set with domain-specific additions could support specialized video domains without retraining the core agent.
  • Measuring resolver error rates directly on held-out intents would quantify how much of the reported gains depend on reliable decomposition.

Load-bearing premise

High-level video intents can be reliably matched or decomposed by the resolver into the 134 registered tools without introducing errors, excessive recursion, or loss of reasoning fidelity.

What would settle it

A controlled test set of video queries with deliberately ambiguous or complex intents where the resolver produces frequent mismatches or excessive recursion, resulting in accuracy on MVBench no higher than non-recursive baselines.

Figures

Figures reproduced from arXiv: 2605.13228 by Changjian Wang, Guohui Xiang, Jiang Zhong, Junnan Zhu, KaiWen Wei, Nayu Liu, Rongzhen Li, Ruirui Chen, Xiao Liu.

Figure 1. Motivation of ReTool-Video: conventional video tool-using agents force abstract actions (e.g., temporal merging or cross-modal verification) into primitive calls, while ReTool-Video recursively grounds them into executable multimodal tool chains.
Figure 2. Functional categories of base tools and meta tools in MVTL.
Figure 3. Overall framework of ReTool-Video, where the planner selects primitive or abstract actions, the MetaAug-Video tool library provides base and meta tools, and the resolver recursively grounds abstract actions into executable tool chains.
Figure 4. An example of recursive tool grounding in ReTool-Video, where abstract actions are recursively resolved into executable tool chains.
Figure 6. Relationship between final accuracy and tool-use behavior, including tool-call count.
Figure 5. Distribution of runtime tool calls in ReTool-Video, showing the relative invocation frequency of each base/meta tool category during model inference.
Figure 7. Case study experiment on the MLVU benchmark.
Figure 8. Dual-level evidence access in ReTool-Video: the planner can use both preprocessed structured evidence and raw visual or video evidence returned by tools.
original abstract

Video understanding requires active evidence seeking, motivating tool-augmented video agents for temporal reasoning, cross-modal understanding, and complex question answering. Existing video agents have improved video reasoning with retrieval, memory, frame inspection, and verifier tools, but they still face two limitations: (1) a coarse tool space that lacks fine-grained operations for compositional reasoning; and (2) a flat action space that forces high-level video intents into primitive executable tool calls. In this paper, we address these challenges with two complementary designs. First, we construct a MetaAug-Video Tool Library (MVTL), an extensible tool library with 134 registered tools, including 26 base tools for general multimodal signal processing and 108 meta tools for filtering, aggregation, reranking, formatting, and other intermediate-result operations. MVTL supports dual-level access to both structured video information and raw modal evidence, enabling diverse video reasoning scenarios. Second, we propose ReTool-Video, a recursive tool-using method that grounds high-level video intents into executable tool chains. In ReTool-Video, matched actions are executed directly, while unmatched intents are delegated to a resolver for parameter repair, tool substitution, or decomposition. This allows abstract actions such as temporal merging, cross-modal verification, or repeated-event aggregation to be progressively translated into concrete multimodal operations at runtime. Experiments on MVBench, MLVU, and Video-MME w/o sub. show that ReTool-Video consistently outperforms strong baselines. Further analysis demonstrates that recursive grounding and fine-grained meta tools improve the stability and effectiveness of complex video understanding.
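The base/meta split and dual-level access the abstract describes can be sketched as a minimal tool registry. The decorator, category names, and example tools below are assumptions for illustration; the paper does not specify MVTL's API.

```python
# Hypothetical sketch of a base/meta tool registry in the spirit of MVTL.
# "base" tools return raw modal evidence; "meta" tools operate on
# intermediate results (filtering, aggregation, reranking, formatting).
from typing import Callable

BASE, META = "base", "meta"
TOOLS: dict[str, tuple[str, Callable]] = {}

def register(name: str, kind: str):
    """Register a tool under a category so the agent can look it up by name."""
    def wrap(fn):
        TOOLS[name] = (kind, fn)
        return fn
    return wrap

@register("extract_audio", BASE)  # base tool: raw modal evidence
def extract_audio(video: str) -> str:
    return f"audio_track({video})"

@register("rerank_by_relevance", META)  # meta tool: intermediate-result op
def rerank(items: list, scores: list) -> list:
    # Highest-scoring items first.
    return [x for _, x in sorted(zip(scores, items), reverse=True)]

# Dual-level access: the same registry serves both raw evidence requests
# and structured post-processing, discoverable by category.
base_tools = [n for n, (k, _) in TOOLS.items() if k == BASE]
meta_tools = [n for n, (k, _) in TOOLS.items() if k == META]
```

A registry like this makes the library extensible in the way the abstract claims: adding a tool is one registration, with no change to the planner or resolver.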

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper proposes ReTool-Video, a recursive tool-using agent for video understanding that grounds high-level intents into executable chains via a resolver. It introduces the MetaAug-Video Tool Library (MVTL) with 134 tools (26 base multimodal processing tools plus 108 meta tools for filtering, aggregation, reranking, and formatting). The central claim is that this dual-level tool access plus recursive delegation for unmatched intents (repair, substitution, or decomposition) yields consistent gains over baselines on MVBench, MLVU, and Video-MME (without subtitles), while improving stability for complex temporal and cross-modal reasoning.

Significance. If the empirical results and ablation analyses hold, the work offers a concrete mechanism for scaling tool-augmented video agents beyond flat action spaces. The extensible MVTL and resolver-based recursion could provide a reusable substrate for compositional video reasoning, with potential impact on downstream tasks requiring evidence-seeking over long videos.

major comments (3)
  1. [Abstract / Experiments] The claim that ReTool-Video 'consistently outperforms strong baselines' is stated without reported numerical scores, baseline names, or statistical significance tests. This leaves the magnitude and reliability of the gains unassessable and makes it impossible to determine whether improvements exceed what would be expected from simply increasing the tool count.
  2. [Abstract / §4] The resolver is described as handling unmatched intents via repair, substitution, or decomposition, yet no resolver accuracy metric, recursion-depth statistics, or error-propagation analysis is referenced. Without these, gains cannot be confidently attributed to recursive grounding rather than the sheer size of the 134-tool library.
  3. [Further analysis] The statement that 'recursive grounding and fine-grained meta tools improve stability' requires supporting ablations (e.g., ReTool-Video vs. a non-recursive baseline with an identical MVTL, or a meta-tool ablation). Absent such controls, the causal contribution of the two proposed designs remains unverified.
minor comments (2)
  1. [Abstract] The abstract mentions 'Video-MME w/o sub.' without defining the abbreviation or the exact evaluation protocol on first use.
  2. [§3 / §4] Notation for the resolver and MVTL access modes is introduced without a compact diagram or pseudocode in the method overview, making the dual-level access hard to visualize.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the thoughtful and constructive comments on our work. We address each major comment below and will revise the manuscript to incorporate the suggested improvements for clarity and rigor.

point-by-point responses
  1. Referee: [Abstract / Experiments] The claim that ReTool-Video 'consistently outperforms strong baselines' is stated without reported numerical scores, baseline names, or statistical significance tests, leaving the magnitude and reliability of the gains unassessable and making it impossible to determine whether improvements exceed what would be expected from simply increasing the tool count.

    Authors: We acknowledge that the abstract provides a high-level summary without specific numerical values or baseline names. The experiments section does include detailed comparisons, but to make the claims more assessable, we will revise the abstract to include key performance metrics (e.g., accuracy improvements on each benchmark) and explicitly name the baselines. Additionally, we will add statistical significance tests in the experiments section and reference them in the abstract to confirm that the gains are meaningful beyond the increased tool count. revision: yes

  2. Referee: [Abstract / §4] The resolver is described as handling unmatched intents via repair, substitution, or decomposition, yet no resolver accuracy metric, recursion-depth statistics, or error-propagation analysis is referenced. Without these, gains cannot be confidently attributed to recursive grounding rather than the sheer size of the 134-tool library.

    Authors: We agree that quantitative evaluation of the resolver would help isolate the contribution of recursive grounding. In the revised manuscript, we will add metrics for resolver accuracy (success rates for repair, substitution, and decomposition), statistics on recursion depth across queries, and an analysis of error propagation. We will also include a comparison to a non-recursive variant using the full MVTL to demonstrate that the recursive mechanism provides benefits beyond the tool library size. revision: yes

  3. Referee: [Further analysis] The statement that 'recursive grounding and fine-grained meta tools improve stability' requires supporting ablations (e.g., ReTool-Video vs. a non-recursive baseline with an identical MVTL, or a meta-tool ablation); absent such controls, the causal contribution of the two proposed designs remains unverified.

    Authors: We recognize the need for explicit ablations to support the claims about stability improvements. We will expand the further analysis section with new ablation studies: one comparing ReTool-Video to a non-recursive baseline with the same 134-tool MVTL, and another ablating the meta-tools to measure their specific impact. These will provide direct evidence for the causal contributions of recursive grounding and the meta-augmented tools to stability and performance. revision: yes

Circularity Check

0 steps flagged

No circularity: design choices evaluated on external benchmarks

full rationale

The paper constructs MVTL (134 tools) and ReTool-Video's recursive resolver by design, then measures performance on independent external benchmarks (MVBench, MLVU, Video-MME). No equations, fitted parameters, or predictions reduce to quantities derived from the same data used to define the tools or resolver. No self-citation chains or uniqueness theorems are invoked to justify core claims. The derivation chain is self-contained against external evaluation.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 2 invented entities

The central claim rests on the newly introduced MVTL library and recursive resolver being effective for video reasoning; these components are defined within the paper and evaluated only on the reported benchmarks.

axioms (1)
  • domain assumption Standard video understanding benchmarks (MVBench, MLVU, Video-MME) provide valid measures of complex temporal and cross-modal reasoning.
    All reported gains are measured against these datasets.
invented entities (2)
  • MetaAug-Video Tool Library (MVTL) no independent evidence
    purpose: Extensible collection of 26 base and 108 meta tools for fine-grained multimodal video operations
    Newly constructed library introduced in this work.
  • ReTool-Video recursive resolver no independent evidence
    purpose: Mechanism that matches, repairs, substitutes, or decomposes high-level intents into executable tool chains
    Core algorithmic contribution proposed here.

pith-pipeline@v0.9.0 · 5614 in / 1442 out tokens · 58900 ms · 2026-05-14T20:04:07.533132+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

89 extracted references · 80 canonical work pages · 11 internal anchors

  1. [1]

    Model System Cards

    Anthropic. Claude Sonnet 4.5 System Card. https://www.anthropic.com/claude-sonnet-4-5-system-card, September 2025. Listed on Anthropic “Model System Cards” page

  2. [2]

    Sharegpt4video: Improving video understanding and generation with better captions.Advances in Neural Information Processing Systems, 37:19472–19495, 2024

    Lin Chen, Xilin Wei, Jinsong Li, Xiaoyi Dong, Pan Zhang, Yuhang Zang, Zehui Chen, Haodong Duan, Bin Lin, Zhenyu Tang, et al. Sharegpt4video: Improving video understanding and generation with better captions.Advances in Neural Information Processing Systems, 37:19472–19495, 2024

  3. [3]

    VideoLLaMA 2: Advancing Spatial-Temporal Modeling and Audio Understanding in Video-LLMs

    Zesen Cheng, Sicong Leng, Hang Zhang, Yifei Xin, Xin Li, Guanzheng Chen, Yongxin Zhu, Wenqi Zhang, Ziyang Luo, Deli Zhao, et al. Videollama 2: Advancing spatial-temporal modeling and audio understanding in video-llms.arXiv preprint arXiv:2406.07476, 2024

  4. [4]

    Video question answering with procedural programs

    Rohan Choudhury, Koichiro Niinuma, Kris M Kitani, and László A Jeni. Video question answering with procedural programs. InEuropean Conference on Computer Vision, pages 315–332. Springer, 2024

  5. [5]

    Gemini 2.5: Pushing the Frontier with Advanced Reasoning, Multimodality, Long Context, and Next Generation Agentic Capabilities

    Gheorghe Comanici, Eric Bieber, Mike Schaekermann, Ice Pasupat, Noveen Sachdeva, Inderjit Dhillon, Marcel Blistein, Ori Ram, Dan Zhang, Evan Rosen, et al. Gemini 2.5: Pushing the frontier with advanced reasoning, multimodality, long context, and next generation agentic capabilities. arXiv preprint arXiv:2507.06261, 2025

  6. [6]

    Tool-augmented spatiotemporal reasoning for streamlining video question answering task.arXiv preprint arXiv:2512.10359, 2025

    Sunqi Fan, Jiashuo Cui, Meng-Hao Guo, and Shuojin Yang. Tool-augmented spatiotemporal reasoning for streamlining video question answering task.arXiv preprint arXiv:2512.10359, 2025

  7. [7]

    Videoagent: A memory- augmented multimodal agent for video understanding

    Yue Fan, Xiaojian Ma, Rujie Wu, Yuntao Du, Jiaqi Li, Zhi Gao, and Qing Li. Videoagent: A memory- augmented multimodal agent for video understanding. InEuropean Conference on Computer Vision, pages 75–92. Springer, 2024

  8. [8]

    Video-mme: The first-ever comprehensive evaluation benchmark of multi-modal llms in video analysis

    Chaoyou Fu, Yuhan Dai, Yongdong Luo, Lei Li, Shuhuai Ren, Renrui Zhang, Zihan Wang, Chenyu Zhou, Yunhang Shen, Mengdan Zhang, et al. Video-mme: The first-ever comprehensive evaluation benchmark of multi-modal llms in video analysis. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 24108–24118, 2025

  9. [9]

    Framemind: Frame-interleaved video reasoning via reinforcement learning.arXiv preprint arXiv:2509.24008, 2025

    Haonan Ge, Yiwei Wang, Kai-Wei Chang, Hang Wu, and Yujun Cai. Framemind: Frame-interleaved video reasoning via reinforcement learning.arXiv preprint arXiv:2509.24008, 2025

  10. [10]

    GPT-4o System Card

    Aaron Hurst, Adam Lerer, Adam P Goucher, Adam Perelman, Aditya Ramesh, Aidan Clark, AJ Ostrow, Akila Welihinda, Alan Hayes, Alec Radford, et al. Gpt-4o system card.arXiv preprint arXiv:2410.21276, 2024

  11. [11]

    Dense-captioning events in videos

    Ranjay Krishna, Kenji Hata, Frederic Ren, Li Fei-Fei, and Juan Carlos Niebles. Dense-captioning events in videos. InProceedings of the IEEE international conference on computer vision, pages 706–715, 2017

  12. [12]

    Tvqa: Localized, compositional video question answering

    Jie Lei, Licheng Yu, Mohit Bansal, and Tamara Berg. Tvqa: Localized, compositional video question answering. InProceedings of the 2018 conference on empirical methods in natural language processing, pages 1369–1379, 2018

  13. [13]

    Agent-oriented planning in multi-agent systems

    Ao Li, Yuexiang Xie, Songze Li, Fugee Tsung, Bolin Ding, and Yaliang Li. Agent-oriented planning in multi-agent systems.arXiv preprint arXiv:2410.02189, 2024

  14. [14]

    VideoThinker: Building Agentic VideoLLMs with LLM-Guided Tool Reasoning

    Chenglin Li, Qianglong Chen, Feng Han, Yikun Wang, Xingxi Yin, Yan Gong, Ruilin Li, Yin Zhang, and Jiaqi Wang. Videothinker: Building agentic videollms with llm-guided tool reasoning. arXiv preprint arXiv:2601.15724, 2026

  15. [15]

    Mvbench: A comprehensive multi-modal video understanding benchmark

    Kunchang Li, Yali Wang, Yinan He, Yizhuo Li, Yi Wang, Yi Liu, Zun Wang, Jilan Xu, Guo Chen, Ping Luo, et al. Mvbench: A comprehensive multi-modal video understanding benchmark. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 22195–22206, 2024

  16. [16]

    Qwen3-VL-Embedding and Qwen3-VL-Reranker: A Unified Framework for State-of-the-Art Multimodal Retrieval and Ranking

    Mingxin Li, Yanzhao Zhang, Dingkun Long, Chen Keqin, Sibo Song, Shuai Bai, Zhibo Yang, Pengjun Xie, An Yang, Dayiheng Liu, Jingren Zhou, and Junyang Lin. Qwen3-vl-embedding and qwen3-vl-reranker: A unified framework for state-of-the-art multimodal retrieval and ranking.arXiv preprint arXiv:2601.04720, 2026

  17. [17]

    Llama-vid: An image is worth 2 tokens in large language models

    Yanwei Li, Chengyao Wang, and Jiaya Jia. Llama-vid: An image is worth 2 tokens in large language models. InEuropean Conference on Computer Vision, pages 323–340. Springer, 2024

  18. [18]

    Video-llava: Learning united visual representation by alignment before projection

    Bin Lin, Yang Ye, Bin Zhu, Jiaxi Cui, Munan Ning, Peng Jin, and Li Yuan. Video-llava: Learning united visual representation by alignment before projection. InProceedings of the 2024 conference on empirical methods in natural language processing, pages 5971–5984, 2024

  19. [19]

    Videoseek: Long-horizon video agent with tool-guided seeking.arXiv preprint arXiv:2603.20185, 2026

    Jingyang Lin, Jialian Wu, Jiang Liu, Ximeng Sun, Ze Wang, Xiaodong Yu, Jiebo Luo, Zicheng Liu, and Emad Barsoum. Videoseek: Long-horizon video agent with tool-guided seeking.arXiv preprint arXiv:2603.20185, 2026

  20. [20]

    Pc-agent: A hierarchical multi-agent collaboration framework for complex task automation on pc.arXiv preprint arXiv:2502.14282, 2025

    Haowei Liu, Xi Zhang, Haiyang Xu, Yuyang Wanyan, Junyang Wang, Ming Yan, Ji Zhang, Chunfeng Yuan, Changsheng Xu, Weiming Hu, et al. Pc-agent: A hierarchical multi-agent collaboration framework for complex task automation on pc.arXiv preprint arXiv:2502.14282, 2025

  21. [21]

    Kangaroo: A powerful video-language model supporting long-context video input

    Jiajun Liu, Yibing Wang, Hanghang Ma, Xiaoping Wu, Xiaoqi Ma, Xiaoming Wei, Jianbin Jiao, Enhua Wu, and Jie Hu. Kangaroo: A powerful video-language model supporting long-context video input. International Journal of Computer Vision, 134(3):114, 2026

  22. [22]

    Videomind: A chain-of-lora agent for long video reasoning.arXiv e-prints, pages arXiv–2503, 2025

    Ye Liu, Kevin Qinghong Lin, Chang Wen Chen, and Mike Zheng Shou. Videomind: A chain-of-lora agent for long video reasoning.arXiv e-prints, pages arXiv–2503, 2025

  23. [23]

    Nvila: Efficient frontier visual language models

    Zhijian Liu, Ligeng Zhu, Baifeng Shi, Zhuoyang Zhang, Yuming Lou, Shang Yang, Haocheng Xi, Shiyi Cao, Yuxian Gu, Dacheng Li, et al. Nvila: Efficient frontier visual language models. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 4122–4134, 2025

  24. [24]

    Drvideo: Document retrieval based long video understanding

    Ziyu Ma, Chenhui Gou, Hengcan Shi, Bin Sun, Shutao Li, Hamid Rezatofighi, and Jianfei Cai. Drvideo: Document retrieval based long video understanding. InProceedings of the Computer Vision and Pattern Recognition Conference, pages 18936–18946, 2025

  25. [25]

    Video-chatgpt: Towards detailed video understanding via large vision and language models

    Muhammad Maaz, Hanoona Rasheed, Salman Khan, and Fahad Khan. Video-chatgpt: Towards detailed video understanding via large vision and language models. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 12585–12602, 2024

  26. [26]

    Video-bench: A comprehensive benchmark and toolkit for evaluating video-based large language models.Computational Visual Media, 2025

    Munan Ning, Bin Zhu, Yujia Xie, Bin Lin, Jiaxi Cui, Lu Yuan, Dongdong Chen, and Li Yuan. Video-bench: A comprehensive benchmark and toolkit for evaluating video-based large language models.Computational Visual Media, 2025

  27. [27]

    GPT-4V(ision) System Card. https://openai.com/index/gpt-4v-system-card/, September 2023

    OpenAI. GPT-4V(ision) System Card. https://openai.com/index/gpt-4v-system-card/, September 2023. Accessed: 2025-09-19

  28. [28]

    Qwen3.5: Towards native multimodal agents, February 2026

    Qwen Team. Qwen3.5: Towards native multimodal agents, February 2026. URL https://qwen.ai/ blog?id=qwen3.5

  29. [29]

    Agentic very long video understanding.arXiv preprint arXiv:2601.18157, 2026

    Aniket Rege, Arka Sadhu, Yuliang Li, Kejie Li, Ramya Korlakai Vinayak, Yuning Chai, Yong Jae Lee, and Hyo Jin Kim. Agentic very long video understanding.arXiv preprint arXiv:2601.18157, 2026

  30. [30]

    Toolformer: Language models can teach themselves to use tools.Advances in neural information processing systems, 36:68539–68551, 2023

    Timo Schick, Jane Dwivedi-Yu, Roberto Dessì, Roberta Raileanu, Maria Lomeli, Eric Hambro, Luke Zettlemoyer, Nicola Cancedda, and Thomas Scialom. Toolformer: Language models can teach themselves to use tools.Advances in neural information processing systems, 36:68539–68551, 2023

  31. [31]

    DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models

    Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, YK Li, Yang Wu, et al. Deepseekmath: Pushing the limits of mathematical reasoning in open language models.arXiv preprint arXiv:2402.03300, 2024

  32. [32]

    Hugginggpt: Solving ai tasks with chatgpt and its friends in hugging face.Advances in Neural Information Processing Systems, 36:38154–38180, 2023

    Yongliang Shen, Kaitao Song, Xu Tan, Dongsheng Li, Weiming Lu, and Yueting Zhuang. Hugginggpt: Solving ai tasks with chatgpt and its friends in hugging face. Advances in Neural Information Processing Systems, 36:38154–38180, 2023

  33. [33]

    OpenAI GPT-5 System Card

    Aaditya Singh, Adam Fry, Adam Perelman, Adam Tart, Adi Ganesh, Ahmed El-Kishky, Aidan McLaughlin, Aiden Low, AJ Ostrow, Akhila Ananthram, et al. Openai gpt-5 system card. arXiv preprint arXiv:2601.03267, 2025

  34. [34]

    Moviechat: From dense token to sparse memory for long video understanding

    Enxin Song, Wenhao Chai, Guanhong Wang, Yucheng Zhang, Haoyang Zhou, Feiyang Wu, Haozhe Chi, Xun Guo, Tian Ye, Yanting Zhang, et al. Moviechat: From dense token to sparse memory for long video understanding. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 18221–18232, 2024

  35. [35]

    Vipergpt: Visual inference via python execution for reasoning

    Dídac Surís, Sachit Menon, and Carl Vondrick. Vipergpt: Visual inference via python execution for reasoning. In Proceedings of the IEEE/CVF international conference on computer vision, pages 11888–11898, 2023

  36. [36]

    Ego-r1: Chain-of-tool-thought for ultra-long egocentric video reasoning.arXiv preprint arXiv:2506.13654, 2025

    Shulin Tian, Ruiqi Wang, Hongming Guo, Penghao Wu, Yuhao Dong, Xiuying Wang, Jingkang Yang, Hao Zhang, Hongyuan Zhu, and Ziwei Liu. Ego-r1: Chain-of-tool-thought for ultra-long egocentric video reasoning.arXiv preprint arXiv:2506.13654, 2025

  37. [37]

    InternVL3.5: Advancing Open-Source Multimodal Models in Versatility, Reasoning, and Efficiency

    Weiyun Wang, Zhangwei Gao, Lixin Gu, Hengjun Pu, Long Cui, Xingguang Wei, Zhaoyang Liu, Linglin Jing, Shenglong Ye, Jie Shao, et al. Internvl3.5: Advancing open-source multimodal models in versatility, reasoning, and efficiency.arXiv preprint arXiv:2508.18265, 2025

  38. [38]

    Cfvbench: A comprehensive video benchmark for fine-grained multimodal retrieval-augmented generation

    Kaiwen Wei, Xiao Liu, Jie Zhang, Zijian Wang, Ruida Liu, Yuming Yang, Xin Xiao, Xiao Sun, Haoyang Zeng, Changzai Pan, et al. Cfvbench: A comprehensive video benchmark for fine-grained multimodal retrieval-augmented generation. InProceedings of the ACM Web Conference 2026, pages 2501–2512, 2026

  39. [39]

    Gap: Graph- based agent planning with parallel tool use and reinforcement learning.arXiv preprint arXiv:2510.25320, 2025

    Jiaqi Wu, Qinlao Zhao, Zefeng Chen, Kai Qin, Yifei Zhao, Xueqian Wang, and Yuhang Yao. Gap: Graph- based agent planning with parallel tool use and reinforcement learning.arXiv preprint arXiv:2510.25320, 2025

  40. [40]

    Next-qa: Next phase of question-answering to explaining temporal actions

    Junbin Xiao, Xindi Shang, Angela Yao, and Tat-Seng Chua. Next-qa: Next phase of question-answering to explaining temporal actions. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 9777–9786, 2021

  41. [41]

    MM-REACT: Prompting ChatGPT for Multimodal Reasoning and Action

    Zhengyuan Yang, Linjie Li, Jianfeng Wang, Kevin Lin, Ehsan Azarnasab, Faisal Ahmed, Zicheng Liu, Ce Liu, Michael Zeng, and Lijuan Wang. Mm-react: Prompting chatgpt for multimodal reasoning and action.arXiv preprint arXiv:2303.11381, 2023

  42. [42]

    Doraemongpt: Toward understanding dynamic scenes with large language models (exemplified as a video agent). arXiv preprint arXiv:2401.08392, 2024

    Zongxin Yang, Guikun Chen, Xiaodi Li, Wenguan Wang, and Yi Yang. Doraemongpt: Toward understanding dynamic scenes with large language models (exemplified as a video agent). arXiv preprint arXiv:2401.08392, 2024

  43. [43]

    React: Synergizing reasoning and acting in language models

    Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran, Karthik R Narasimhan, and Yuan Cao. React: Synergizing reasoning and acting in language models. InThe eleventh international conference on learning representations, 2022

  44. [44]

    Videoarm: Agentic reasoning over hierarchical memory for long-form video understanding.arXiv preprint arXiv:2512.12360, 2025

    Yufei Yin, Qianke Meng, Minghao Chen, Jiajun Ding, Zhenwei Shao, and Zhou Yu. Videoarm: Agentic reasoning over hierarchical memory for long-form video understanding.arXiv preprint arXiv:2512.12360, 2025

  45. [45]

    Recode: Unify plan and action for universal granularity control.arXiv preprint arXiv:2510.23564, 2025

    Zhaoyang Yu, Jiayi Zhang, Huixue Su, Yufan Zhao, Yifan Wu, Mingyi Deng, Jinyu Xiang, Yizhang Lin, Lingxiao Tang, Yuyu Luo, et al. Recode: Unify plan and action for universal granularity control.arXiv preprint arXiv:2510.23564, 2025

  46. [46]

    Videoexplorer: Think with videos for agentic long-video understanding

    Huaying Yuan, Zheng Liu, Junjie Zhou, Hongjin Qian, Yan Shu, Nicu Sebe, Ji-Rong Wen, and Zhicheng Dou. Videoexplorer: Think with videos for agentic long-video understanding. arXiv preprint arXiv:2506.10821, 2025

  47. [47]

    Thinking with videos: Multimodal tool-augmented reinforcement learning for long video reasoning

    Haoji Zhang, Xin Gu, Jiawen Li, Chixiang Ma, Sule Bai, Chubin Zhang, Bowen Zhang, Zhichao Zhou, Dongliang He, and Yansong Tang. Thinking with videos: Multimodal tool-augmented reinforcement learning for long video reasoning. arXiv preprint arXiv:2508.04416, 2025

  48. [48]

    Flash-vstream: Efficient real-time understanding for long video streams

    Haoji Zhang, Yiqin Wang, Yansong Tang, Yong Liu, Jiashi Feng, and Xiaojie Jin. Flash-vstream: Efficient real-time understanding for long video streams. In Proceedings of the IEEE/CVF international conference on computer vision, pages 21059–21069, 2025

  49. [49]

    Long Context Transfer from Language to Vision

    Peiyuan Zhang, Kaichen Zhang, Bo Li, Guangtao Zeng, Jingkang Yang, Yuanhan Zhang, Ziyue Wang, Haoran Tan, Chunyuan Li, and Ziwei Liu. Long context transfer from language to vision. arXiv preprint arXiv:2406.16852, 2024

  50. [50]

    Deep video discovery: Agentic search with tool use for long-form video understanding

    Xiaoyi Zhang, Zhaoyang Jia, Zongyu Guo, Jiahao Li, Bin Li, Houqiang Li, and Yan Lu. Deep video discovery: Agentic search with tool use for long-form video understanding. arXiv preprint arXiv:2505.18079, 2025

  51. [51]

    Qwen3 Embedding: Advancing Text Embedding and Reranking Through Foundation Models

    Yanzhao Zhang, Mingxin Li, Dingkun Long, Xin Zhang, Huan Lin, Baosong Yang, Pengjun Xie, An Yang, Dayiheng Liu, Junyang Lin, Fei Huang, and Jingren Zhou. Qwen3 embedding: Advancing text embedding and reranking through foundation models. arXiv preprint arXiv:2506.05176, 2025

  52. [52]

    Mlvu: Benchmarking multi-task long video understanding

    Junjie Zhou, Yan Shu, Bo Zhao, Boya Wu, Zhengyang Liang, Shitao Xiao, Minghao Qin, Xi Yang, Yongping Xiong, Bo Zhang, et al. Mlvu: Benchmarking multi-task long video understanding. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 13691–13701, 2025

A Additional Details of MVTL

A.1 Tool Counting and Exposure Prot...

Planner prompt guidelines (appendix excerpt):

- **Thinking Process**: You may use `<think>...</think>` tags for internal reasoning
- **Final JSON**: Your response MUST conclude with a single, valid JSON object
- **Variable Referencing**: Use "$" only for runtime result pointers from prior tool outputs (e.g., "$last_inspect_frame_result", "$frame_evidence_from_Inspect_Frame")
- **Tool Names and Capabilities**: You may assume arbitrary specialized tools exist when planning, name a plausible specialized tool, and describe the capability. Do not rely on a hard-coded planner tool inventory in the prompt.

# Required JSON Schema

{ "Thought": "Concise reasoning (max 2 short sentences)", "Plan": "Short step plan", "Evidence": [ { "label...

Continue with tools:

<think> ... </think> { "Thought": "...", "Plan": "...", "Actions": [{"tool": "ACTUAL_TOOL_NAME_HERE", "params": {"arg": "value"}}] }

Finish the task:

<think> ... </think> { "Thought": "...", "Plan": "...", "Actions": [], "Finish": { "chain_complete": true, "completion_basis": "...", "answer": "..." } }

Never output non-empty `Actions` together with a valid top-level `Finish`.

# Data Access Capabilities

Tools can access two tiers of video information. Specify your needs in the tool descr...

- **Cross-tool verification**: By default, validate each piece of evidence and the final conclusion with results from at least two different tools, especially for visual evidence, ambiguous evidence, fine-grained actions, future-event prediction, or when candidate answers are easily confused. If one tool already provides clear, direct, and decisive evidence with little ambiguity, you may finish without extra verification
- **Web_Search is not local-video evidence**: The answer must come primarily from the local video; do not arbitrarily use Web_Search to replace retrieval or frame checking, etc.
- **Exact-value tasks need Evidence tracking**: When a query asks for exact values, totals, counts, prices, or scenario comparisons from the video, keep an `Evidence` array updated with one item per required fact/scenario
- **No fabricated tool history**: Never claim that search results were empty, a tool already failed, or prior attempts happened unless that information explicitly appears in the actual Observation history of this run
- **Image-timestamp alignment is explicit**: For both initial grounding collages and tool-returned collages, sub-frames are arranged in chronological order clockwise from the top-left: top-left, top-right, bottom-right, bottom-left. The timestamps provided alongside that collage correspond to those sub-frames in the same order
- **Attached images are already direct evidence**: If an Observation includes attached images from a visual tool, those images are already visible evidence for you. Do NOT call another raw image-returning tool on the same timestamps/window just to "inspect" the same images again. Re-probe only if you materially change the time window or apply `Crop`/`Zoom_...
- **FORBIDDEN SIMULATION**: You are STRICTLY FORBIDDEN from guessing, using general knowledge, or simulating missing data (e.g., assuming a 2% payment rate)
- **MISSING INFO = PROBE TRIGGER**: If a value is missing from the initial summary, it is a MANDATE to search the raw video. Switch to retrieval, verification, or direct-evidence probing instead of guessing.
- **EVIDENCE-ONLY PARAMETERS**: Every parameter passed to a logic tool MUST be explicitly found first. Never call a tool with "assumed" numbers.

# Video-Centric Probe Hierarchy

- **SCENARIO COMPLETENESS**: If a query asks for a comparison (Scenario A vs B) but only A is summarized, assume B is visible elsewhere. Use tools to probe the remainder of the timeline
- **VISUAL IS TRUTH**: Visual frames contain data that summaries often miss. If textual info is vague, your first priority is to define a visual analysis tool to "see" the exact data
- **LOGICAL RESTRAINT**: Logic tools (like `Python_Executor`) are ONLY for processing data that has already been retrieved from the evidence
- **TEMPORAL CONTINUITY**: After you find one promising segment, continue probing nearby windows before restarting a broad search
- **APPROXIMATE TEXT IS NOT FINAL**: If transcript/OCR says "around", "about", "approximately", "~", or similar, treat it as insufficient for final answering and re-check the exact screen region
- **CROSS-MODAL CHAIN IS PLANNER-DECLARED**: For exact-value video tasks, do not answer from a single modality. Use a chain like text->visual->answer or visual->text->answer. Only set `Finish.chain_complete=true` after you judge that the required chain is complete
- **PYTHON IS NOT A VISION FALLBACK**: Never use `Python_Executor` to infer what local video/images show from filenames or paths. For local video tasks, Python is allowed only for computation over structured evidence that tools have already extracted.

# Termination Protocol

- **MANDATORY CITATION**: Your final "answer" must explicitly cite the timestamp for every value used. If question contains options, ...
- **Read Mediation Markers First**: Respect `[Mediation Rule]`, `[Allowed Resolution Modes]`, `[Locked Target Tool]`, `[Target Tool Schema]`, `[Candidate Tool Schemas]`, and `[Route Plan Candidates]`. If the runtime provides `[Composite Intent Guidance]` with `recommended_resolution="parallel_decomposition"`, treat that as a strong hint to prefer L4 ov...
- If mediation is L2, keep the target tool fixed by default. Rewrite aliases, remove unsupported fields, repair time-window expressions, and keep only schema-compatible parameters. Treat tool-search schemas as authoritative parameter requirements. If the schema says `t_start`, `t_end`, and `density`, do not emit old aliases like `time_range_start`, `time_r...
- **Escalate Only When Necessary**: In locked-target mode, you may switch to a replacement tool (L3) or parallel decomposition (L4) only if you judge the target tool is semantically incompatible or cannot complete the task alone
- If mediation is `L3 or L4`, use route metadata and candidate tools to decide between: one executable replacement tool, or multiple child tools returned together in `Actions`. When the request simultaneously asks about multiple evidence dimensions such as actions, style/category, subtitles or audio cues, and scene distinctions/comparisons, prefer multi...
- **Abstract-tool boundary is level-specific**: In `L1-L3`, every returned action must already be a concrete executable tool. Do not output placeholder or abstract tool names. Only in `L4` may you selectively return abstract child tools in `Actions`, when one tool is insufficient and the task must be decomposed for resolution
- **L4 Semantics**: For each child tool produced at `L4`, the runtime will run Tool_Search plus parameter validation independently. Child tools that become valid concrete tools will execute directly. Child tools that still cannot be executed (may be abstract tools or invalid) may be decomposed again then executed. Requests may contain nested tools, where one...
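The prompt's output contract is mechanically checkable: responses may open with a `<think>...</think>` preamble but must end in a single JSON object, `$`-prefixed strings are reserved for runtime pointers to prior tool outputs, and a non-empty `Actions` array must never co-occur with a valid top-level `Finish`. A minimal validator sketch of those three rules; the function names are illustrative, not from the paper:

```python
import json
import re

def parse_planner_reply(raw: str) -> dict:
    """Strip an optional <think>...</think> preamble and parse the JSON body."""
    body = re.sub(r"<think>.*?</think>", "", raw, flags=re.DOTALL).strip()
    return json.loads(body)

def validate_planner_reply(reply: dict) -> list[str]:
    """Return violations of the planner output contract (empty list = valid)."""
    errors = []
    if "Thought" not in reply or "Plan" not in reply:
        errors.append("missing required 'Thought'/'Plan' fields")
    actions = reply.get("Actions", [])
    finish = reply.get("Finish")
    # Non-empty Actions and a valid top-level Finish are mutually exclusive.
    if actions and finish and finish.get("chain_complete"):
        errors.append("non-empty 'Actions' together with a valid 'Finish'")
    for act in actions:
        # Every emitted action needs a concrete (non-placeholder) tool name.
        if not isinstance(act.get("tool"), str) or not act["tool"]:
            errors.append("action without a concrete tool name")
        for val in act.get("params", {}).values():
            # "$"-prefixed values must actually name a prior tool result.
            if isinstance(val, str) and val == "$":
                errors.append("bare '$' is not a valid runtime result pointer")
    return errors

raw = ('<think>need one probe</think> '
      '{"Thought": "Check t=12s.", "Plan": "inspect frame", '
      '"Actions": [{"tool": "Inspect_Frame", '
      '"params": {"timestamp": "$last_inspect_frame_result"}}]}')
assert validate_planner_reply(parse_planner_reply(raw)) == []
```

A "finish" reply would pass the same check only with `"Actions": []`, matching the rule that execution and termination never mix in one step.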
