Recognition: no theorem link
ReTool-Video: Recursive Tool-Using Video Agents with Meta-Augmented Tool Grounding
Pith reviewed 2026-05-14 20:04 UTC · model grok-4.3
The pith
ReTool-Video recursively grounds abstract video intents into executable tool chains using a library of 134 meta-augmented tools.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
ReTool-Video proposes a recursive tool-using method that grounds high-level video intents into executable tool chains, where matched actions execute directly and unmatched intents are delegated to a resolver for parameter repair, tool substitution, or decomposition. This is enabled by the MetaAug-Video Tool Library (MVTL) containing 134 registered tools—26 base tools for general multimodal signal processing and 108 meta tools for intermediate operations—supporting diverse scenarios through dual-level access to structured video information and raw modal evidence.
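The paper releases no implementation, so the following is a minimal sketch of the match-or-delegate loop described above. Every name here (Intent, ToolRegistry, Resolution, MAX_DEPTH, the resolver callable) is hypothetical; in the paper the resolver role is played by the agent itself, and the recursion cap below is an assumption, not a reported detail.

```python
# Hypothetical sketch of ReTool-Video's match-or-delegate loop.
# All identifiers are invented for illustration; the paper publishes no code.
from dataclasses import dataclass, field

MAX_DEPTH = 4  # assumed recursion cap; the paper does not state one

@dataclass
class Intent:
    name: str                        # e.g. "repeated_event_aggregation"
    params: dict = field(default_factory=dict)

@dataclass
class Resolution:
    kind: str                        # "repair" | "substitute" | "decompose"
    params: dict = None
    tool: str = None
    children: list = None

class ToolRegistry:
    """Stand-in for MVTL: tool name -> (callable, required parameter names)."""
    def __init__(self, tools):
        self.tools = tools

    def match(self, intent):
        entry = self.tools.get(intent.name)
        if entry is None:
            return None              # unknown tool name: unmatched intent
        fn, required = entry
        if not required <= intent.params.keys():
            return None              # schema mismatch: candidate for repair
        return fn

def ground(intent, registry, resolver, depth=0):
    """Execute matched intents directly; delegate unmatched ones to the resolver."""
    if depth > MAX_DEPTH:
        raise RuntimeError(f"recursion limit hit while grounding {intent.name}")
    fn = registry.match(intent)
    if fn is not None:
        return fn(**intent.params)   # matched: execute directly
    res = resolver(intent, registry) # LLM-backed mediation in the paper
    if res.kind == "repair":         # fix parameters, keep the target tool
        return ground(Intent(intent.name, res.params), registry, resolver, depth + 1)
    if res.kind == "substitute":     # swap in a registered replacement tool
        return ground(Intent(res.tool, intent.params), registry, resolver, depth + 1)
    if res.kind == "decompose":      # split into child intents, recurse on each
        return [ground(c, registry, resolver, depth + 1) for c in res.children]
    raise ValueError(f"unresolvable intent: {intent.name}")
```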
What carries the argument
The recursive grounding mechanism in ReTool-Video, which delegates unmatched high-level intents to a resolver for progressive translation into concrete multimodal tool operations, supported by the MVTL library of 134 tools.
If this is right
- Abstract actions such as temporal merging, cross-modal verification, and repeated-event aggregation can be translated into concrete multimodal operations at runtime (a toy composition is sketched after this list).
- Recursive decomposition reduces the need for rigid pre-mapping of intents and improves handling of compositional video reasoning.
- Fine-grained meta tools for filtering, aggregation, and reranking increase stability and effectiveness on tasks requiring temporal and multimodal evidence.
- Performance gains hold across MVBench, MLVU, and Video-MME without subtitles when both recursion and meta tools are used.
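To make the list above concrete, here is a toy composition of one hypothetical base tool and three meta tools for a repeated-event question. The paper names only the meta-tool categories (filtering, aggregation, reranking, formatting); the individual tool names and signatures below are invented.

```python
# Invented tool chain for "how many times does the target event happen?"
def detect_events(frames, scorer, label):     # hypothetical base tool
    """Score every (timestamp, frame) pair for the target event label."""
    return [(t, scorer(f, label)) for t, f in frames]

def filter_by_score(events, threshold):       # meta tool: filtering
    return [(t, s) for t, s in events if s >= threshold]

def merge_temporal(events, gap):              # meta tool: temporal merging
    """Collapse detections closer than `gap` seconds into single events."""
    merged, last_t = [], None
    for t, s in sorted(events):
        if last_t is not None and t - last_t <= gap:
            start, best = merged[-1]
            merged[-1] = (start, max(best, s))   # extend the current event
        else:
            merged.append((t, s))                # open a new event
        last_t = t
    return merged

def count_events(events):                     # meta tool: aggregation
    return len(events)

# One plausible grounded chain:
# detect_events -> filter_by_score -> merge_temporal -> count_events
```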
Where Pith is reading between the lines
- The resolver-based decomposition could be tested on queries engineered to trigger repeated failures, revealing the practical depth limit of recursion.
- Similar recursive tool libraries might transfer to other sequential modalities such as audio streams or long document navigation.
- Extending the 134-tool set with domain-specific additions could support specialized video domains without retraining the core agent.
- Measuring resolver error rates directly on held-out intents would quantify how much of the reported gains depend on reliable decomposition.
Load-bearing premise
High-level video intents can be reliably matched or decomposed by the resolver into the 134 registered tools without introducing errors, excessive recursion, or loss of reasoning fidelity.
What would settle it
A controlled test set of video queries with deliberately ambiguous or complex intents where the resolver produces frequent mismatches or excessive recursion, resulting in accuracy on MVBench no higher than non-recursive baselines.
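A minimal harness for that settling test might look as follows, assuming a `run_agent` entry point with a recursive on/off switch and a pre-built set of adversarial queries; none of these interfaces are published artifacts of the paper.

```python
# Sketch of the settling experiment: same MVTL, recursion toggled.
from statistics import mean

def settle(ambiguous_queries, run_agent):
    """Compare recursive vs. flat grounding on a deliberately hard query set."""
    correct = {"recursive": [], "flat": []}
    depths = []
    for q in ambiguous_queries:
        ok_r, depth = run_agent(q, recursive=True)   # (bool, max recursion depth)
        ok_f, _ = run_agent(q, recursive=False)      # same tools, no resolver recursion
        correct["recursive"].append(ok_r)
        correct["flat"].append(ok_f)
        depths.append(depth)
    acc_r, acc_f = mean(correct["recursive"]), mean(correct["flat"])
    # The claim fails if recursive accuracy does not exceed the flat variant
    # on this adversarial set.
    return {"acc_recursive": acc_r, "acc_flat": acc_f,
            "mean_depth": mean(depths), "claim_survives": acc_r > acc_f}
```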
Original abstract
Video understanding requires active evidence seeking, motivating tool-augmented video agents for temporal reasoning, cross-modal understanding, and complex question answering. Existing video agents have improved video reasoning with retrieval, memory, frame inspection, and verifier tools, but they still face two limitations: (1) a coarse tool space that lacks fine-grained operations for compositional reasoning; and (2) a flat action space that forces high-level video intents into primitive executable tool calls. In this paper, we address these challenges with two complementary designs. First, we construct a MetaAug-Video Tool Library (MVTL), an extensible tool library with 134 registered tools, including 26 base tools for general multimodal signal processing and 108 meta tools for filtering, aggregation, reranking, formatting, and other intermediate-result operations. MVTL supports dual-level access to both structured video information and raw modal evidence, enabling diverse video reasoning scenarios. Second, we propose ReTool-Video, a recursive tool-using method that grounds high-level video intents into executable tool chains. In ReTool-Video, matched actions are executed directly, while unmatched intents are delegated to a resolver for parameter repair, tool substitution, or decomposition. This allows abstract actions such as temporal merging, cross-modal verification, or repeated-event aggregation to be progressively translated into concrete multimodal operations at runtime. Experiments on MVBench, MLVU, and Video-MME w/o sub. show that ReTool-Video consistently outperforms strong baselines. Further analysis demonstrates that recursive grounding and fine-grained meta tools improve the stability and effectiveness of complex video understanding.
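The abstract's "dual-level access" separates structured video information from raw modal evidence. A minimal sketch of how a registry could expose both tiers follows; all names are invented for illustration and do not come from the paper.

```python
# Hypothetical dual-level access: each tool declares which evidence tier it reads.
from dataclasses import dataclass
from enum import Enum

class Tier(Enum):
    STRUCTURED = "structured"   # captions, shot lists, ASR transcripts, ...
    RAW = "raw"                 # decoded frames, audio waveform, ...

@dataclass
class VideoContext:             # invented container holding both tiers
    structured: dict
    raw: dict

class Tool:
    def __init__(self, name, tier, fn):
        self.name, self.tier, self.fn = name, tier, fn

    def __call__(self, ctx: VideoContext, **params):
        # Route the call to the evidence tier the tool declared at registration.
        evidence = ctx.structured if self.tier is Tier.STRUCTURED else ctx.raw
        return self.fn(evidence, **params)

# Base tools would mostly declare Tier.RAW; meta tools (filtering,
# aggregation, reranking, formatting) operate on structured intermediates.
```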
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes ReTool-Video, a recursive tool-using agent for video understanding that grounds high-level intents into executable chains via a resolver. It introduces the MetaAug-Video Tool Library (MVTL) with 134 tools (26 base multimodal processing tools plus 108 meta tools for filtering, aggregation, reranking, and formatting). The central claim is that this dual-level tool access plus recursive delegation for unmatched intents (repair, substitution, or decomposition) yields consistent gains over baselines on MVBench, MLVU, and Video-MME (without subtitles), while improving stability for complex temporal and cross-modal reasoning.
Significance. If the empirical results and ablation analyses hold, the work offers a concrete mechanism for scaling tool-augmented video agents beyond flat action spaces. The extensible MVTL and resolver-based recursion could provide a reusable substrate for compositional video reasoning, with potential impact on downstream tasks requiring evidence-seeking over long videos.
major comments (3)
- [Abstract / Experiments] The claim that ReTool-Video 'consistently outperforms strong baselines' is stated without reported numerical scores, baseline names, or statistical significance tests. This leaves the magnitude and reliability of the gains unassessable and makes it impossible to determine whether improvements exceed what would be expected from simply increasing the tool count.
- [Abstract / §4] The resolver is described as handling unmatched intents via repair, substitution, or decomposition, yet no resolver accuracy metric, recursion-depth statistics, or error-propagation analysis is referenced. Without these, gains cannot be confidently attributed to recursive grounding rather than the sheer size of the 134-tool library (a telemetry sketch follows this list).
- [Further analysis] The statement that 'recursive grounding and fine-grained meta tools improve stability' requires supporting ablations (e.g., ReTool-Video vs. a non-recursive baseline with identical MVTL, or a meta-tool ablation). Absent such controls, the causal contribution of the two proposed designs remains unverified.
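The telemetry asked for in the second major comment could be collected with a few counters around the resolver. The hook below is a hypothetical sketch of such instrumentation, not something the paper reports.

```python
# Invented telemetry: per-mode resolver success rates and a depth histogram.
from collections import Counter

class ResolverStats:
    def __init__(self):
        self.outcomes = Counter()   # key: (mode, succeeded) -> count
        self.depths = Counter()     # key: recursion depth -> count

    def record(self, mode, succeeded, depth):
        """Call once per resolver invocation, e.g. inside the grounding loop."""
        self.outcomes[(mode, succeeded)] += 1
        self.depths[depth] += 1

    def success_rate(self, mode):
        ok = self.outcomes[(mode, True)]
        bad = self.outcomes[(mode, False)]
        return ok / (ok + bad) if ok + bad else float("nan")
```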
minor comments (2)
- [Abstract] The abstract mentions 'Video-MME w/o sub.' without defining the abbreviation or the exact evaluation protocol on first use.
- [§3 / §4] Notation for the resolver and MVTL access modes is introduced without a compact diagram or pseudocode in the method overview, making the dual-level access hard to visualize.
Simulated Author's Rebuttal
We thank the referee for the thoughtful and constructive comments on our work. We address each major comment below and will revise the manuscript to incorporate the suggested improvements for clarity and rigor.
Point-by-point responses
- Referee: [Abstract / Experiments] The claim that ReTool-Video 'consistently outperforms strong baselines' is stated without reported numerical scores, baseline names, or statistical significance tests. This leaves the magnitude and reliability of the gains unassessable and makes it impossible to determine whether improvements exceed what would be expected from simply increasing the tool count.
Authors: We acknowledge that the abstract provides a high-level summary without specific numerical values or baseline names. The experiments section does include detailed comparisons, but to make the claims more assessable, we will revise the abstract to include key performance metrics (e.g., accuracy improvements on each benchmark) and explicitly name the baselines. Additionally, we will add statistical significance tests in the experiments section and reference them in the abstract to confirm that the gains are meaningful beyond the increased tool count. revision: yes
- Referee: [Abstract / §4] The resolver is described as handling unmatched intents via repair, substitution, or decomposition, yet no resolver accuracy metric, recursion-depth statistics, or error-propagation analysis is referenced. Without these, gains cannot be confidently attributed to recursive grounding rather than the sheer size of the 134-tool library.
Authors: We agree that quantitative evaluation of the resolver would help isolate the contribution of recursive grounding. In the revised manuscript, we will add metrics for resolver accuracy (success rates for repair, substitution, and decomposition), statistics on recursion depth across queries, and an analysis of error propagation. We will also include a comparison to a non-recursive variant using the full MVTL to demonstrate that the recursive mechanism provides benefits beyond the tool library size. revision: yes
- Referee: [Further analysis] The statement that 'recursive grounding and fine-grained meta tools improve stability' requires supporting ablations (e.g., ReTool-Video vs. a non-recursive baseline with identical MVTL, or a meta-tool ablation). Absent such controls, the causal contribution of the two proposed designs remains unverified.
Authors: We recognize the need for explicit ablations to support the claims about stability improvements. We will expand the further analysis section with new ablation studies: one comparing ReTool-Video to a non-recursive baseline with the same 134-tool MVTL, and another ablating the meta tools to measure their specific impact. These will provide direct evidence for the causal contributions of recursive grounding and the meta-augmented tools to stability and performance. revision: yes
Circularity Check
No circularity: design choices evaluated on external benchmarks
full rationale
The paper constructs MVTL (134 tools) and ReTool-Video's recursive resolver by design, then measures performance on independent external benchmarks (MVBench, MLVU, Video-MME). No equations, fitted parameters, or predictions reduce to quantities derived from the same data used to define the tools or resolver. No self-citation chains or uniqueness theorems are invoked to justify core claims. The evaluation is therefore external to the design choices it tests.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption Standard video understanding benchmarks (MVBench, MLVU, Video-MME) provide valid measures of complex temporal and cross-modal reasoning.
invented entities (2)
- MetaAug-Video Tool Library (MVTL): no independent evidence
- ReTool-Video recursive resolver: no independent evidence
Reference graph
Works this paper leans on
- [1] Anthropic. Claude Sonnet 4.5 System Card. https://www.anthropic.com/claude-sonnet-4-5-system-card, September 2025.
- [2] Lin Chen, Xilin Wei, Jinsong Li, Xiaoyi Dong, Pan Zhang, Yuhang Zang, Zehui Chen, Haodong Duan, Bin Lin, Zhenyu Tang, et al. ShareGPT4Video: Improving video understanding and generation with better captions. Advances in Neural Information Processing Systems, 37:19472–19495, 2024.
- [3] Zesen Cheng, Sicong Leng, Hang Zhang, Yifei Xin, Xin Li, Guanzheng Chen, Yongxin Zhu, Wenqi Zhang, Ziyang Luo, Deli Zhao, et al. VideoLLaMA 2: Advancing spatial-temporal modeling and audio understanding in video-LLMs. arXiv preprint arXiv:2406.07476, 2024.
- [4] Rohan Choudhury, Koichiro Niinuma, Kris M Kitani, and László A Jeni. Video question answering with procedural programs. In European Conference on Computer Vision, pages 315–332. Springer, 2024.
- [5] Gheorghe Comanici, Eric Bieber, Mike Schaekermann, Ice Pasupat, Noveen Sachdeva, Inderjit Dhillon, Marcel Blistein, Ori Ram, Dan Zhang, Evan Rosen, et al. Gemini 2.5: Pushing the frontier with advanced reasoning, multimodality, long context, and next generation agentic capabilities. arXiv preprint arXiv:2507.06261, 2025.
- [6] Sunqi Fan, Jiashuo Cui, Meng-Hao Guo, and Shuojin Yang. Tool-augmented spatiotemporal reasoning for streamlining video question answering task. arXiv preprint arXiv:2512.10359, 2025.
- [7] Yue Fan, Xiaojian Ma, Rujie Wu, Yuntao Du, Jiaqi Li, Zhi Gao, and Qing Li. VideoAgent: A memory-augmented multimodal agent for video understanding. In European Conference on Computer Vision, pages 75–92. Springer, 2024.
- [8] Chaoyou Fu, Yuhan Dai, Yongdong Luo, Lei Li, Shuhuai Ren, Renrui Zhang, Zihan Wang, Chenyu Zhou, Yunhang Shen, Mengdan Zhang, et al. Video-MME: The first-ever comprehensive evaluation benchmark of multi-modal LLMs in video analysis. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 24108–24118, 2025.
- [9] Haonan Ge, Yiwei Wang, Kai-Wei Chang, Hang Wu, and Yujun Cai. FrameMind: Frame-interleaved video reasoning via reinforcement learning. arXiv preprint arXiv:2509.24008, 2025.
- [10] Aaron Hurst, Adam Lerer, Adam P Goucher, Adam Perelman, Aditya Ramesh, Aidan Clark, AJ Ostrow, Akila Welihinda, Alan Hayes, Alec Radford, et al. GPT-4o system card. arXiv preprint arXiv:2410.21276, 2024.
- [11] Ranjay Krishna, Kenji Hata, Frederic Ren, Li Fei-Fei, and Juan Carlos Niebles. Dense-captioning events in videos. In Proceedings of the IEEE International Conference on Computer Vision, pages 706–715, 2017.
- [12] Jie Lei, Licheng Yu, Mohit Bansal, and Tamara Berg. TVQA: Localized, compositional video question answering. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 1369–1379, 2018.
- [13] Ao Li, Yuexiang Xie, Songze Li, Fugee Tsung, Bolin Ding, and Yaliang Li. Agent-oriented planning in multi-agent systems. arXiv preprint arXiv:2410.02189, 2024.
- [14] Chenglin Li, Qianglong Chen, Feng Han, Yikun Wang, Xingxi Yin, Yan Gong, Ruilin Li, Yin Zhang, and Jiaqi Wang. VideoThinker: Building agentic VideoLLMs with LLM-guided tool reasoning. arXiv preprint arXiv:2601.15724, 2026.
- [15] Kunchang Li, Yali Wang, Yinan He, Yizhuo Li, Yi Wang, Yi Liu, Zun Wang, Jilan Xu, Guo Chen, Ping Luo, et al. MVBench: A comprehensive multi-modal video understanding benchmark. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 22195–22206, 2024.
- [16] Mingxin Li, Yanzhao Zhang, Dingkun Long, Chen Keqin, Sibo Song, Shuai Bai, Zhibo Yang, Pengjun Xie, An Yang, Dayiheng Liu, Jingren Zhou, and Junyang Lin. Qwen3-VL-Embedding and Qwen3-VL-Reranker: A unified framework for state-of-the-art multimodal retrieval and ranking. arXiv preprint arXiv:2601.04720, 2026.
- [17] Yanwei Li, Chengyao Wang, and Jiaya Jia. LLaMA-VID: An image is worth 2 tokens in large language models. In European Conference on Computer Vision, pages 323–340. Springer, 2024.
- [18] Bin Lin, Yang Ye, Bin Zhu, Jiaxi Cui, Munan Ning, Peng Jin, and Li Yuan. Video-LLaVA: Learning united visual representation by alignment before projection. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pages 5971–5984, 2024.
- [19] Jingyang Lin, Jialian Wu, Jiang Liu, Ximeng Sun, Ze Wang, Xiaodong Yu, Jiebo Luo, Zicheng Liu, and Emad Barsoum. VideoSeek: Long-horizon video agent with tool-guided seeking. arXiv preprint arXiv:2603.20185, 2026.
- [20] Haowei Liu, Xi Zhang, Haiyang Xu, Yuyang Wanyan, Junyang Wang, Ming Yan, Ji Zhang, Chunfeng Yuan, Changsheng Xu, Weiming Hu, et al. PC-Agent: A hierarchical multi-agent collaboration framework for complex task automation on PC. arXiv preprint arXiv:2502.14282, 2025.
- [21] Jiajun Liu, Yibing Wang, Hanghang Ma, Xiaoping Wu, Xiaoqi Ma, Xiaoming Wei, Jianbin Jiao, Enhua Wu, and Jie Hu. Kangaroo: A powerful video-language model supporting long-context video input. International Journal of Computer Vision, 134(3):114, 2026.
- [22] Ye Liu, Kevin Qinghong Lin, Chang Wen Chen, and Mike Zheng Shou. VideoMind: A chain-of-LoRA agent for long video reasoning. arXiv e-prints, arXiv–2503, 2025.
- [23] Zhijian Liu, Ligeng Zhu, Baifeng Shi, Zhuoyang Zhang, Yuming Lou, Shang Yang, Haocheng Xi, Shiyi Cao, Yuxian Gu, Dacheng Li, et al. NVILA: Efficient frontier visual language models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 4122–4134, 2025.
- [24] Ziyu Ma, Chenhui Gou, Hengcan Shi, Bin Sun, Shutao Li, Hamid Rezatofighi, and Jianfei Cai. DrVideo: Document retrieval based long video understanding. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 18936–18946, 2025.
- [25] Muhammad Maaz, Hanoona Rasheed, Salman Khan, and Fahad Khan. Video-ChatGPT: Towards detailed video understanding via large vision and language models. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 12585–12602, 2024.
- [26] Munan Ning, Bin Zhu, Yujia Xie, Bin Lin, Jiaxi Cui, Lu Yuan, Dongdong Chen, and Li Yuan. Video-Bench: A comprehensive benchmark and toolkit for evaluating video-based large language models. Computational Visual Media, 2025.
- [27] OpenAI. GPT-4V(ision) System Card. https://openai.com/index/gpt-4v-system-card/, September 2023. Accessed: 2025-09-19.
- [28] Qwen Team. Qwen3.5: Towards native multimodal agents, February 2026. URL https://qwen.ai/blog?id=qwen3.5.
- [29] Aniket Rege, Arka Sadhu, Yuliang Li, Kejie Li, Ramya Korlakai Vinayak, Yuning Chai, Yong Jae Lee, and Hyo Jin Kim. Agentic very long video understanding. arXiv preprint arXiv:2601.18157, 2026.
- [30] Timo Schick, Jane Dwivedi-Yu, Roberto Dessì, Roberta Raileanu, Maria Lomeli, Eric Hambro, Luke Zettlemoyer, Nicola Cancedda, and Thomas Scialom. Toolformer: Language models can teach themselves to use tools. Advances in Neural Information Processing Systems, 36:68539–68551, 2023.
- [31] Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, YK Li, Yang Wu, et al. DeepSeekMath: Pushing the limits of mathematical reasoning in open language models. arXiv preprint arXiv:2402.03300, 2024.
- [32] Yongliang Shen, Kaitao Song, Xu Tan, Dongsheng Li, Weiming Lu, and Yueting Zhuang. HuggingGPT: Solving AI tasks with ChatGPT and its friends in Hugging Face. Advances in Neural Information Processing Systems, 36:38154–38180, 2023.
- [33] Aaditya Singh, Adam Fry, Adam Perelman, Adam Tart, Adi Ganesh, Ahmed El-Kishky, Aidan McLaughlin, Aiden Low, AJ Ostrow, Akhila Ananthram, et al. OpenAI GPT-5 system card. arXiv preprint arXiv:2601.03267, 2025.
- [34] Enxin Song, Wenhao Chai, Guanhong Wang, Yucheng Zhang, Haoyang Zhou, Feiyang Wu, Haozhe Chi, Xun Guo, Tian Ye, Yanting Zhang, et al. MovieChat: From dense token to sparse memory for long video understanding. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 18221–18232, 2024.
- [35] Dídac Surís, Sachit Menon, and Carl Vondrick. ViperGPT: Visual inference via Python execution for reasoning. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 11888–11898, 2023.
- [36] Shulin Tian, Ruiqi Wang, Hongming Guo, Penghao Wu, Yuhao Dong, Xiuying Wang, Jingkang Yang, Hao Zhang, Hongyuan Zhu, and Ziwei Liu. Ego-R1: Chain-of-tool-thought for ultra-long egocentric video reasoning. arXiv preprint arXiv:2506.13654, 2025.
- [37] Weiyun Wang, Zhangwei Gao, Lixin Gu, Hengjun Pu, Long Cui, Xingguang Wei, Zhaoyang Liu, Linglin Jing, Shenglong Ye, Jie Shao, et al. InternVL3.5: Advancing open-source multimodal models in versatility, reasoning, and efficiency. arXiv preprint arXiv:2508.18265, 2025.
- [38] Kaiwen Wei, Xiao Liu, Jie Zhang, Zijian Wang, Ruida Liu, Yuming Yang, Xin Xiao, Xiao Sun, Haoyang Zeng, Changzai Pan, et al. CFVBench: A comprehensive video benchmark for fine-grained multimodal retrieval-augmented generation. In Proceedings of the ACM Web Conference 2026, pages 2501–2512, 2026.
- [39] Jiaqi Wu, Qinlao Zhao, Zefeng Chen, Kai Qin, Yifei Zhao, Xueqian Wang, and Yuhang Yao. GAP: Graph-based agent planning with parallel tool use and reinforcement learning. arXiv preprint arXiv:2510.25320, 2025.
- [40] Junbin Xiao, Xindi Shang, Angela Yao, and Tat-Seng Chua. NExT-QA: Next phase of question-answering to explaining temporal actions. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 9777–9786, 2021.
- [41] Zhengyuan Yang, Linjie Li, Jianfeng Wang, Kevin Lin, Ehsan Azarnasab, Faisal Ahmed, Zicheng Liu, Ce Liu, Michael Zeng, and Lijuan Wang. MM-REACT: Prompting ChatGPT for multimodal reasoning and action. arXiv preprint arXiv:2303.11381, 2023.
- [42] Zongxin Yang, Guikun Chen, Xiaodi Li, Wenguan Wang, and Yi Yang. DoraemonGPT: Toward understanding dynamic scenes with large language models (exemplified as a video agent). arXiv preprint arXiv:2401.08392, 2024.
- [43] Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran, Karthik R Narasimhan, and Yuan Cao. ReAct: Synergizing reasoning and acting in language models. In The Eleventh International Conference on Learning Representations, 2022.
- [44] Yufei Yin, Qianke Meng, Minghao Chen, Jiajun Ding, Zhenwei Shao, and Zhou Yu. VideoARM: Agentic reasoning over hierarchical memory for long-form video understanding. arXiv preprint arXiv:2512.12360, 2025.
- [45] Zhaoyang Yu, Jiayi Zhang, Huixue Su, Yufan Zhao, Yifan Wu, Mingyi Deng, Jinyu Xiang, Yizhang Lin, Lingxiao Tang, Yuyu Luo, et al. ReCode: Unify plan and action for universal granularity control. arXiv preprint arXiv:2510.23564, 2025.
- [46] Huaying Yuan, Zheng Liu, Junjie Zhou, Hongjin Qian, Yan Shu, Nicu Sebe, Ji-Rong Wen, and Zhicheng Dou. VideoExplorer: Think with videos for agentic long-video understanding. arXiv preprint arXiv:2506.10821, 2025.
- [47] Haoji Zhang, Xin Gu, Jiawen Li, Chixiang Ma, Sule Bai, Chubin Zhang, Bowen Zhang, Zhichao Zhou, Dongliang He, and Yansong Tang. Thinking with videos: Multimodal tool-augmented reinforcement learning for long video reasoning. arXiv preprint arXiv:2508.04416, 2025.
- [48] Haoji Zhang, Yiqin Wang, Yansong Tang, Yong Liu, Jiashi Feng, and Xiaojie Jin. Flash-VStream: Efficient real-time understanding for long video streams. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 21059–21069, 2025.
- [49] Peiyuan Zhang, Kaichen Zhang, Bo Li, Guangtao Zeng, Jingkang Yang, Yuanhan Zhang, Ziyue Wang, Haoran Tan, Chunyuan Li, and Ziwei Liu. Long context transfer from language to vision. arXiv preprint arXiv:2406.16852, 2024.
- [50] Xiaoyi Zhang, Zhaoyang Jia, Zongyu Guo, Jiahao Li, Bin Li, Houqiang Li, and Yan Lu. Deep video discovery: Agentic search with tool use for long-form video understanding. arXiv preprint arXiv:2505.18079, 2025.
- [51] Yanzhao Zhang, Mingxin Li, Dingkun Long, Xin Zhang, Huan Lin, Baosong Yang, Pengjun Xie, An Yang, Dayiheng Liu, Junyang Lin, Fei Huang, and Jingren Zhou. Qwen3 Embedding: Advancing text embedding and reranking through foundation models. arXiv preprint arXiv:2506.05176, 2025.
- [52] Junjie Zhou, Yan Shu, Bo Zhao, Boya Wu, Zhengyang Liang, Shitao Xiao, Minghao Qin, Xi Yang, Yongping Xiong, Bo Zhang, et al. MLVU: Benchmarking multi-task long video understanding. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 13691–13701, 2025.