{"paper":{"title":"ReTool-Video: Recursive Tool-Using Video Agents with Meta-Augmented Tool Grounding","license":"http://arxiv.org/licenses/nonexclusive-distrib/1.0/","headline":"ReTool-Video recursively grounds abstract video intents into executable tool chains using a library of 134 meta-augmented tools.","cross_cats":["cs.AI"],"primary_cat":"cs.CV","authors_text":"Changjian Wang, Guohui Xiang, Jiang Zhong, Junnan Zhu, KaiWen Wei, Nayu Liu, Rongzhen Li, Ruirui Chen, Xiao Liu","submitted_at":"2026-05-13T09:19:22Z","abstract_excerpt":"Video understanding requires active evidence seeking, motivating tool-augmented video agents for temporal reasoning, cross-modal understanding, and complex question answering. Existing video agents have improved video reasoning with retrieval, memory, frame inspection, and verifier tools, but they still face two limitations: (1) a coarse tool space that lacks fine-grained operations for compositional reasoning; and (2) a flat action space that forces high-level video intents into primitive executable tool calls. In this paper, we address these challenges with two complementary designs. First, "},"claims":{"count":4,"items":[{"kind":"strongest_claim","text":"Experiments on MVBench, MLVU, and Video-MME w/o sub. show that ReTool-Video consistently outperforms strong baselines. Further analysis demonstrates that recursive grounding and fine-grained meta tools improve the stability and effectiveness of complex video understanding.","source":"verdict.strongest_claim","status":"machine_extracted","claim_id":"C1","attestation":"unclaimed"},{"kind":"weakest_assumption","text":"That high-level video intents can be reliably matched or decomposed by the resolver into the 134 registered tools without introducing errors, excessive recursion, or loss of reasoning fidelity.","source":"verdict.weakest_assumption","status":"machine_extracted","claim_id":"C2","attestation":"unclaimed"},{"kind":"one_line_summary","text":"ReTool-Video uses a 134-tool meta-augmented library and recursive grounding to translate abstract video intents into fine-grained multimodal operations, outperforming baselines on MVBench, MLVU, and Video-MME.","source":"verdict.one_line_summary","status":"machine_extracted","claim_id":"C3","attestation":"unclaimed"},{"kind":"headline","text":"ReTool-Video recursively grounds abstract video intents into executable tool chains using a library of 134 meta-augmented tools.","source":"verdict.pith_extraction.headline","status":"machine_extracted","claim_id":"C4","attestation":"unclaimed"}],"snapshot_sha256":"da242774fea16a4a4d0c5680a18ae5c657dc379c6a5a17aabceadd0d270362f3"},"source":{"id":"2605.13228","kind":"arxiv","version":1},"verdict":{"id":"0a4c5d12-6012-421a-ab93-d5da07c29031","model_set":{"reader":"grok-4.3"},"created_at":"2026-05-14T20:04:07.533132Z","strongest_claim":"Experiments on MVBench, MLVU, and Video-MME w/o sub. show that ReTool-Video consistently outperforms strong baselines. Further analysis demonstrates that recursive grounding and fine-grained meta tools improve the stability and effectiveness of complex video understanding.","one_line_summary":"ReTool-Video uses a 134-tool meta-augmented library and recursive grounding to translate abstract video intents into fine-grained multimodal operations, outperforming baselines on MVBench, MLVU, and Video-MME.","pipeline_version":"pith-pipeline@v0.9.0","weakest_assumption":"That high-level video intents can be reliably matched or decomposed by the resolver into the 134 registered tools without introducing errors, excessive recursion, or loss of reasoning fidelity.","pith_extraction_headline":"ReTool-Video recursively grounds abstract video intents into executable tool chains using a library of 134 meta-augmented tools."},"references":{"count":89,"sample":[{"doi":"","year":2025,"title":"Model System Cards","work_id":"48f69590-3d62-41e5-87e8-e792337e716a","ref_index":1,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":2024,"title":"Sharegpt4video: Improving video understanding and generation with better captions.Advances in Neural Information Processing Systems, 37:19472–19495, 2024","work_id":"9b40200f-b968-41d0-b007-b4deebd1b256","ref_index":2,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":2024,"title":"VideoLLaMA 2: Advancing Spatial-Temporal Modeling and Audio Understanding in Video-LLMs","work_id":"ccfc3f89-c510-45f1-8a35-ed1a56c0ae5c","ref_index":3,"cited_arxiv_id":"2406.07476","is_internal_anchor":true},{"doi":"","year":2024,"title":"Video question answering with procedural programs","work_id":"a31fff1b-ca21-4c20-bbac-fbad471f690a","ref_index":4,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":2025,"title":"Gemini 2.5: Pushing the Frontier with Advanced Reasoning, Multimodality, Long Context, and Next Generation Agentic Capabilities","work_id":"008df105-2fdd-45d8-857a-8e35868aecb6","ref_index":5,"cited_arxiv_id":"2507.06261","is_internal_anchor":true}],"resolved_work":89,"snapshot_sha256":"64acdeda5f654aa9d140477dac2d5590cd096eefd333f32db4c0d6996eb8ab0b","internal_anchors":11},"formal_canon":{"evidence_count":0,"snapshot_sha256":"258153158e38e3291e3d48162225fcdb2d5a3ed65a07baac614ab91432fd4f57"},"author_claims":{"count":0,"strong_count":0,"snapshot_sha256":"258153158e38e3291e3d48162225fcdb2d5a3ed65a07baac614ab91432fd4f57"},"builder_version":"pith-number-builder-2026-05-17-v1"}