Pith Number

pith:7XCJPIKR

pith:2026:7XCJPIKRCOGJ7CQH6DK2A4AIP5

not attested · not anchored · not stored · refs resolved

ReTool-Video: Recursive Tool-Using Video Agents with Meta-Augmented Tool Grounding

Changjian Wang, Guohui Xiang, Jiang Zhong, Junnan Zhu, KaiWen Wei, Nayu Liu, Rongzhen Li, Ruirui Chen, Xiao Liu

ReTool-Video recursively grounds abstract video intents into executable tool chains using a library of 134 meta-augmented tools.

arxiv:2605.13228 v1 · 2026-05-13 · cs.CV · cs.AI

Add to your LaTeX paper

\usepackage{pith}
\pithnumber{7XCJPIKRCOGJ7CQH6DK2A4AIP5}

Prints a linked badge after your title and injects PDF metadata. Compiles on arXiv.

Record completeness

1 Bitcoin timestamp

2 Internet Archive

3 Author claim open

4 Citations open

5 Replications open

✓ Portable graph bundle live

The bundle contains the canonical record plus signed events. A mirror can host it anywhere and recompute the same current state with the deterministic merge algorithm.
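
A minimal sketch of what a deterministic merge could look like, so that any mirror recomputes the same state regardless of the order in which it received events. The event shape (`at`, `field`, `value`) and the last-writer-wins rule are assumptions made for illustration; the page does not describe the actual event schema or merge rules, and signature checking is left out.

```python
import hashlib
import json


def event_id(event: dict) -> str:
    """Content hash of an event, used for deduplication and as a stable tie-breaker."""
    payload = json.dumps(event, sort_keys=True, separators=(",", ":")).encode()
    return hashlib.sha256(payload).hexdigest()


def merge_state(canonical_record: dict, events: list[dict]) -> dict:
    """Fold events into a state dict in an order independent of input order.

    Hypothetical event shape: {"at": ISO-8601 UTC string, "field": str, "value": str}.
    """
    # Deduplicate by content hash so replayed events do not change the result.
    unique = {event_id(e): e for e in events}
    # Total order: timestamp first, content hash second to break ties.
    ordered = sorted(unique.items(), key=lambda kv: (kv[1]["at"], kv[0]))
    state = {"record": canonical_record, "fields": {}}
    for _, event in ordered:
        # Last writer wins per field (an assumed rule, not the documented one).
        state["fields"][event["field"]] = event["value"]
    return state
```

Because the fold runs over a sorted, deduplicated list, two mirrors holding the same set of events reach the same state even if one received them shuffled or duplicated.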

Claims

C1 · strongest claim

Experiments on MVBench, MLVU, and Video-MME w/o sub. show that ReTool-Video consistently outperforms strong baselines. Further analysis demonstrates that recursive grounding and fine-grained meta tools improve the stability and effectiveness of complex video understanding.

C2 · weakest assumption

That high-level video intents can be reliably matched or decomposed by the resolver into the 134 registered tools without introducing errors, excessive recursion, or loss of reasoning fidelity.

C3 · one line summary

ReTool-Video uses a 134-tool meta-augmented library and recursive grounding to translate abstract video intents into fine-grained multimodal operations, outperforming baselines on MVBench, MLVU, and Video-MME.
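
The recursive grounding idea in the claims above can be sketched as a resolver that either matches an intent to a registered tool or decomposes it into sub-intents and recurses. This is a toy illustration under stated assumptions, not the paper's implementation: the tool names, the decomposition table, and the depth limit are all hypothetical, and the real library registers 134 tools.

```python
from typing import Callable

# Hypothetical toy registry: tool name -> callable operating on a video handle.
TOOLS: dict[str, Callable[[str], str]] = {
    "sample_frames": lambda video: f"frames({video})",
    "detect_objects": lambda video: f"objects({video})",
    "transcribe_audio": lambda video: f"transcript({video})",
}

# Hypothetical decomposition rules: abstract intent -> ordered sub-intents.
DECOMPOSE: dict[str, list[str]] = {
    "summarize_scene": ["describe_visuals", "transcribe_audio"],
    "describe_visuals": ["sample_frames", "detect_objects"],
}


def ground(intent: str, max_depth: int = 4) -> list[str]:
    """Recursively expand an intent into a flat chain of registered tool names."""
    if intent in TOOLS:
        return [intent]  # base case: the intent is already an executable tool
    if max_depth == 0:
        raise RecursionError(f"depth limit reached while grounding {intent!r}")
    if intent not in DECOMPOSE:
        raise LookupError(f"no tool or decomposition for {intent!r}")
    chain: list[str] = []
    for sub_intent in DECOMPOSE[intent]:
        chain.extend(ground(sub_intent, max_depth - 1))
    return chain


def run(intent: str, video: str) -> list[str]:
    """Ground the intent, then execute the resulting tool chain in order."""
    return [TOOLS[name](video) for name in ground(intent)]
```

The depth limit and the two explicit failure modes correspond to the risks named in C2: an intent that cannot be matched or decomposed fails loudly, and runaway recursion is cut off instead of looping.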

References

89 extracted · 89 resolved · 11 Pith anchors

[1] Model System Cards 2025

[2] ShareGPT4Video: Improving video understanding and generation with better captions. Advances in Neural Information Processing Systems, 37:19472–19495, 2024

[3] VideoLLaMA 2: Advancing Spatial-Temporal Modeling and Audio Understanding in Video-LLMs 2024 · arXiv:2406.07476

[4] Video question answering with procedural programs 2024

[5] Gemini 2.5: Pushing the Frontier with Advanced Reasoning, Multimodality, Long Context, and Next Generation Agentic Capabilities 2025 · arXiv:2507.06261

Receipt and verification

First computed	2026-05-18T02:44:49.613036Z
Builder	pith-number-builder-2026-05-17-v1
Signature	Pith Ed25519 (`pith-v1-2026-05`)
Schema	pith-number/v1.0

Canonical hash

fdc497a151138c9f8a07f0d5a070087f6ffd930ccc1a1e41edb1c2f35c299013
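
For this record, the 26-character Pith Number coincides with the first 26 characters of the standard Base32 (RFC 4648) encoding of the canonical hash. The sketch below checks that relationship; it is an observation about this one record, not a rule stated on this page, and the helper name `base32_prefix` is hypothetical.

```python
import base64

CANONICAL_HASH = "fdc497a151138c9f8a07f0d5a070087f6ffd930ccc1a1e41edb1c2f35c299013"
PITH_NUMBER = "7XCJPIKRCOGJ7CQH6DK2A4AIP5"


def base32_prefix(hex_digest: str, length: int = 26) -> str:
    """Base32-encode the raw digest bytes and keep the first `length` characters."""
    encoded = base64.b32encode(bytes.fromhex(hex_digest)).decode("ascii")
    return encoded[:length]


print(base32_prefix(CANONICAL_HASH) == PITH_NUMBER)  # True
```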

Aliases

arxiv: 2605.13228 · arxiv_version: 2605.13228v1 · doi: 10.48550/arxiv.2605.13228 · pith_short_12: 7XCJPIKRCOGJ · pith_short_16: 7XCJPIKRCOGJ7CQH · pith_short_8: 7XCJPIKR
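
The aliases listed above follow a visible pattern: the short forms are prefixes of the full Pith Number, and the DOI and versioned ID are built from the arXiv ID. A small sketch that rebuilds them, assuming these patterns hold in general (they are inferred from this record only) and using a hypothetical helper name:

```python
PITH_NUMBER = "7XCJPIKRCOGJ7CQH6DK2A4AIP5"
ARXIV_ID = "2605.13228"
ARXIV_VERSION = 1


def build_aliases(pith_number: str, arxiv_id: str, version: int) -> dict[str, str]:
    """Rebuild the alias table from the full Pith Number and the arXiv source."""
    aliases = {
        "arxiv": arxiv_id,
        "arxiv_version": f"{arxiv_id}v{version}",
        "doi": f"10.48550/arxiv.{arxiv_id}",
    }
    # Short forms are plain prefixes of the full identifier.
    for length in (8, 12, 16):
        aliases[f"pith_short_{length}"] = pith_number[:length]
    return aliases


for key, value in sorted(build_aliases(PITH_NUMBER, ARXIV_ID, ARXIV_VERSION).items()):
    print(f"{key}: {value}")
```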

Verify this Pith Number yourself

curl -sH 'Accept: application/ld+json' https://pith.science/pith/7XCJPIKRCOGJ7CQH6DK2A4AIP5 \
  | jq -c '.canonical_record' \
  | python3 -c "import sys,json,hashlib; b=json.dumps(json.loads(sys.stdin.read()), sort_keys=True, separators=(',',':'), ensure_ascii=False).encode(); print(hashlib.sha256(b).hexdigest())"
# expect: fdc497a151138c9f8a07f0d5a070087f6ffd930ccc1a1e41edb1c2f35c299013

Canonical record JSON

{
  "metadata": {
    "abstract_canon_sha256": "f1688adf80f2975bf8fa10833ac208bb0ccc94eaf0296ecdbc44f220013acc4a",
    "cross_cats_sorted": [
      "cs.AI"
    ],
    "license": "http://arxiv.org/licenses/nonexclusive-distrib/1.0/",
    "primary_cat": "cs.CV",
    "submitted_at": "2026-05-13T09:19:22Z",
    "title_canon_sha256": "94b40a10610ce69c1fd17eef5f72ab6985bb50e9173e1bff79511a606bf95cf8"
  },
  "schema_version": "1.0",
  "source": {
    "id": "2605.13228",
    "kind": "arxiv",
    "version": 1
  }
}
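
The same canonicalization as the shell one-liner can be run offline against the record printed above. This is a sketch that assumes the JSON shown here is the complete canonical record; the function name `canonical_sha256` is hypothetical. Compare the printed digest with the canonical hash listed on this page.

```python
import hashlib
import json


def canonical_sha256(record: dict) -> str:
    """Serialize with sorted keys and compact separators, then hash the UTF-8 bytes."""
    canonical = json.dumps(
        record, sort_keys=True, separators=(",", ":"), ensure_ascii=False
    ).encode("utf-8")
    return hashlib.sha256(canonical).hexdigest()


# Canonical record copied from this page.
RECORD = {
    "metadata": {
        "abstract_canon_sha256": "f1688adf80f2975bf8fa10833ac208bb0ccc94eaf0296ecdbc44f220013acc4a",
        "cross_cats_sorted": ["cs.AI"],
        "license": "http://arxiv.org/licenses/nonexclusive-distrib/1.0/",
        "primary_cat": "cs.CV",
        "submitted_at": "2026-05-13T09:19:22Z",
        "title_canon_sha256": "94b40a10610ce69c1fd17eef5f72ab6985bb50e9173e1bff79511a606bf95cf8",
    },
    "schema_version": "1.0",
    "source": {"id": "2605.13228", "kind": "arxiv", "version": 1},
}

EXPECTED = "fdc497a151138c9f8a07f0d5a070087f6ffd930ccc1a1e41edb1c2f35c299013"

digest = canonical_sha256(RECORD)
print(digest)
print("match" if digest == EXPECTED else "mismatch")
```

Sorting keys and fixing the separators makes the byte string independent of how the JSON was formatted or ordered in transit, which is what lets two parties hash the same record and get the same result.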