Do Multimodal Agents Really Benefit from Tool Use? A Systematic Study of Capability Gains

Donglei Yu; Garvin Guo; Huaxing Liu; Minpeng Liao; Qinghao Wang; Shuai Li; Xiang Wang; Xinpei Zhao; Yu Chen

arxiv: 2606.02357 · v1 · pith:3ALO7SHLnew · submitted 2026-06-01 · 💻 cs.CV · cs.AI

Do Multimodal Agents Really Benefit from Tool Use? A Systematic Study of Capability Gains

Garvin Guo , Donglei Yu , Yu Chen , Xiang Wang , Shuai Li , Xinpei Zhao , Huaxing Liu , Qinghao Wang

show 1 more author

Minpeng Liao

This is my paper

Pith reviewed 2026-06-28 14:56 UTC · model grok-4.3

classification 💻 cs.CV cs.AI

keywords multimodal agentstool usecapability evaluationagent benchmarkstool-augmented reasoningOCRchart understandingmathematical reasoning

0 comments

The pith

Tool access yields little consistent improvement for multimodal agents beyond learning call patterns.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tests whether tool use in multimodal agents like Thyme and DeepEyesV2 delivers real capability gains on real-world understanding, OCR, chart tasks, and math reasoning. Each agent is measured against a Tool-Free counterpart and a Pure-Text Reasoner trained on the same data without tool trajectories. Aggregate scores show minimal gains from tools, token costs do not reliably drop, and 93 to 96 percent of the problems solved only when tools are present are also solved in at least one non-tool condition. Ablations of the tool loop reveal that the call format or the execution result alone often matches full tool use. The work concludes that current agents learn reliable tool-calling behavior more than they acquire new problem-solving power from the tools.

Core claim

Across the studied agents and tasks, tool access produces little consistent aggregate improvement and does not reliably reduce generated-token cost. Only a small tool-only solved set remains: 93 percent of DeepEyesV2 tool-solved problems and 96 percent of Thyme tool-solved problems are also solved by at least one non-tool setting. Mechanism ablations show the full tool-use loop does not consistently outperform the tool-call format or the returned execution result alone.

What carries the argument

The tool-only solved set, which counts problems solved exclusively when tools are available but not in any non-tool control condition, serves as the direct measure of capability expansion from tool use.

If this is right

Tool access does not reliably reduce generated-token cost in the evaluated settings.
The complete tool-use loop does not consistently outperform the tool-call format or execution result alone.
Agents learn tool-calling patterns more reliably than they acquire tool-contributed capabilities.
Evaluation protocols should separate tool availability from whether tools expand the set of solvable problems.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same pattern of limited tool-only gains may appear in other multimodal agents trained with similar reinforcement or supervised signals.
Benchmarks could be strengthened by adding explicit controls that test whether tool results supply answer-critical information rather than just format cues.
Training objectives that reward integration of tool outputs rather than call frequency might shift agents toward greater capability expansion.
Parallel studies on text-only agents would clarify whether the observed pattern is specific to multimodal tool loops.

Load-bearing premise

The Tool-Free counterpart and Pure-Text Reasoner trained from the same source pool without tool-calling trajectories form fair controls that isolate the contribution of tool use itself.

What would settle it

A substantially larger set of problems solved exclusively by the full tool-use condition and unsolved by all non-tool and partial-tool ablations would demonstrate genuine capability gains from tool access.

Figures

Figures reproduced from arXiv: 2606.02357 by Donglei Yu, Garvin Guo, Huaxing Liu, Minpeng Liao, Qinghao Wang, Shuai Li, Xiang Wang, Xinpei Zhao, Yu Chen.

**Figure 2.** Figure 2: Solved-set coverage of Tool-Enabled Agents relative to non-tool settings across task families. Most [PITH_FULL_IMAGE:figures/full_fig_p004_2.png] view at source ↗

**Figure 3.** Figure 3: Process-level analysis of tool use. Tool calls often reflect redundant confirmation, failed contribution, [PITH_FULL_IMAGE:figures/full_fig_p006_3.png] view at source ↗

**Figure 4.** Figure 4: Average generated-token counts across task families for the Tool-Enabled Agent, the Tool-Free Agent, [PITH_FULL_IMAGE:figures/full_fig_p019_4.png] view at source ↗

read the original abstract

Tool-augmented multimodal agents show strong benchmark gains, often taken as evidence that agents have learned to use tools. We argue that this interpretation can be premature: a tool-call trace alone does not show whether the tool supplied answer-critical information. We study two representative ``thinking with images'' agents, Thyme and DeepEyesV2, across real-world understanding, OCR, chart understanding, and mathematical reasoning. Each agent is compared with its Tool-Free counterpart and with a Pure-Text Reasoner trained from the same source pool without tool-calling trajectories. Tool access yields little consistent aggregate improvement, does not reliably reduce generated-token cost, and leaves only a small tool-only solved set: 93% of DeepEyesV2's tool-solved problems and 96% of Thyme's are also solved by at least one non-tool setting. Mechanism ablations further show that the full tool-use loop does not consistently outperform either the tool-call format or the returned execution result alone. In the settings we study, the analyzed agents appear to learn tool-calling patterns more reliably than tool-contributed capabilities, suggesting that evaluation should distinguish tool availability from whether tools actually expand what agents can solve.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

Tool use here mostly teaches calling patterns rather than expanding what the agents can actually solve.

read the letter

The punchline here is that these two agents mostly solve the same problems with or without tools, which undercuts the idea that tool use is expanding their capabilities much.

The paper does a good job laying out a head-to-head comparison across several task categories: real-world understanding, OCR, chart understanding, and math reasoning. By training Pure-Text Reasoners from the same source pool without tool-calling trajectories, they try to separate the effect of tool access from other factors. The finding that 93% and 96% of tool-solved problems are covered by non-tool settings is a concrete number that future work can build on or challenge. The mechanism ablations showing that the full loop doesn't always beat just the format or the result are also worth noting.

What stands out is the attempt to quantify the "tool-only solved set." That moves the discussion beyond aggregate scores.

Where it might be soft is in verifying that the controls are matched properly. The abstract mentions the Pure-Text Reasoner is trained from the same source pool, but without details on data filtering, compute used, or performance on non-tool tasks, it's possible the baselines aren't equivalent in capability before tool use is added. If the non-tool models are weaker for other reasons, the small tool-only set could be misleading. The stress-test concern about this isolation assumption seems on point based on what's in the abstract.

This paper is aimed at people building or evaluating tool-using multimodal agents. Anyone running benchmarks on agents will find the cautionary note useful.

I would recommend sending it for peer review. The empirical question is worth pursuing, and the setup is clear enough that referees can check the methods and data.

Referee Report

1 major / 0 minor

Summary. The paper claims that tool access in multimodal agents Thyme and DeepEyesV2 yields little consistent aggregate improvement over Tool-Free counterparts and Pure-Text Reasoners trained from the same source pool without tool-calling trajectories. Across real-world understanding, OCR, chart understanding, and mathematical reasoning, tool use does not reliably reduce generated-token cost, and 93% of DeepEyesV2's tool-solved problems and 96% of Thyme's are also solved by at least one non-tool setting. Mechanism ablations indicate the full tool-use loop does not consistently outperform tool-call format or execution result alone, suggesting agents learn tool-calling patterns more reliably than tool-contributed capabilities.

Significance. If the result holds after verification of baseline parity, the finding would be significant for multimodal agent research by challenging the common interpretation of benchmark gains as evidence of learned tool use and by advocating evaluations that separate tool availability from actual expansion of solvable problems. The study supplies concrete overlap statistics and ablation outcomes that could inform more rigorous agent assessment protocols.

major comments (1)

[Abstract] Abstract: The central claim that tool access adds little capability beyond patterns rests on Tool-Free counterparts and Pure-Text Reasoners serving as fair controls that isolate the contribution of tool use. The abstract supplies no quantitative comparison of training compute, data volume, filtering effects, or parity on non-tool tasks, so performance gaps cannot yet be attributed specifically to the presence or absence of tool-calling trajectories.

Simulated Author's Rebuttal

1 responses · 0 unresolved

We thank the referee for the detailed feedback. We address the concern regarding the abstract below.

read point-by-point responses

Referee: The central claim that tool access adds little capability beyond patterns rests on Tool-Free counterparts and Pure-Text Reasoners serving as fair controls that isolate the contribution of tool use. The abstract supplies no quantitative comparison of training compute, data volume, filtering effects, or parity on non-tool tasks, so performance gaps cannot yet be attributed specifically to the presence or absence of tool-calling trajectories.

Authors: We agree that the abstract would be strengthened by briefly noting the quantitative controls. The full manuscript (Methods and Appendix) establishes that Tool-Free and Pure-Text models were trained from the identical source pool with the same data volume, compute budget, and hyperparameters, with no differential filtering applied. Direct parity on non-tool tasks is reported via side-by-side benchmark results. We will revise the abstract to include a concise statement referencing these equalities so that the isolation of tool-use effects is clearer from the outset. revision: yes

Circularity Check

0 steps flagged

No circularity: purely empirical comparison study

full rationale

The paper conducts a systematic empirical evaluation of two multimodal agents (Thyme and DeepEyesV2) against Tool-Free counterparts and Pure-Text Reasoners trained from the same source pool. All claims rest on direct benchmark performance measurements, ablation studies, and counts of tool-only solved problems across real-world understanding, OCR, chart, and math tasks. No equations, derivations, fitted parameters, or mathematical predictions are present. No self-citations are invoked as load-bearing uniqueness theorems or ansatzes. The controls are described explicitly in the abstract and methods as removing tool-calling trajectories while sharing the source pool; any debate over their fairness concerns experimental validity rather than circular reduction of a claimed derivation to its inputs. The study is self-contained against external benchmarks and contains no steps that match the enumerated circularity patterns.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Empirical study; no mathematical derivations, fitted parameters, or new postulated entities are introduced.

pith-pipeline@v0.9.1-grok · 5766 in / 947 out tokens · 24843 ms · 2026-06-28T14:56:07.782735+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

16 extracted references · 2 canonical work pages · 2 internal anchors

[1]

MM-Eureka: Exploring the Frontiers of Multimodal Reasoning with Rule-based Reinforcement Learning

Visual-rft: Visual reinforcement fine-tuning. In2025 IEEE/CVF International Conference on Com- puter Vision (ICCV), pages 2034–2044. Pan Lu, Hritik Bansal, Tony Xia, Jiacheng Liu, Chun- yuan Li, Hannaneh Hajishirzi, Hao Cheng, Kai- Wei Chang, Michel Galley, and Jianfeng Gao. 2024. MathVista: Evaluating mathematical reasoning of foundation models in visual...

work page internal anchor Pith review Pith/arXiv arXiv 2034
[2]

LogicVista: Multimodal llm logical reason- ing benchmark in visual contexts.arXiv preprint arXiv:2407.04973. An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, Chujie Zheng, Day- iheng Liu, Fan Zhou, Fei Huang, Feng Hu, Hao Ge, Haoran Wei, Huan Lin, Jialong Tang, and 41 others. 2025. Qwen...

work page internal anchor Pith review Pith/arXiv arXiv 2025
[3]

InThe Fourteenth International Conference on Learning Representa- tions

Deepeyes: Incentivizing ”thinking with im- ages” via reinforcement learning. InThe Fourteenth International Conference on Learning Representa- tions. Chengke Zou, Xingang Guo, Rui Yang, Junyu Zhang, Bin Hu, and Huan Zhang. 2025. DynaMath: A dy- namic visual benchmark for evaluating mathematical reasoning robustness of vision language models. In The Thirte...

2025
[4]

,→ ,→ - Confirmatory: the tool mainly confirms or repeats an already available conclusion.,→ - Irrelevant: the tool output is not materially useful for solving the task.,→

Information Gain - Novel: the tool provides new task-relevant information not already available from the model's unaided reasoning. ,→ ,→ - Confirmatory: the tool mainly confirms or repeats an already available conclusion.,→ - Irrelevant: the tool output is not materially useful for solving the task.,→
[5]

Tool Output Quality - Useful/Correct: the tool output is correct and meaningfully usable.,→ - Partially useful: the tool output is partially correct, incomplete, or only weakly useful.,→ - Wrong/Failed: the tool output is wrong, failed, or unusable.,→
[6]

Integration Status - Used correctly: the final reasoning meaningfully uses the tool output in a correct way. ,→ ,→ - Ignored: useful or potentially useful tool output is not actually used.,→ - Misused/Misinterpreted: the model uses the tool output incorrectly or is misled by it.,→ Important decision rules:
[7]

Only assign Novel if the tool output clearly reveals or computes information ,→ ,→ that was not already available from the pre-tool reasoning.,→

Novel should be rare. Only assign Novel if the tool output clearly reveals or computes information ,→ ,→ that was not already available from the pre-tool reasoning.,→
[8]

If the tool only redisplays the original image without crop, zoom, OCR, enhancement, measurement, ,→ ,→ parsing, or other transformation, do NOT label it as Novel.,→
[9]

If the tool output is empty, trivial, or only repeats the original image view, it should usually be ,→ ,→ Confirmatory or Irrelevant, not Novel
[10]

A correct final answer can still correspond to ,→ ,→ Confirmatory, Partially useful, or even Irrelevant tool use.,→

Do not infer tool usefulness from final correctness. A correct final answer can still correspond to ,→ ,→ Confirmatory, Partially useful, or even Irrelevant tool use.,→
[11]

If it adds ,→ ,→ little or no new evidence, prefer Partially useful or Wrong/Failed as appropriate.,→

Useful/Correct requires that the tool output itself is materially informative and correct. If it adds ,→ ,→ little or no new evidence, prefer Partially useful or Wrong/Failed as appropriate.,→
[12]

If the post-tool answer mostly repeats the same conclusion already stated before the tool call, that is ,→ ,→ evidence against Novel information gain
[13]

It is NOT a valid value for tool_output_quality or ,→ ,→ integration_status

Irrelevant is ONLY valid for information_gain. It is NOT a valid value for tool_output_quality or ,→ ,→ integration_status
[14]

If the tool output is empty, blank, trivial, or unusable, then tool_output_quality should usually be ,→ ,→ Wrong/Failed
[15]

If the tool output is merely the same unprocessed image view, with no crop / zoom / OCR / enhancement / ,→ ,→ measurement / parsing, then tool_output_quality should usually be Partially useful at most, and often ,→ ,→ Wrong/Failed if the execution adds nothing
[16]

information_gain

If the tool output adds nothing and the final answer comes from the same pre-tool reasoning, prefer ,→ ,→ integration_status = Ignored, not Used correctly.,→ Field-specific allowed values: - information_gain must be exactly one of: Novel, Confirmatory, Irrelevant,→ - tool_output_quality must be exactly one of: Useful/Correct, Partially useful, Wrong/Faile...

[1] [1]

MM-Eureka: Exploring the Frontiers of Multimodal Reasoning with Rule-based Reinforcement Learning

Visual-rft: Visual reinforcement fine-tuning. In2025 IEEE/CVF International Conference on Com- puter Vision (ICCV), pages 2034–2044. Pan Lu, Hritik Bansal, Tony Xia, Jiacheng Liu, Chun- yuan Li, Hannaneh Hajishirzi, Hao Cheng, Kai- Wei Chang, Michel Galley, and Jianfeng Gao. 2024. MathVista: Evaluating mathematical reasoning of foundation models in visual...

work page internal anchor Pith review Pith/arXiv arXiv 2034

[2] [2]

LogicVista: Multimodal llm logical reason- ing benchmark in visual contexts.arXiv preprint arXiv:2407.04973. An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, Chujie Zheng, Day- iheng Liu, Fan Zhou, Fei Huang, Feng Hu, Hao Ge, Haoran Wei, Huan Lin, Jialong Tang, and 41 others. 2025. Qwen...

work page internal anchor Pith review Pith/arXiv arXiv 2025

[3] [3]

InThe Fourteenth International Conference on Learning Representa- tions

Deepeyes: Incentivizing ”thinking with im- ages” via reinforcement learning. InThe Fourteenth International Conference on Learning Representa- tions. Chengke Zou, Xingang Guo, Rui Yang, Junyu Zhang, Bin Hu, and Huan Zhang. 2025. DynaMath: A dy- namic visual benchmark for evaluating mathematical reasoning robustness of vision language models. In The Thirte...

2025

[4] [4]

,→ ,→ - Confirmatory: the tool mainly confirms or repeats an already available conclusion.,→ - Irrelevant: the tool output is not materially useful for solving the task.,→

Information Gain - Novel: the tool provides new task-relevant information not already available from the model's unaided reasoning. ,→ ,→ - Confirmatory: the tool mainly confirms or repeats an already available conclusion.,→ - Irrelevant: the tool output is not materially useful for solving the task.,→

[5] [5]

Tool Output Quality - Useful/Correct: the tool output is correct and meaningfully usable.,→ - Partially useful: the tool output is partially correct, incomplete, or only weakly useful.,→ - Wrong/Failed: the tool output is wrong, failed, or unusable.,→

[6] [6]

Integration Status - Used correctly: the final reasoning meaningfully uses the tool output in a correct way. ,→ ,→ - Ignored: useful or potentially useful tool output is not actually used.,→ - Misused/Misinterpreted: the model uses the tool output incorrectly or is misled by it.,→ Important decision rules:

[7] [7]

Only assign Novel if the tool output clearly reveals or computes information ,→ ,→ that was not already available from the pre-tool reasoning.,→

Novel should be rare. Only assign Novel if the tool output clearly reveals or computes information ,→ ,→ that was not already available from the pre-tool reasoning.,→

[8] [8]

If the tool only redisplays the original image without crop, zoom, OCR, enhancement, measurement, ,→ ,→ parsing, or other transformation, do NOT label it as Novel.,→

[9] [9]

If the tool output is empty, trivial, or only repeats the original image view, it should usually be ,→ ,→ Confirmatory or Irrelevant, not Novel

[10] [10]

A correct final answer can still correspond to ,→ ,→ Confirmatory, Partially useful, or even Irrelevant tool use.,→

Do not infer tool usefulness from final correctness. A correct final answer can still correspond to ,→ ,→ Confirmatory, Partially useful, or even Irrelevant tool use.,→

[11] [11]

If it adds ,→ ,→ little or no new evidence, prefer Partially useful or Wrong/Failed as appropriate.,→

Useful/Correct requires that the tool output itself is materially informative and correct. If it adds ,→ ,→ little or no new evidence, prefer Partially useful or Wrong/Failed as appropriate.,→

[12] [12]

If the post-tool answer mostly repeats the same conclusion already stated before the tool call, that is ,→ ,→ evidence against Novel information gain

[13] [13]

It is NOT a valid value for tool_output_quality or ,→ ,→ integration_status

Irrelevant is ONLY valid for information_gain. It is NOT a valid value for tool_output_quality or ,→ ,→ integration_status

[14] [14]

If the tool output is empty, blank, trivial, or unusable, then tool_output_quality should usually be ,→ ,→ Wrong/Failed

[15] [15]

If the tool output is merely the same unprocessed image view, with no crop / zoom / OCR / enhancement / ,→ ,→ measurement / parsing, then tool_output_quality should usually be Partially useful at most, and often ,→ ,→ Wrong/Failed if the execution adds nothing

[16] [16]

information_gain

If the tool output adds nothing and the final answer comes from the same pre-tool reasoning, prefer ,→ ,→ integration_status = Ignored, not Used correctly.,→ Field-specific allowed values: - information_gain must be exactly one of: Novel, Confirmatory, Irrelevant,→ - tool_output_quality must be exactly one of: Useful/Correct, Partially useful, Wrong/Faile...