Diversity Over Frequency: Rethinking Tool Use in Visual Chain-of-Thought Agents
Pith reviewed 2026-06-29 23:08 UTC · model grok-4.3
The pith
Visual agents reach higher accuracy by promoting diverse rollouts even as tool use declines during training.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
We identify a tool-use collapse phenomenon: models progressively stop using tools while still achieving higher task accuracy. Moreover, we observe a clear asymmetry: (i) completely eliminating tool use degrades performance, whereas (ii) incentivizing tool use yields only marginal gains despite substantially increasing usage. We find that vanilla training and tool-use encouragement both reduce rollout diversity, explaining why higher tool use does not yield stronger reasoning performance. Motivated by these findings, we add an entropy regularization term to encourage diverse rollout exploration, achieving the best performance despite gradually declining tool usage. Overall, our findings suggest a training-time view of tools as scaffolding, where broader exploration over language generation and visual tool invocation improves reasoning despite tool-use collapse.
What carries the argument
Entropy regularization term added during training to increase the diversity of language-generation and tool-invocation sequences in visual chain-of-thought rollouts.
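A minimal sketch of what such a term could look like, assuming a REINFORCE-style surrogate loss in which each step of a rollout samples one action (a text token or a tool-call token) from a categorical policy. The function names, the coefficient name `beta`, and the loss form are illustrative assumptions; this summary does not specify the paper's exact objective.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def entropy(probs):
    """Shannon entropy (in nats) of a probability distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def entropy_regularized_loss(step_logits, actions, advantage, beta=0.01):
    """Policy-gradient surrogate loss minus an entropy bonus.

    step_logits: per-step logits over the action vocabulary
    actions:     index of the sampled action at each step
    advantage:   scalar advantage assigned to the whole rollout
    beta:        hypothetical entropy coefficient (a free parameter)
    """
    if not actions or len(actions) != len(step_logits):
        raise ValueError("need one sampled action per step")
    pg_loss = 0.0
    ent = 0.0
    for logits, a in zip(step_logits, actions):
        probs = softmax(logits)
        pg_loss += -advantage * math.log(probs[a])  # REINFORCE term
        ent += entropy(probs)                       # per-step policy entropy
    n = len(actions)
    # Subtracting beta * entropy means minimizing the loss rewards
    # higher-entropy (more exploratory) policies.
    return pg_loss / n - beta * ent / n
```

With `beta=0` this reduces to the plain surrogate loss; a larger `beta` trades reward-seeking for broader exploration, which is the knob the review lists as the method's free parameter.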
If this is right
- Completely disabling tool access lowers accuracy on 3D spatial reasoning and medical visual question answering.
- Explicit rewards for tool use raise invocation rates but deliver only small accuracy lifts.
- Both standard training and tool-use encouragement shrink the variety of generated trajectories.
- Entropy regularization produces the highest accuracy while tool usage continues to fall.
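The distinction between tool-use frequency and trajectory variety in the points above can be made concrete with a small sketch. It assumes each rollout is recorded as a list of action labels and that tool calls carry a `tool:` prefix; the labels `tool:zoom` and `tool:crop`, the function names, and the choice of metrics are hypothetical and are not taken from the paper.

```python
import math
from collections import Counter

def tool_use_rate(rollouts):
    """Fraction of rollouts that invoke at least one tool."""
    if not rollouts:
        raise ValueError("need at least one rollout")
    with_tool = sum(
        1 for r in rollouts if any(step.startswith("tool:") for step in r)
    )
    return with_tool / len(rollouts)

def unique_fraction(rollouts):
    """Fraction of distinct action sequences within the group."""
    if not rollouts:
        raise ValueError("need at least one rollout")
    return len({tuple(r) for r in rollouts}) / len(rollouts)

def sequence_entropy(rollouts):
    """Empirical Shannon entropy (nats) over whole action sequences."""
    if not rollouts:
        raise ValueError("need at least one rollout")
    counts = Counter(tuple(r) for r in rollouts)
    n = len(rollouts)
    return -sum((c / n) * math.log(c / n) for c in counts.values())

# Toy groups for illustration only (not data from the paper).
frequent_but_uniform = [["think", "tool:zoom", "answer"]] * 4
varied = [
    ["think", "answer"],
    ["think", "tool:zoom", "answer"],
    ["think", "think", "answer"],
    ["tool:crop", "answer"],
]
```

In these toy groups, the first calls a tool in every rollout yet contains a single distinct trajectory, while the second calls tools in only half of its rollouts but has four distinct trajectories. The two quantities can therefore move in opposite directions.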
Where Pith is reading between the lines
- The scaffolding perspective implies that curricula could deliberately phase out tools once rollout diversity has been established.
- The same diversity-over-frequency pattern may appear in non-visual agent domains such as text-only reasoning or robotic planning.
- Future experiments could measure whether rollout diversity predicts final accuracy better than raw tool count across different model scales.
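One way the last comparison could be set up, as a sketch only: collect one diversity score, one tool count, and one final accuracy per training run, then compare how strongly each metric correlates with accuracy. The function names and the use of Pearson correlation are assumptions made for illustration; the paper may use a different analysis.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mx = sum(xs) / n
    my = sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(vx * vy)

def stronger_predictor(diversity, tool_count, accuracy):
    """Report which per-run metric correlates more strongly with accuracy.

    Returns (name, r_diversity, r_tool_count); strength is compared by
    absolute value, so a strong negative correlation also counts.
    """
    r_div = pearson(diversity, accuracy)
    r_tool = pearson(tool_count, accuracy)
    name = "diversity" if abs(r_div) > abs(r_tool) else "tool_count"
    return name, r_div, r_tool
```

Correlation across runs would only show association; it would not by itself establish that diversity drives accuracy.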
Load-bearing premise
Reduced rollout diversity caused by vanilla training and tool-use encouragement is the primary reason that higher tool frequency fails to improve reasoning performance.
What would settle it
An ablation that applies tool-use incentives while separately maintaining high rollout diversity and still records only marginal accuracy gains would indicate that diversity is not the decisive factor.
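The ablation described above can be laid out as a 2x2 grid: tool-use incentive on or off, crossed with an entropy bonus on or off. The sketch below assumes the incentive is a fixed bonus added to the task reward and that diversity is maintained through an entropy coefficient; both mechanisms, the default values, and all names are hypothetical. An entropy bonus does not guarantee that diversity stays high, so rollout diversity would still need to be measured in each arm.

```python
from itertools import product

def shaped_reward(correct, used_tool, tool_bonus):
    """Task reward (1 if correct, else 0) plus an optional tool-use bonus."""
    return float(correct) + (tool_bonus if used_tool else 0.0)

def ablation_arms(tool_bonus=0.1, entropy_coef=0.01):
    """Enumerate the four arms of the tool-incentive x entropy-bonus grid."""
    arms = []
    for use_bonus, use_entropy in product([False, True], repeat=2):
        arms.append({
            "name": "tool_bonus={},entropy={}".format(
                "on" if use_bonus else "off",
                "on" if use_entropy else "off",
            ),
            "tool_bonus": tool_bonus if use_bonus else 0.0,
            "entropy_coef": entropy_coef if use_entropy else 0.0,
        })
    return arms
```

The arm with both settings on is the decisive one: if it shows high measured diversity and high tool use but still only marginal accuracy gains over the incentive-only arm, the diversity explanation is weakened.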
Original abstract
Visual agents employ external visual tools within visual chains of thought to incorporate fine-grained evidence. While prior work has mainly studied these tools in visual search tasks, their role in more complex visual reasoning remains underexplored. In this paper, we move beyond simple visual search tasks to investigate more challenging tasks, including 3D spatial reasoning and medical visual question answering, where agents must integrate tool-acquired local evidence with the global context. We identify a tool-use collapse phenomenon: models progressively stop using tools while still achieving higher task accuracy. Moreover, we observe a clear asymmetry: (i) completely eliminating tool use degrades performance, whereas (ii) incentivizing tool use yields only marginal gains despite substantially increasing usage. We find that vanilla training and tool-use encouragement both reduce rollout diversity, explaining why higher tool use does not yield stronger reasoning performance. Motivated by these findings, we add an entropy regularization term to encourage diverse rollout exploration, achieving the best performance despite gradually declining tool usage. Overall, our findings suggest a training-time view of tools as scaffolding, where broader exploration over language generation and visual tool invocation improves reasoning despite tool-use collapse. Project page: https://scaffolded-exploration.github.io
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript examines tool use in visual chain-of-thought agents beyond simple visual search, focusing on complex tasks such as 3D spatial reasoning and medical VQA. It reports a tool-use collapse phenomenon in which models progressively reduce tool usage while task accuracy increases, along with an asymmetry: eliminating tool use degrades performance while incentivizing it produces only marginal gains despite higher usage. The authors attribute the marginal gains to reduced rollout diversity under both vanilla training and tool-use encouragement, and introduce an entropy regularization term to promote diverse exploration, which yields the best performance despite declining tool usage. They conclude that tools function as scaffolding for broader exploration over language and tool invocation.
Significance. If the empirical patterns and the effectiveness of entropy regularization hold under rigorous controls, the work could shift training approaches for visual agents toward emphasizing rollout diversity rather than tool-use frequency, with potential benefits for multimodal reasoning in domains requiring integration of local evidence and global context.
major comments (2)
- [Abstract] The explanatory claim that reduced rollout diversity (induced by vanilla training and tool-use encouragement) accounts for why incentivizing tool use yields only marginal accuracy gains is not supported by any isolation of diversity reduction as the causal mechanism versus alternative policy shifts (e.g., changes in tool timing or internal representations); this untested link is load-bearing for the motivation of the entropy regularization term.
- [Abstract] The reported patterns of tool-use collapse, performance asymmetry, and diversity reduction lack any mention of experimental controls, statistical significance tests, error bars, dataset sizes, or ablation studies, leaving the central empirical claims only moderately supported.
minor comments (1)
- The abstract provides a project page but does not indicate whether code or experimental details will be released to support reproducibility.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback. We address the two major comments on the abstract below and will revise the manuscript to improve clarity and support for the central claims.
- Referee: [Abstract] The explanatory claim that reduced rollout diversity (induced by vanilla training and tool-use encouragement) accounts for why incentivizing tool use yields only marginal accuracy gains is not supported by any isolation of diversity reduction as the causal mechanism versus alternative policy shifts (e.g., changes in tool timing or internal representations); this untested link is load-bearing for the motivation of the entropy regularization term.
  Authors: We agree the abstract phrasing presents diversity reduction as explanatory without an explicit isolation experiment ruling out alternatives such as shifts in tool timing or internal representations. The full manuscript reports ablations showing that both vanilla training and tool-use encouragement reduce measured rollout diversity (via entropy and unique action counts) while entropy regularization restores it and improves accuracy; however, these are correlational. We will revise the abstract to describe the observed associations more precisely and add a short discussion of alternative explanations, making the motivation for entropy regularization rest on the empirical patterns rather than an untested causal claim.
  revision: partial
- Referee: [Abstract] The reported patterns of tool-use collapse, performance asymmetry, and diversity reduction lack any mention of experimental controls, statistical significance tests, error bars, dataset sizes, or ablation studies, leaving the central empirical claims only moderately supported.
  Authors: The abstract is length-constrained and focuses on the key phenomena. The full manuscript specifies dataset sizes for the 3D spatial reasoning and medical VQA benchmarks, includes ablation studies on tool-use incentives and entropy regularization, and reports results with error bars across multiple seeds. We will revise the abstract to briefly reference these elements (e.g., “supported by ablations with error bars across datasets”) so the empirical claims are presented with appropriate context.
  revision: yes
Circularity Check
No significant circularity; findings are empirical observations
The paper's central claims consist of direct empirical observations from training runs (tool-use collapse, asymmetry in tool-use effects, and reduced rollout diversity under vanilla training or encouragement). These are measured outcomes from experiments rather than quantities derived by definition, fitted parameters renamed as predictions, or self-citation chains. No equations, uniqueness theorems, or ansatzes are invoked in the abstract or description that reduce to inputs by construction. The entropy regularization term is added as a motivated intervention based on observed correlations, not forced by the paper's own definitions or prior self-citations. This is self-contained empirical work with no load-bearing circular steps.
Axiom & Free-Parameter Ledger
free parameters (1)
- entropy regularization coefficient
axioms (1)
- domain assumption: Visual chain-of-thought agents can be trained via objectives that treat tool invocation as part of the action space, and rollout diversity can be measured and regularized.