Diversity Over Frequency: Rethinking Tool Use in Visual Chain-of-Thought Agents

Dong-Hee Kim; Donghyun Kim; Reuben Tan

arxiv: 2606.00096 · v2 · pith:HXZE5YRMnew · submitted 2026-05-25 · 💻 cs.CV · cs.AI

Pith reviewed 2026-06-29 23:08 UTC · model grok-4.3

classification 💻 cs.CV cs.AI

keywords: visual chain-of-thought · tool use collapse · rollout diversity · entropy regularization · visual reasoning agents · 3D spatial reasoning · medical VQA

The pith

Visual agents reach higher accuracy by promoting diverse rollouts even as tool use declines during training.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper studies visual chain-of-thought agents that call external tools to gather local evidence on complex tasks such as 3D spatial reasoning and medical visual question answering. It documents a tool-use collapse in which models stop invoking tools yet post rising task accuracy. Standard training and explicit tool-use rewards both shrink the variety of generated trajectories, which accounts for the observed asymmetry that removing tools hurts results while forcing more tool calls produces only small gains. Adding an entropy regularization term restores rollout diversity and delivers the strongest performance despite continued reduction in tool frequency. The work therefore treats tools as temporary scaffolding whose main value lies in supporting broader exploration rather than in their raw frequency.

Core claim

We identify a tool-use collapse phenomenon: models progressively stop using tools while still achieving higher task accuracy. Moreover, we observe a clear asymmetry: (i) completely eliminating tool use degrades performance, whereas (ii) incentivizing tool use yields only marginal gains despite substantially increasing usage. We find that vanilla training and tool-use encouragement both reduce rollout diversity, explaining why higher tool use does not yield stronger reasoning performance. Motivated by these findings, we add an entropy regularization term to encourage diverse rollout exploration, achieving the best performance despite gradually declining tool usage. Overall, our findings suggest a training-time view of tools as scaffolding, where broader exploration over language generation and visual tool invocation improves reasoning despite tool-use collapse.

What carries the argument

Entropy regularization term added during training to increase the diversity of language-generation and tool-invocation sequences in visual chain-of-thought rollouts.
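As a rough illustration of what such a term can look like, the sketch below subtracts a scaled mean entropy from a plain policy-gradient surrogate loss. It is not the authors' implementation: the function names, the per-step advantage inputs, the default coefficient `beta=0.01`, and the choice to treat tool-invocation tokens as ordinary entries in the same vocabulary are all assumptions made for the example.

```python
import math


def softmax(logits):
    """Numerically stable softmax over one list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def entropy(probs):
    """Shannon entropy in nats; zero-probability entries contribute nothing."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def entropy_regularized_loss(step_logits, actions, advantages, beta=0.01):
    """Policy-gradient surrogate loss minus beta times the mean step entropy.

    step_logits: one list of logits per generation step (hypothetical layout)
    actions:     index of the sampled token (text or tool call) at each step
    advantages:  advantage estimate applied at each step
    beta:        entropy coefficient (a free parameter; value assumed here)
    """
    pg_terms, entropies = [], []
    for logits, action, adv in zip(step_logits, actions, advantages):
        probs = softmax(logits)
        # Clamp to avoid log(0) if a sampled token underflows to zero.
        pg_terms.append(-adv * math.log(max(probs[action], 1e-12)))
        entropies.append(entropy(probs))
    pg_loss = sum(pg_terms) / len(pg_terms)
    mean_entropy = sum(entropies) / len(entropies)
    # Higher entropy lowers the loss, so the optimizer is nudged toward
    # keeping the sampling distribution broad.
    return pg_loss - beta * mean_entropy, mean_entropy
```

With `beta=0` this reduces to the unregularized surrogate, which makes the coefficient the single knob that trades reward-seeking against exploration.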

If this is right

Completely disabling tool access lowers accuracy on 3D spatial reasoning and medical visual question answering.
Explicit rewards for tool use raise invocation rates but deliver only small accuracy lifts.
Both standard training and tool-use encouragement shrink the variety of generated trajectories.
Entropy regularization produces the highest accuracy while tool usage continues to fall.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The scaffolding perspective implies that curricula could deliberately phase out tools once rollout diversity has been established.
The same diversity-over-frequency pattern may appear in non-visual agent domains such as text-only reasoning or robotic planning.
Future experiments could measure whether rollout diversity predicts final accuracy better than raw tool count across different model scales.

Load-bearing premise

Reduced rollout diversity caused by vanilla training and tool-use encouragement is the primary reason that higher tool frequency fails to improve reasoning performance.

What would settle it

An ablation that applies tool-use incentives while separately maintaining high rollout diversity and still records only marginal accuracy gains would indicate that diversity is not the decisive factor.

Figures

Figures reproduced from arXiv: 2606.00096 by Dong-Hee Kim, Donghyun Kim, Reuben Tan.

**Figure 1.** Effect of Early Exploration on the Late Training Phase. This figure compares agent behaviors in the early training (top) and late training (bottom) phases. Both vanilla (Left) and tool-encouraged RFT (Middle) suffer from limited early exploration in both text and visual modalities. This lack of rollout diversity restricts models to local optima with marginal gains regardless of tool usage frequency. In con…

**Figure 2.** Analysis of tool usage ratio and performance across training steps. (a) Tool Use Ratio over training, defined as the fraction of rollouts that contain at least one grounding action. (b) Validation accuracy trends. Vanilla RFT (blue) and Tool-Encourage reward (red) exhibit divergent tool usage behaviors despite similar limited accuracy gains. In contrast, adding an entropy regularization term (green) encoura…

**Figure 3.** Analysis of Textual Exploration During Training. We report the distinct n-gram ratio (Li et al., 2016) (n ∈ {3, 4, 5, 6}) to measure token diversity of generated rollouts. Both the Vanilla RFT (Blue) and Tool-Encourage (Red) suffer from diversity degradation, indicating increasingly repetitive reasoning patterns. In contrast, adding an Entropy regularization term (Green) maintains higher textual diversity,…

**Figure 4.** Visual Exploration During Training. Comparison of crop distributions at an early-training checkpoint where tool use still actively explores. While the baseline (a) narrowly focuses on the salient subject, the entropy-regularized model (b) actively explores contextually relevant regions (the stroller) referenced in the query, demonstrating wider exploration. … model explores diverse crop locations in image sp…

**Figure 5.** Tool use ratio over training for OpenThinkIMG (Su et al., 2025) on VQA-RAD (Lau et al., 2018), defined as the fraction of rollouts that invoke at least one tool. … entropy-regularized setting (62.9%). This contrast indicates that the gains from entropy regularization are not purely tool-agnostic, but depend on tool-mediated visual exploration during training. When tools are unavailable, increasing exploratio…
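The two quantities plotted in the captions above, the tool use ratio and the distinct n-gram ratio, are simple to compute once rollouts are logged. The sketch below is one plausible reading of those definitions, not the paper's evaluation code: the rollout layout (a dict with `tokens` and `tool_calls` keys) is hypothetical, and the distinct n-gram ratio here pools n-grams over all rollouts and divides by the total n-gram count, which may differ from the normalization the authors used.

```python
def tool_use_ratio(rollouts):
    """Fraction of rollouts that contain at least one tool call."""
    if not rollouts:
        return 0.0
    with_tool = sum(1 for r in rollouts if r["tool_calls"])
    return with_tool / len(rollouts)


def distinct_ngram_ratio(token_seqs, n):
    """Unique n-grams divided by total n-grams, pooled over all sequences."""
    unique = set()
    total = 0
    for tokens in token_seqs:
        for i in range(len(tokens) - n + 1):
            unique.add(tuple(tokens[i:i + n]))
            total += 1
    return len(unique) / total if total else 0.0


# Toy rollouts, invented purely to exercise the functions.
rollouts = [
    {"tokens": "the chair is left".split(), "tool_calls": ["crop"]},
    {"tokens": "the chair is left".split(), "tool_calls": []},
]
print(tool_use_ratio(rollouts))
print(distinct_ngram_ratio([r["tokens"] for r in rollouts], n=3))
```

Tracking both numbers per training step is what lets the two curves be compared: the first can fall while the second stays high, which is the pattern the entropy-regularized run is reported to show.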

Original abstract

Visual agents employ external visual tools within visual chains of thought to incorporate fine-grained evidence. While prior work has mainly studied these tools in visual search tasks, their role in more complex visual reasoning remains underexplored. In this paper, we move beyond simple visual search tasks to investigate more challenging tasks, including 3D spatial reasoning and medical visual question answering, where agents must integrate tool-acquired local evidence with the global context. We identify a tool-use collapse phenomenon: models progressively stop using tools while still achieving higher task accuracy. Moreover, we observe a clear asymmetry: (i) completely eliminating tool use degrades performance, whereas (ii) incentivizing tool use yields only marginal gains despite substantially increasing usage. We find that vanilla training and tool-use encouragement both reduce rollout diversity, explaining why higher tool use does not yield stronger reasoning performance. Motivated by these findings, we add an entropy regularization term to encourage diverse rollout exploration, achieving the best performance despite gradually declining tool usage. Overall, our findings suggest a training-time view of tools as scaffolding, where broader exploration over language generation and visual tool invocation improves reasoning despite tool-use collapse. Project page: https://scaffolded-exploration.github.io

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

Tool-use collapse on hard visual tasks is a solid observation, but the diversity-reduction story does not yet isolate the mechanism behind marginal gains from tool incentives.

The paper reports that visual CoT agents on 3D spatial reasoning and medical VQA tasks stop calling tools as training proceeds, yet accuracy keeps rising. Removing tool access hurts results, while pushing for more tool use raises usage without much accuracy lift. Both standard training and tool encouragement shrink rollout diversity, and adding entropy regularization produces the strongest performance even as tool calls still decline.

The shift to integrated reasoning tasks beyond simple search is a clear step forward, and the reported asymmetry between tool removal and tool encouragement is a useful empirical marker. Framing tools as temporary scaffolding for exploration rather than a quantity to maximize is a practical way to think about the training dynamics.

The soft spot is the explanatory link. The abstract ties the weak returns from tool encouragement to reduced diversity, but does not isolate diversity from other policy shifts such as changes in internal reasoning or call timing. Without ablations that hold other factors fixed, the motivation for entropy regularization rests on correlation. The stress-test note is accurate on this point. If the full paper contains controls that pin down the mechanism, the claim would land more solidly; from the given text it remains the main gap.

The work is aimed at researchers training visual agents who care about exploration during RL. Readers working on similar CoT or tool-augmented systems will find the patterns worth testing, even if they end up running their own ablations on the diversity angle.

The empirical observations are sharp enough to merit peer review. Referees can press on the causal isolation and check the experimental details that are missing from the abstract.

Referee Report

2 major / 1 minor

Summary. The manuscript examines tool use in visual chain-of-thought agents beyond simple visual search, focusing on complex tasks such as 3D spatial reasoning and medical VQA. It reports a tool-use collapse phenomenon in which models progressively reduce tool usage while task accuracy increases, along with an asymmetry: eliminating tool use degrades performance while incentivizing it produces only marginal gains despite higher usage. The authors attribute the marginal gains to reduced rollout diversity under both vanilla training and tool-use encouragement, and introduce an entropy regularization term to promote diverse exploration, which yields the best performance despite declining tool usage. They conclude that tools function as scaffolding for broader exploration over language and tool invocation.

Significance. If the empirical patterns and the effectiveness of entropy regularization hold under rigorous controls, the work could shift training approaches for visual agents toward emphasizing rollout diversity rather than tool-use frequency, with potential benefits for multimodal reasoning in domains requiring integration of local evidence and global context.

major comments (2)

[Abstract] The explanatory claim that reduced rollout diversity (induced by vanilla training and tool-use encouragement) accounts for why incentivizing tool use yields only marginal accuracy gains is not supported by any isolation of diversity reduction as the causal mechanism versus alternative policy shifts (e.g., changes in tool timing or internal representations); this untested link is load-bearing for the motivation of the entropy regularization term.
[Abstract] The reported patterns of tool-use collapse, performance asymmetry, and diversity reduction lack any mention of experimental controls, statistical significance tests, error bars, dataset sizes, or ablation studies, leaving the central empirical claims only moderately supported.

minor comments (1)

The abstract provides a project page but does not indicate whether code or experimental details will be released to support reproducibility.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We address the two major comments on the abstract below and will revise the manuscript to improve clarity and support for the central claims.

Referee: [Abstract] The explanatory claim that reduced rollout diversity (induced by vanilla training and tool-use encouragement) accounts for why incentivizing tool use yields only marginal accuracy gains is not supported by any isolation of diversity reduction as the causal mechanism versus alternative policy shifts (e.g., changes in tool timing or internal representations); this untested link is load-bearing for the motivation of the entropy regularization term.

Authors: We agree the abstract phrasing presents diversity reduction as explanatory without an explicit isolation experiment ruling out alternatives such as shifts in tool timing or internal representations. The full manuscript reports ablations showing that both vanilla training and tool-use encouragement reduce measured rollout diversity (via entropy and unique action counts) while entropy regularization restores it and improves accuracy; however, these are correlational. We will revise the abstract to describe the observed associations more precisely and add a short discussion of alternative explanations, making the motivation for entropy regularization rest on the empirical patterns rather than an untested causal claim.
Revision: partial

Referee: [Abstract] The reported patterns of tool-use collapse, performance asymmetry, and diversity reduction lack any mention of experimental controls, statistical significance tests, error bars, dataset sizes, or ablation studies, leaving the central empirical claims only moderately supported.

Authors: The abstract is length-constrained and focuses on the key phenomena. The full manuscript specifies dataset sizes for the 3D spatial reasoning and medical VQA benchmarks, includes ablation studies on tool-use incentives and entropy regularization, and reports results with error bars across multiple seeds. We will revise the abstract to briefly reference these elements (e.g., “supported by ablations with error bars across datasets”) so the empirical claims are presented with appropriate context.
Revision: yes
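The simulated rebuttal refers to rollout diversity measured "via entropy and unique action counts". One way such a measurement could be set up is sketched below; it is an assumption-laden illustration, not the manuscript's metric. Each rollout is reduced to its sequence of action labels (the labels `crop`, `zoom`, and `answer` are invented), and the empirical entropy is taken over whole sequences within a batch of rollouts.

```python
import math
from collections import Counter


def action_sequence_stats(action_seqs):
    """Return (unique sequence count, empirical entropy in nats).

    action_seqs: one list of action labels per rollout (hypothetical format).
    Entropy is computed over the empirical distribution of whole sequences,
    so a batch of identical rollouts scores 0.
    """
    counts = Counter(tuple(seq) for seq in action_seqs)
    total = sum(counts.values())
    ent = -sum((c / total) * math.log(c / total) for c in counts.values())
    # max() also normalizes the -0.0 that arises for a single repeated sequence.
    return len(counts), max(0.0, ent)


# Toy batch of four rollouts for one prompt.
batch = [
    ["crop", "answer"],
    ["crop", "answer"],
    ["answer"],
    ["zoom", "answer"],
]
print(action_sequence_stats(batch))
```

A correlational check of the kind the referee asks about would log these two numbers alongside accuracy for each training condition; establishing causation would still need an intervention that changes diversity while holding tool frequency fixed.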

Circularity Check

0 steps flagged

No significant circularity; findings are empirical observations

The paper's central claims consist of direct empirical observations from training runs (tool-use collapse, asymmetry in tool-use effects, and reduced rollout diversity under vanilla training or encouragement). These are measured outcomes from experiments rather than quantities derived by definition, fitted parameters renamed as predictions, or self-citation chains. No equations, uniqueness theorems, or ansatzes are invoked in the abstract or description that reduce to inputs by construction. The entropy regularization term is added as a motivated intervention based on observed correlations, not forced by the paper's own definitions or prior self-citations. This is self-contained empirical work with no load-bearing circular steps.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The work introduces one training modification (entropy regularization) whose coefficient is a free parameter; it rests on standard domain assumptions about agent training but introduces no new physical entities.

free parameters (1)

entropy regularization coefficient
Added to the training objective to encourage diverse rollout exploration; its specific value is not stated in the abstract.

axioms (1)

Domain assumption: visual chain-of-thought agents can be trained via objectives that treat tool invocation as part of the action space, and rollout diversity can be measured and regularized.
Invoked to interpret the collapse phenomenon and to motivate the entropy term.

Reference graph

Works this paper leans on

65 extracted references · 24 canonical work pages · 14 internal anchors

[1] P. Langley. Proceedings of the 17th International Conference on Machine Learning (ICML 2000), 2000.
[2] T. M. Mitchell. The Need for Biases in Learning Generalizations. 1980.
[3] M. J. Kearns.
[4] Machine Learning: An Artificial Intelligence Approach, Vol. I. 1983.
[5] R. O. Duda, P. E. Hart, and D. G. Stork. Pattern Classification. 2000.
[6] Suppressed for Anonymity.
[7] A. Newell and P. S. Rosenbloom. Mechanisms of Skill Acquisition and the Law of Practice. Cognitive Skills and Their Acquisition. 1981.
[8] A. L. Samuel. Some Studies in Machine Learning Using the Game of Checkers. IBM Journal of Research and Development. 1959.
[9] 3DSRBench: A comprehensive 3D spatial reasoning benchmark. Proceedings of the IEEE/CVF International Conference on Computer Vision.
[10] Look Less, Reason More: Rollout-Guided Adaptive Pixel-Space Reasoning. arXiv preprint arXiv:2510.01681.
[11] Chain-of-Focus: Adaptive Visual Search and Zooming for Multimodal Reasoning via RL. arXiv preprint arXiv:2505.15436. (Canonical work page title: Adaptive Chain-of-Focus Reasoning via Dynamic Visual Search and Zooming for Efficient VLMs.)
[12] CropVLM: Learning to Zoom for Fine-Grained Vision-Language Perception. arXiv preprint arXiv:2511.19820.
[13] Failure Modes of Maximum Entropy RLHF. arXiv preprint arXiv:2509.20265.
[14] Reward gaming in conditional text generation. Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers).
[15] Mini-o3: Scaling Up Reasoning Patterns and Interaction Turns for Visual Search. The Fourteenth International Conference on Learning Representations.
[16] VTool-R1: VLMs Learn to Think with Images via Reinforcement Learning on Multimodal Tool Use. arXiv preprint arXiv:2505.19255.
[17] OpenThinkIMG: Learning to Think with Images via Visual Tool Reinforcement Learning. arXiv preprint arXiv:2505.08617.
[18] Tong Wei, Yijun Yang, Junliang Xing, Yuanchun Shi, Zongqing Lu, and Deheng Ye. Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), 2025.
[19] SpatialVLM: Endowing vision-language models with spatial reasoning capabilities. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition.
[20] Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution. arXiv preprint arXiv:2409.12191.
[21] Qwen-VL: A Versatile Vision-Language Model for Understanding, Localization, Text Reading, and Beyond. arXiv preprint arXiv:2308.12966.
[22] DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning. arXiv preprint arXiv:2501.12948.
[23] Understanding the impact of entropy on policy optimization. International Conference on Machine Learning, 2019.
[24] Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor. International Conference on Machine Learning, 2018.
[25] Proximal Policy Optimization Algorithms. arXiv preprint arXiv:1707.06347.
[26] Training language models to follow instructions with human feedback. Advances in Neural Information Processing Systems.
[27] InfographicVQA. Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision.
[28] GQA: A new dataset for real-world visual reasoning and compositional question answering. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition.
[29] RoboSpatial: Teaching spatial understanding to 2D and 3D vision-language models for robotics. Proceedings of the Computer Vision and Pattern Recognition Conference.
[30] RT-2: Vision-Language-Action Models Transfer Web Knowledge to Robotic Control. arXiv preprint arXiv:2307.15818.
[31] Navigation world models. Proceedings of the Computer Vision and Pattern Recognition Conference.
[32] SpatialRGPT: Grounded spatial reasoning in vision-language models. Advances in Neural Information Processing Systems.
[33] OpenAI o1 System Card. arXiv preprint arXiv:2412.16720.
[34] OpenAI. 2025.
[35] DeepEyes: Incentivizing "Thinking with Images" via Reinforcement Learning. The Fourteenth International Conference on Learning Representations.
[36] ChartSketcher: Reasoning with Multimodal Feedback and Reflection for Chart Understanding. arXiv preprint arXiv:2505.19076.
[37] Imagine While Reasoning in Space: Multimodal Visualization-of-Thought. Proceedings of the 42nd International Conference on Machine Learning, 2025.
[38] YiFan Zhang, Huanyu Zhang, Haochen Tian, Chaoyou Fu, Shuangqing Zhang, Junfei Wu, Feng Li, Kun Wang, Qingsong Wen, Zhang Zhang, Liang Wang, and Rong Jin. 2025.
[39] Divide, conquer and combine: A training-free framework for high-resolution image perception in multimodal large language models. Proceedings of the AAAI Conference on Artificial Intelligence.
[40] V*: Guided visual search as a core mechanism in multimodal LLMs. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition.
[41] Yifan Li, Yifan Du, Kun Zhou, Jinpeng Wang, Xin Zhao, and Ji-Rong Wen. Evaluating Object Hallucination in Large Vision-Language Models. Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, 2023. doi:10.18653/v1/2023.emnlp-main.20
[42] Lei Li, Yuqi Wang, Runxin Xu, Peiyi Wang, Xiachong Feng, Lingpeng Kong, and Qi Liu. Multimodal ArXiv: A Dataset for Improving Scientific Comprehension of Large Vision-Language Models. Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2024. doi:10.18653/v1/2024.acl-long.775
[43] ChartGemma: Visual instruction-tuning for chart reasoning in the wild. Proceedings of the 31st International Conference on Computational Linguistics: Industry Track.
[44] Cambrian-1: A fully open, vision-centric exploration of multimodal LLMs. Advances in Neural Information Processing Systems.
[45] Pixel Reasoner: Incentivizing Pixel Space Reasoning via Curiosity-Driven Reinforcement Learning. The Thirty-ninth Annual Conference on Neural Information Processing Systems.
[46] Qiying Yu, Zheng Zhang, Ruofei Zhu, Yufeng Yuan, Xiaochen Zuo, YuYue, Weinan Dai, Tiantian Fan, Gaohong Liu, Juncai Liu, LingJun Liu, Xin Liu, Haibin Lin, Zhiqi Lin, Bole Ma, Guangming Sheng, Yuxuan Tong, Chi Zhang, Mofan Zhang, Ru Zhang, Wang Zhang, Hang Zhu, Jinhua Zhu, Jiaze Chen, ... 2025.
[47] ReFocus: Visual Editing as a Chain of Thought for Structured Image Understanding. Forty-second International Conference on Machine Learning.
[48] Reinforced visual perception with tools. arXiv preprint arXiv:2509.01656.
[49] SpatialReasoner: Towards Explicit and Generalizable 3D Spatial Reasoning. The Thirty-ninth Annual Conference on Neural Information Processing Systems.
[50] Qwen2.5-VL Technical Report. arXiv preprint arXiv:2502.13923.
[51] A diversity-promoting objective function for neural conversation models. Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, 2016.
[52] Arijit Ray, Jiafei Duan, Ellis L Brown II, Reuben Tan, Dina Bashkirova, Rose Hendrix, Kiana Ehsani, Aniruddha Kembhavi, Bryan A. Plummer, Ranjay Krishna, Kuo-Hao Zeng, and Kate Saenko. 2025.
[53] Visual CoT Makes VLMs Smarter but More Fragile. arXiv preprint arXiv:2509.23789.
[54] Your Group-Relative Advantage Is Biased. arXiv preprint arXiv:2601.08521.
[55] Revisiting Entropy in Reinforcement Learning for Large Reasoning Models. arXiv preprint arXiv:2511.05993.
[56] Rediscovering entropy regularization: Adaptive coefficient unlocks its potential for LLM reinforcement learning. arXiv preprint arXiv:2510.10959. (Canonical work page title: Revisiting Entropy Regularization: Adaptive Coefficient Unlocks Its Potential for LLM Reinforcement Learning.)
[57] Visual CoT: Advancing multi-modal language models with a comprehensive dataset and benchmark for chain-of-thought reasoning. Advances in Neural Information Processing Systems.
[58] DocVQA: A dataset for VQA on document images. Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision.
[59] The Open Images Dataset V4: Unified image classification, object detection, and visual relationship detection at scale. International Journal of Computer Vision, 2020.
[60] Learning transferable visual models from natural language supervision. International Conference on Machine Learning, 2021.
[61] Visual Sketchpad: Sketching as a visual chain of thought for multimodal language models. Advances in Neural Information Processing Systems.
[62] Simple o3: Towards interleaved vision-language reasoning. arXiv preprint arXiv:2508.12109.
[63] Active-O3: Empowering Multimodal Large Language Models with Active Perception via GRPO. arXiv preprint arXiv:2505.21457. (Canonical work page title: ACTIVE-o3: Empowering MLLMs with Active Perception via Pure Reinforcement Learning.)
[64] Revisiting the Necessity of Lengthy Chain-of-Thought in Vision-centric Reasoning Generalization. arXiv preprint arXiv:2511.22586.
[65] A dataset of clinically generated visual questions and answers about radiology images. Scientific Data, 2018.


[16] [16]

arXiv preprint arXiv:2505.19255 , year=

VTool-R1: VLMs Learn to Think with Images via Reinforcement Learning on Multimodal Tool Use , author=. arXiv preprint arXiv:2505.19255 , year=

work page arXiv

[17] [17]

OpenThinkIMG: Learning to Think with Images via Visual Tool Reinforcement Learning

Openthinkimg: Learning to think with images via visual tool reinforcement learning , author=. arXiv preprint arXiv:2505.08617 , year=

work page internal anchor Pith review Pith/arXiv arXiv

[18] [18]

Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV) , month =

Wei, Tong and Yang, Yijun and Xing, Junliang and Shi, Yuanchun and Lu, Zongqing and Ye, Deheng , title =. Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV) , month =. 2025 , pages =

2025

[19] [19]

Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition , pages=

Spatialvlm: Endowing vision-language models with spatial reasoning capabilities , author=. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition , pages=

[20] [20]

Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution

Qwen2-vl: Enhancing vision-language model's perception of the world at any resolution , author=. arXiv preprint arXiv:2409.12191 , year=

work page internal anchor Pith review Pith/arXiv arXiv

[21] [21]

Qwen-VL: A Versatile Vision-Language Model for Understanding, Localization, Text Reading, and Beyond

Qwen-VL: A Versatile Vision-Language Model for Understanding, Localization, Text Reading, and Beyond , author=. arXiv preprint arXiv:2308.12966 , year=

work page internal anchor Pith review Pith/arXiv arXiv

[22] [22]

DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning

Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning , author=. arXiv preprint arXiv:2501.12948 , year=

work page internal anchor Pith review Pith/arXiv arXiv

[23] [23]

International conference on machine learning , pages=

Understanding the impact of entropy on policy optimization , author=. International conference on machine learning , pages=. 2019 , organization=

2019

[24] [24]

International conference on machine learning , pages=

Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor , author=. International conference on machine learning , pages=. 2018 , organization=

2018

[25] [25]

Proximal Policy Optimization Algorithms

Proximal policy optimization algorithms , author=. arXiv preprint arXiv:1707.06347 , year=

work page internal anchor Pith review Pith/arXiv arXiv

[26] [26]

Advances in neural information processing systems , volume=

Training language models to follow instructions with human feedback , author=. Advances in neural information processing systems , volume=

[27] [27]

Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision , pages=

Infographicvqa , author=. Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision , pages=

[28] [28]

Proceedings of the IEEE/CVF conference on computer vision and pattern recognition , pages=

Gqa: A new dataset for real-world visual reasoning and compositional question answering , author=. Proceedings of the IEEE/CVF conference on computer vision and pattern recognition , pages=

[29] [29]

Proceedings of the Computer Vision and Pattern Recognition Conference , pages=

Robospatial: Teaching spatial understanding to 2d and 3d vision-language models for robotics , author=. Proceedings of the Computer Vision and Pattern Recognition Conference , pages=

[30] [30]

RT-2: Vision-Language-Action Models Transfer Web Knowledge to Robotic Control

RT-2: Vision-Language-Action Models Transfer Web Knowledge to Robotic Control , author=. arXiv preprint arXiv:2307.15818 , year=

work page internal anchor Pith review Pith/arXiv arXiv

[31] [31]

Proceedings of the Computer Vision and Pattern Recognition Conference , pages=

Navigation world models , author=. Proceedings of the Computer Vision and Pattern Recognition Conference , pages=

[32] [32]

Advances in Neural Information Processing Systems , volume=

Spatialrgpt: Grounded spatial reasoning in vision-language models , author=. Advances in Neural Information Processing Systems , volume=

[33] [33]

OpenAI o1 System Card

Openai o1 system card , author=. arXiv preprint arXiv:2412.16720 , year=

work page internal anchor Pith review Pith/arXiv arXiv

[34] [34]

2025 , month =

OpenAI , title =. 2025 , month =

2025

[35] [35]

The Fourteenth International Conference on Learning Representations , year=

DeepEyes: Incentivizing ``Thinking with Images'' via Reinforcement Learning , author=. The Fourteenth International Conference on Learning Representations , year=

[36] [36]

arXiv preprint arXiv:2505.19076 , year=

ChartSketcher: Reasoning with Multimodal Feedback and Reflection for Chart Understanding , author=. arXiv preprint arXiv:2505.19076 , year=

work page arXiv

[37] [37]

Proceedings of the 42nd International Conference on Machine Learning , pages =

Imagine While Reasoning in Space: Multimodal Visualization-of-Thought , author =. Proceedings of the 42nd International Conference on Machine Learning , pages =. 2025 , editor =

2025

[38] [38]

2025 , url=

YiFan Zhang and Huanyu Zhang and Haochen Tian and Chaoyou Fu and Shuangqing Zhang and Junfei Wu and Feng Li and Kun Wang and Qingsong Wen and Zhang Zhang and Liang Wang and Rong Jin , booktitle=. 2025 , url=

2025

[39] [39]

Proceedings of the AAAI Conference on Artificial Intelligence , volume=

Divide, conquer and combine: A training-free framework for high-resolution image perception in multimodal large language models , author=. Proceedings of the AAAI Conference on Artificial Intelligence , volume=

[40] [40]

Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition , pages=

V?: Guided visual search as a core mechanism in multimodal llms , author=. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition , pages=

[41] [41]

Evaluating Object Hallucination in Large Vision-Language Models

Li, Yifan and Du, Yifan and Zhou, Kun and Wang, Jinpeng and Zhao, Xin and Wen, Ji-Rong. Evaluating Object Hallucination in Large Vision-Language Models. Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing. 2023. doi:10.18653/v1/2023.emnlp-main.20

work page doi:10.18653/v1/2023.emnlp-main.20 2023

[42] [42]

Multimodal A r X iv: A Dataset for Improving Scientific Comprehension of Large Vision-Language Models

Li, Lei and Wang, Yuqi and Xu, Runxin and Wang, Peiyi and Feng, Xiachong and Kong, Lingpeng and Liu, Qi. Multimodal A r X iv: A Dataset for Improving Scientific Comprehension of Large Vision-Language Models. Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 2024. doi:10.18653/v1/2024.acl-long.775

work page doi:10.18653/v1/2024.acl-long.775 2024

[43] [43]

Proceedings of the 31st International Conference on Computational Linguistics: Industry Track , pages=

Chartgemma: Visual instruction-tuning for chart reasoning in the wild , author=. Proceedings of the 31st International Conference on Computational Linguistics: Industry Track , pages=

[44] [44]

Advances in Neural Information Processing Systems , volume=

Cambrian-1: A fully open, vision-centric exploration of multimodal llms , author=. Advances in Neural Information Processing Systems , volume=

[45] [45]

The Thirty-ninth Annual Conference on Neural Information Processing Systems , year=

Pixel Reasoner: Incentivizing Pixel Space Reasoning via Curiosity-Driven Reinforcement Learning , author=. The Thirty-ninth Annual Conference on Neural Information Processing Systems , year=

[46] [46]

2025 , url=

Qiying Yu and Zheng Zhang and Ruofei Zhu and Yufeng Yuan and Xiaochen Zuo and YuYue and Weinan Dai and Tiantian Fan and Gaohong Liu and Juncai Liu and LingJun Liu and Xin Liu and Haibin Lin and Zhiqi Lin and Bole Ma and Guangming Sheng and Yuxuan Tong and Chi Zhang and Mofan Zhang and Ru Zhang and Wang Zhang and Hang Zhu and Jinhua Zhu and Jiaze Chen and ...

2025

[47] [47]

Forty-second International Conference on Machine Learning , year=

ReFocus: Visual Editing as a Chain of Thought for Structured Image Understanding , author=. Forty-second International Conference on Machine Learning , year=

[48] [48]

arXiv preprint arXiv:2509.01656 , year=

Reinforced visual perception with tools , author=. arXiv preprint arXiv:2509.01656 , year=

work page arXiv

[49] [49]

The Thirty-ninth Annual Conference on Neural Information Processing Systems , year=

SpatialReasoner: Towards Explicit and Generalizable 3D Spatial Reasoning , author=. The Thirty-ninth Annual Conference on Neural Information Processing Systems , year=

[50] [50]

Qwen2.5-VL Technical Report

Qwen2. 5-vl technical report , author=. arXiv preprint arXiv:2502.13923 , year=

work page internal anchor Pith review Pith/arXiv arXiv

[51] [51]

Proceedings of the 2016 conference of the North American chapter of the association for computational linguistics: human language technologies , pages=

A diversity-promoting objective function for neural conversation models , author=. Proceedings of the 2016 conference of the North American chapter of the association for computational linguistics: human language technologies , pages=

2016

[52] [52]

Plummer and Ranjay Krishna and Kuo-Hao Zeng and Kate Saenko , booktitle=

Arijit Ray and Jiafei Duan and Ellis L Brown II and Reuben Tan and Dina Bashkirova and Rose Hendrix and Kiana Ehsani and Aniruddha Kembhavi and Bryan A. Plummer and Ranjay Krishna and Kuo-Hao Zeng and Kate Saenko , booktitle=. 2025 , url=

2025

[53] [53]

arXiv preprint arXiv:2509.23789 , year=

Visual CoT Makes VLMs Smarter but More Fragile , author=. arXiv preprint arXiv:2509.23789 , year=

work page arXiv

[54] [54]

Your group-relative advantage is biased.arXiv preprint arXiv:2601.08521,

Your Group-Relative Advantage Is Biased , author=. arXiv preprint arXiv:2601.08521 , year=

work page arXiv

[55] [55]

Revisiting Entropy in Reinforcement Learning for Large Reasoning Models

Revisiting Entropy in Reinforcement Learning for Large Reasoning Models , author=. arXiv preprint arXiv:2511.05993 , year=

work page internal anchor Pith review Pith/arXiv arXiv

[56] [56]

Revisiting Entropy Regularization: Adaptive Coefficient Unlocks Its Potential for LLM Reinforcement Learning

Rediscovering entropy regularization: Adaptive coefficient unlocks its potential for llm reinforcement learning , author=. arXiv preprint arXiv:2510.10959 , year=

work page internal anchor Pith review Pith/arXiv arXiv

[57] [57]

Advances in Neural Information Processing Systems , volume=

Visual cot: Advancing multi-modal language models with a comprehensive dataset and benchmark for chain-of-thought reasoning , author=. Advances in Neural Information Processing Systems , volume=

[58] [58]

Proceedings of the IEEE/CVF winter conference on applications of computer vision , pages=

Docvqa: A dataset for vqa on document images , author=. Proceedings of the IEEE/CVF winter conference on applications of computer vision , pages=

[59] [59]

International journal of computer vision , volume=

The open images dataset v4: Unified image classification, object detection, and visual relationship detection at scale , author=. International journal of computer vision , volume=. 2020 , publisher=

2020

[60] [60]

International conference on machine learning , pages=

Learning transferable visual models from natural language supervision , author=. International conference on machine learning , pages=. 2021 , organization=

2021

[61] [61]

Advances in Neural Information Processing Systems , volume=

Visual sketchpad: Sketching as a visual chain of thought for multimodal language models , author=. Advances in Neural Information Processing Systems , volume=

[62] [62]

arXiv preprint arXiv:2508.12109 , year=

Simple o3: Towards interleaved vision-language reasoning , author=. arXiv preprint arXiv:2508.12109 , year=

work page arXiv

[63] [63]

ACTIVE-o3: Empowering MLLMs with Active Perception via Pure Reinforcement Learning

Active-O3: Empowering Multimodal Large Language Models with Active Perception via GRPO , author=. arXiv preprint arXiv:2505.21457 , year=

work page internal anchor Pith review Pith/arXiv arXiv

[64] [64]

arXiv preprint arXiv:2511.22586 , year=

Revisiting the Necessity of Lengthy Chain-of-Thought in Vision-centric Reasoning Generalization , author=. arXiv preprint arXiv:2511.22586 , year=

work page arXiv

[65] [65]

Scientific data , volume=

A dataset of clinically generated visual questions and answers about radiology images , author=. Scientific data , volume=. 2018 , publisher=

2018