Are Tools Always Beneficial? Learning to Invoke Tools Adaptively for Dual-Mode Multimodal LLM Reasoning

Jian Zhang; Lei Bai; Qinghe Ma; Yiming Wu; Yinghuan Shi; Zhen Zhao

arxiv: 2605.19852 · v1 · pith:NDAVUMZMnew · submitted 2026-05-19 · 💻 cs.CL

Are Tools Always Beneficial? Learning to Invoke Tools Adaptively for Dual-Mode Multimodal LLM Reasoning

Qinghe Ma , Zhen Zhao , Yiming Wu , Jian Zhang , Lei Bai , Yinghuan Shi This is my paper

Pith reviewed 2026-05-20 06:12 UTC · model grok-4.3

classification 💻 cs.CL

keywords adaptive tool invocationmultimodal LLM reasoningreinforcement learningdual-mode reasoningtool-augmented modelsV* benchmarkPOPE benchmarkefficiency in reasoning

0 comments

The pith

Multimodal LLMs gain accuracy and speed by learning to invoke tools only when they help rather than always or never.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that tool use in multimodal models is not universally helpful because unnecessary calls add overhead and can distort answers. AutoTool addresses this by training the model in reinforcement learning to choose between a tool-assisted mode and a text-only mode for each query. Mode-specific reward functions steer each path toward correct outputs while joint exploration during training keeps the model from favoring one mode too early. Later free exploration further refines the adaptive policy. The result is higher accuracy on visual benchmarks and markedly lower reasoning cost compared with methods that apply tools by default.

Core claim

Within a reinforcement learning framework, AutoTool maintains an explicit dual-mode reasoning strategy in which the model jointly explores tool-assisted and text-centric paths, guided by mode-specific reward functions, so that it invokes tools only for queries where they improve accuracy and otherwise relies on internal reasoning to avoid unnecessary overhead.

What carries the argument

Dual-mode reasoning strategy inside reinforcement learning, using mode-specific reward functions plus phased joint exploration to decide per query whether to invoke tools.

If this is right

Models produce correct answers at lower computational cost by skipping tools on queries that do not need them.
Joint exploration during training keeps both reasoning modes viable rather than letting one dominate early.
Accuracy rises on visual reasoning tasks such as V* because the model avoids tool-induced errors.
Efficiency improves on benchmarks like POPE because redundant tool calls are eliminated.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same adaptive logic could be applied to other tool-using language models beyond the multimodal setting.
Over-provisioning tools in model design may systematically hurt performance on simpler queries.
Testing the method on larger base models would reveal whether the efficiency and accuracy gains scale.
The approach suggests future tool-augmented systems should include explicit training for when to refuse a tool.

Load-bearing premise

The reinforcement learning setup with mode-specific rewards and joint exploration will reliably produce stable adaptive switching instead of locking the model into one fixed mode.

What would settle it

Retraining the model and observing that it converges to using the same mode for nearly all queries while accuracy or efficiency falls below the reported gains on V* and POPE would show the adaptive mechanism failed.

Figures

Figures reproduced from arXiv: 2605.19852 by Jian Zhang, Lei Bai, Qinghe Ma, Yiming Wu, Yinghuan Shi, Zhen Zhao.

**Figure 1.** Figure 1: (a, b) Representative queries that do or do not trigger the zoom-in tool, illustrating that tool usage is not always necessary, while AutoTool adaptively invokes tools when beneficial. (c, d) Comparison of the proportion of tool-augmented reasoning trajectories during training, as well as the training and inference time costs between our AutoTool and SOTA DeepEyes (Zheng et al., 2025). 20.3% more than adap… view at source ↗

**Figure 2.** Figure 2: Illustration of the AutoTool training framework. Given a multimodal problem, the policy model first decides whether the subsequent reasoning process requires tool invocation. For each batch of generated reasoning trajectories, different reward functions are applied to evaluate the trajectories under distinct reasoning modes via the Mode-Specific Policy Optimization (MSPO), and the tool invocation reward co… view at source ↗

**Figure 3.** Figure 3: The outer ring shows the proportion of the dual reasoning modes on three datasets, while the inner ring presents their distribution across different splits within each dataset. ward hacking. For example, when λ base tool = 0.0, the model frequently selects the <tool on> mode while performing pure text reasoning without valid tool calls, exploiting Ron tool = 0 while benefiting from a larger λ off tool. I… view at source ↗

**Figure 4.** Figure 4: Detailed training-phase analysis. (a) Dual reasoning trajectories, general reasoning data, and overall average reward curves. (b) Average number of tool invocations under the <tool on> mode. (c) Response length variations throughout training. The shaded regions denote the standard deviation across multiple runs. For each batch of training data, we add tool-invocation prompts to the samples from the V* (whe… view at source ↗

**Figure 5.** Figure 5: Accuracy comparison between both reasoning modes across three benchmarks. F. Comparison with other Baselines We further compare AutoTool with prompt-based baselines on Qwen2.5-VL-7B (Bai et al., 2025), where reasoning modes are controlled solely through prompt design. We additionally include LlamaV-o1 (Thawakar et al., 2025) for comparison, which is built upon Llama-3.2-11B-Vision-Instruct (Meta, 2024) and… view at source ↗

**Figure 6.** Figure 6: Test accuracy across different training steps on HRBench-4K, HRBench-8K, and V*. M. Extension to the Multi-tool Setting Although the main experiments focus on a zoom-in tool for clarity and controlled analysis, our method is not restricted to a single tool type. To evaluate its generality, we further study our method in a multi-tool setting based on DeepEyesV2 (Hong et al., 2025), a recently proposed fram… view at source ↗

**Figure 7.** Figure 7: The outer ring shows the proportion of the dual reasoning modes on two datasets, while the inner ring presents their distribution across different splits within each dataset. The left two plots correspond to DeepEyesV2, and the right two plots correspond to DeepEyesV2 integrated with AMB and MSPO. This SFT-then-RL training paradigm shares a closely aligned objective with our AMB: both aim to first establis… view at source ↗

**Figure 8.** Figure 8: Qualitative examples of perception benchmark generated by AutoTool. 23 [PITH_FULL_IMAGE:figures/full_fig_p023_8.png] view at source ↗

**Figure 9.** Figure 9: Qualitative examples of hallucination benchmark generated by AutoTool. 24 [PITH_FULL_IMAGE:figures/full_fig_p024_9.png] view at source ↗

**Figure 10.** Figure 10: Qualitative examples of grounding and reasoning benchmark generated by AutoTool. 25 [PITH_FULL_IMAGE:figures/full_fig_p025_10.png] view at source ↗

**Figure 11.** Figure 11: Failure cases. Orange boxes denote the ground-truth regions of interest that the model should attend to, while the red boxes show the regions actually selected for zoom-in. 26 [PITH_FULL_IMAGE:figures/full_fig_p026_11.png] view at source ↗

read the original abstract

Tool-augmented reasoning has emerged as a promising direction for enhancing the reasoning capabilities of multimodal large language models (MLLMs). However, existing studies mainly focus on enabling models to perform tool invocation, while neglecting the necessity of invoking tools. We argue that tool usage is not always beneficial, as redundant or inappropriate invocations largely increase reasoning overhead and even mislead model predictions. To address this issue, we introduce AutoTool, a model that adaptively decides whether to invoke tools according to the characteristics of each query. Within a reinforcement learning framework, we design an explicit dual-mode reasoning strategy with mode-specific reward functions to guide the model toward producing accurate responses. Moreover, to prevent premature bias toward a single reasoning mode, AutoTool jointly explores and balances tool-assisted and text-centric reasoning throughout training, and promotes free exploration in later stages. Extensive experiments demonstrate that AutoTool exhibits outstanding performance and high efficiency, yielding a 21.8\% accuracy gain on V* benchmark compared to the base model, and a 44.9\% improvement in efficiency over existing tool-augmented methods on POPE benchmark. Code is available at https://github.com/MQinghe/AutoTool.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

AutoTool trains MLLMs to pick tool use per query via dual-mode RL and reports accuracy plus efficiency gains, but the adaptivity claim rests on thin evidence.

read the letter

The main thing to know is that this paper trains a multimodal model to decide on the fly whether a given query needs external tools or can be handled with text-only reasoning, and the results show better accuracy than the base model plus lower overhead than methods that always call tools. The dual-mode RL setup with separate rewards for each mode and a late free-exploration phase is the concrete addition over earlier work that either forces tool calls or trains only for invocation when prompted. The numbers on V* and POPE are straightforward to check and do show the claimed lifts in performance and speed, which matters for anyone trying to run these models without wasting compute on unnecessary tool steps. That part is useful and worth noting. The soft spot is exactly the one the stress test flags. Nothing in the abstract or setup guarantees that the policy learns query-dependent choices rather than settling on whichever mode wins on average across the training distribution. Mode-specific rewards and joint exploration help in principle, but without reported statistics on how often each mode is actually selected on held-out queries or ablations that remove the balancing steps, the gains could come from a non-adaptive policy that simply matches the better fixed behavior. The paper would be tighter if it included those checks. This is aimed at people building practical tool-augmented MLLMs who care about deployment cost as much as raw accuracy. It deserves a serious referee because the motivation is real and the training approach is testable, even if the current experiments leave the adaptivity question open. I would send it out for review and ask the authors to add mode-usage breakdowns and the relevant ablations.

Referee Report

2 major / 2 minor

Summary. The paper proposes AutoTool, a multimodal LLM that adaptively decides whether to invoke tools or use text-centric reasoning for each query. It employs a reinforcement learning framework with dual-mode reasoning, mode-specific reward functions, joint exploration of both modes during training, and a late-stage free-exploration phase to avoid premature bias toward one mode. The central empirical claims are a 21.8% accuracy gain on the V* benchmark relative to the base model and a 44.9% efficiency improvement over prior tool-augmented methods on the POPE benchmark.

Significance. If the adaptive, query-dependent behavior is verifiably achieved rather than a collapse to a fixed policy, the approach could meaningfully improve the efficiency-accuracy tradeoff in tool-augmented MLLMs by avoiding redundant or misleading tool calls. The reported benchmark gains indicate practical promise, but the result's significance hinges on confirming that the dual-mode RL training produces stable, non-collapsed mode selection.

major comments (2)

[Reinforcement Learning Framework] The RL framework (as described in the methods) relies on mode-specific rewards plus joint exploration and late-stage free exploration to prevent bias, yet provides no explicit anti-collapse term (e.g., entropy bonus on mode choice) and no reported mode-selection frequency statistics during or after training. This is load-bearing for the central claim of adaptive invocation: if average reward favors one mode across the training distribution, policy gradients can still drive convergence to a non-adaptive policy, which could account for the observed gains without true query-dependent adaptation.
[§4 (Experimental Results)] §4 (Experimental Results): the manuscript reports clear benchmark gains but omits full details on data splits, complete ablation studies isolating the adaptive mechanism, and verification that improvements stem from dual-mode adaptivity rather than other training choices. Without these, it is not possible to confirm that the 21.8% V* accuracy gain and 44.9% POPE efficiency gain are attributable to the claimed adaptive behavior.

minor comments (2)

[Abstract] The abstract states a 44.9% efficiency improvement on POPE but does not specify the precise metric (e.g., average tool calls per query, latency, or FLOPs) or list the exact baselines; adding this clarification would aid interpretation.
[Methods] Notation for the two reasoning modes (tool-assisted vs. text-centric) and the mode-selection policy should be introduced more explicitly with consistent symbols to improve readability of the RL formulation.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments on our manuscript. We address the major concerns regarding the reinforcement learning framework and the experimental details below. We will make revisions to provide additional evidence and clarifications as outlined.

read point-by-point responses

Referee: [Reinforcement Learning Framework] The RL framework (as described in the methods) relies on mode-specific rewards plus joint exploration and late-stage free exploration to prevent bias, yet provides no explicit anti-collapse term (e.g., entropy bonus on mode choice) and no reported mode-selection frequency statistics during or after training. This is load-bearing for the central claim of adaptive invocation: if average reward favors one mode across the training distribution, policy gradients can still drive convergence to a non-adaptive policy, which could account for the observed gains without true query-dependent adaptation.

Authors: We agree that providing mode-selection frequency statistics would offer direct verification of the adaptive, query-dependent behavior. In the revised version, we will report the frequency of tool-assisted versus text-centric mode selections on both training and test sets to demonstrate that the policy does not collapse. While we did not include an explicit entropy bonus on mode choice, the training procedure incorporates joint exploration of both modes throughout training with mode-specific rewards, followed by a late-stage free-exploration phase. This design encourages the model to sample from both modes proportionally to their rewards without a priori bias toward one. We will clarify this rationale in the methods section and include the statistics to empirically support the non-collapse. revision: yes
Referee: [§4 (Experimental Results)] §4 (Experimental Results): the manuscript reports clear benchmark gains but omits full details on data splits, complete ablation studies isolating the adaptive mechanism, and verification that improvements stem from dual-mode adaptivity rather than other training choices. Without these, it is not possible to confirm that the 21.8% V* accuracy gain and 44.9% POPE efficiency gain are attributable to the claimed adaptive behavior.

Authors: We will revise §4 to include complete details on the data splits used for training and evaluation. Furthermore, we will add ablation studies that specifically isolate the adaptive mechanism, including comparisons to single-mode variants and ablations removing the free-exploration phase. These will help attribute the observed gains (21.8% on V* and 44.9% efficiency on POPE) to the dual-mode adaptive training rather than other factors. We believe the current results are consistent with adaptivity, but the additional experiments and details will strengthen this claim. revision: yes

Circularity Check

0 steps flagged

No circularity: empirical RL training with mode-specific rewards produces reported benchmark gains

full rationale

The paper introduces AutoTool via a standard reinforcement learning setup that includes dual-mode reasoning, mode-specific reward functions, joint exploration of tool-assisted and text-centric paths, and a late-stage free-exploration phase. All central claims—21.8% accuracy improvement on V* and 44.9% efficiency gain on POPE—are presented as outcomes of this training process evaluated on held-out benchmarks, not as predictions derived from equations or parameters that are defined in terms of the same quantities. No self-definitional steps, fitted-input-as-prediction reductions, or load-bearing self-citations appear in the described framework; the adaptive policy is an emergent result of the RL objective rather than presupposed by construction. The derivation chain therefore remains self-contained and externally falsifiable through the reported experiments.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The approach relies on standard RL assumptions about reward design and exploration; no new free parameters or invented entities are introduced beyond the usual hyperparameters of the RL framework.

axioms (1)

domain assumption Mode-specific reward functions can be designed to guide accurate responses in each reasoning mode without introducing bias toward one mode.
Invoked in the description of the dual-mode reasoning strategy.

pith-pipeline@v0.9.0 · 5753 in / 1174 out tokens · 26869 ms · 2026-05-20T06:12:00.487050+00:00 · methodology

discussion (0)

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean washburn_uniqueness_aczel unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

We design Mode-Specific Policy Optimization (MSPO), with distinct optimization objectives to different reasoning modes... Adaptive Mode Balancing (AMB) strategy that dynamically adjusts the reward coefficients to control the frequency of the two modes
IndisputableMonolith/Foundation/ArithmeticFromLogic.lean LogicNat induction and orbit embedding unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

to prevent premature bias toward a single reasoning mode, AutoTool jointly explores and balances tool-assisted and text-centric reasoning throughout training

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

98 extracted references · 98 canonical work pages · 27 internal anchors

[1]

Langley , title =

P. Langley , title =. Proceedings of the 17th International Conference on Machine Learning (ICML 2000) , address =. 2000 , pages =

work page 2000
[2]

T. M. Mitchell. The Need for Biases in Learning Generalizations. 1980

work page 1980
[3]

M. J. Kearns , title =

work page
[4]

Machine Learning: An Artificial Intelligence Approach, Vol. I. 1983

work page 1983
[5]

R. O. Duda and P. E. Hart and D. G. Stork. Pattern Classification. 2000

work page 2000
[6]

Suppressed for Anonymity , author=

work page
[7]

Newell and P

A. Newell and P. S. Rosenbloom. Mechanisms of Skill Acquisition and the Law of Practice. Cognitive Skills and Their Acquisition. 1981

work page 1981
[8]

A. L. Samuel. Some Studies in Machine Learning Using the Game of Checkers. IBM Journal of Research and Development. 1959

work page 1959
[9]

arXiv preprint arXiv:2503.10639 , year=

Got: Unleashing reasoning capability of multimodal large language model for visual generation and editing , author=. arXiv preprint arXiv:2503.10639 , year=

work page arXiv
[10]

Advances in Neural Information Processing Systems , volume=

Chain-of-thought prompting elicits reasoning in large language models , author=. Advances in Neural Information Processing Systems , volume=

work page
[11]

Advances in Neural Information Processing Systems , volume=

Large language models are zero-shot reasoners , author=. Advances in Neural Information Processing Systems , volume=

work page
[12]

Gemini: A Family of Highly Capable Multimodal Models

Gemini: a family of highly capable multimodal models , author=. arXiv preprint arXiv:2312.11805 , year=

work page internal anchor Pith review Pith/arXiv arXiv
[13]

Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution

Qwen2-vl: Enhancing vision-language model's perception of the world at any resolution , author=. arXiv preprint arXiv:2409.12191 , year=

work page internal anchor Pith review Pith/arXiv arXiv
[14]

Advances in Neural Information Processing Systems , volume=

Visual instruction tuning , author=. Advances in Neural Information Processing Systems , volume=

work page
[15]

MathVista: Evaluating Mathematical Reasoning of Foundation Models in Visual Contexts

Mathvista: Evaluating mathematical reasoning of foundation models in visual contexts , author=. arXiv preprint arXiv:2310.02255 , year=

work page internal anchor Pith review Pith/arXiv arXiv
[16]

arXiv preprint arXiv:2312.11370 , year=

G-llava: Solving geometric problem with multi-modal large language model , author=. arXiv preprint arXiv:2312.11370 , year=

work page arXiv
[17]

arXiv preprint arXiv:2407.08739 , year=

Mavis: Mathematical visual instruction tuning with an automatic data engine , author=. arXiv preprint arXiv:2407.08739 , year=

work page arXiv
[18]

The Twelfth International Conference on Learning Representations , year=

Grounding multimodal large language models to the world , author=. The Twelfth International Conference on Learning Representations , year=

work page
[19]

Proceedings of the IEEE International Conference on Computer Vision , pages=

Vqa: Visual question answering , author=. Proceedings of the IEEE International Conference on Computer Vision , pages=

work page
[20]

Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) , pages=

Conceptual captions: A cleaned, hypernymed, image alt-text dataset for automatic image captioning , author=. Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) , pages=

work page
[21]

DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning

Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning , author=. arXiv preprint arXiv:2501.12948 , year=

work page internal anchor Pith review Pith/arXiv arXiv
[22]

arXiv preprint arXiv:2502.09621 , year=

Mme-cot: Benchmarking chain-of-thought in large multimodal models for reasoning quality, robustness, and efficiency , author=. arXiv preprint arXiv:2502.09621 , year=

work page arXiv
[23]

OpenAI o1 System Card

Openai o1 system card , author=. arXiv preprint arXiv:2412.16720 , year=

work page internal anchor Pith review Pith/arXiv arXiv
[24]

European Conference on Computer Vision , pages=

Modeling context in referring expressions , author=. European Conference on Computer Vision , pages=. 2016 , organization=

work page 2016
[25]

Qwen2.5-VL Technical Report

Qwen2. 5-vl technical report , author=. arXiv preprint arXiv:2502.13923 , year=

work page internal anchor Pith review Pith/arXiv arXiv
[26]

DeepSeek-VL2: Mixture-of-Experts Vision-Language Models for Advanced Multimodal Understanding

Deepseek-vl2: Mixture-of-experts vision-language models for advanced multimodal understanding , author=. arXiv preprint arXiv:2412.10302 , year=

work page internal anchor Pith review Pith/arXiv arXiv
[27]

arXiv e-prints , pages=

The llama 3 herd of models , author=. arXiv e-prints , pages=

work page
[28]

GPT-4 Technical Report

Gpt-4 technical report , author=. arXiv preprint arXiv:2303.08774 , year=

work page internal anchor Pith review Pith/arXiv arXiv
[29]

2025 , howpublished =

OpenAI , title =. 2025 , howpublished =

work page 2025
[30]

Thinking with Images for Multimodal Reasoning: Foundations, Methods, and Future Frontiers

Thinking with images for multimodal reasoning: Foundations, methods, and future frontiers , author=. arXiv preprint arXiv:2506.23918 , year=

work page internal anchor Pith review Pith/arXiv arXiv
[31]

Thyme: Think Beyond Images

Thyme: Think beyond images , author=. arXiv preprint arXiv:2508.11630 , year=

work page internal anchor Pith review Pith/arXiv arXiv
[32]

Traceable evidence enhanced visual grounded reasoning: Evaluation and methodology.arXiv preprint arXiv:2507.07999, 2025

Traceable evidence enhanced visual grounded reasoning: Evaluation and methodology , author=. arXiv preprint arXiv:2507.07999 , year=

work page arXiv
[33]

DeepEyes: Incentivizing "Thinking with Images" via Reinforcement Learning

DeepEyes: Incentivizing" Thinking with Images" via Reinforcement Learning , author=. arXiv preprint arXiv:2505.14362 , year=

work page internal anchor Pith review Pith/arXiv arXiv
[34]

arXiv preprint arXiv:2412.18072 , year=

MMFactory: A Universal Solution Search Engine for Vision-Language Tasks , author=. arXiv preprint arXiv:2412.18072 , year=

work page arXiv
[35]

arXiv preprint arXiv:2107.07566 , year=

Internet-augmented dialogue generation , author=. arXiv preprint arXiv:2107.07566 , year=

work page arXiv
[36]

Workshop on Reasoning and Planning for Large Language Models , year=

TACO: Learning Multi-modal Models to Reason and Act with Synthetic Chains-of-Thought-and-Action , author=. Workshop on Reasoning and Planning for Large Language Models , year=

work page
[37]

Cogcom: Train large vision-language models diving into details through chain of manipulations , author=

work page
[38]

OpenThinkIMG: Learning to Think with Images via Visual Tool Reinforcement Learning

Openthinkimg: Learning to think with images via visual tool reinforcement learning , author=. arXiv preprint arXiv:2505.08617 , year=

work page internal anchor Pith review Pith/arXiv arXiv
[39]

DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models

Deepseekmath: Pushing the limits of mathematical reasoning in open language models , author=. arXiv preprint arXiv:2402.03300 , year=

work page internal anchor Pith review Pith/arXiv arXiv
[40]

SFT or RL? An Early Investigation into Training R1-Like Reasoning Large Vision-Language Models

Sft or rl? an early investigation into training r1-like reasoning large vision-language models , author=. arXiv preprint arXiv:2504.11468 , year=

work page internal anchor Pith review Pith/arXiv arXiv
[41]

2024 , journal =

HybridFlow: A Flexible and Efficient RLHF Framework , author =. 2024 , journal =

work page 2024
[42]

Proceedings of the 29th Symposium on Operating Systems Principles , pages=

Efficient memory management for large language model serving with pagedattention , author=. Proceedings of the 29th Symposium on Operating Systems Principles , pages=

work page
[43]

2024 , howpublished =

Qwen2.5 Technical Report , author =. 2024 , howpublished =

work page 2024
[44]

Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition , pages=

V?: Guided visual search as a core mechanism in multimodal llms , author=. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition , pages=

work page
[45]

arXiv preprint arXiv:2403.00231 , year=

Multimodal arxiv: A dataset for improving scientific comprehension of large vision-language models , author=. arXiv preprint arXiv:2403.00231 , year=

work page arXiv
[46]

arXiv preprint arXiv:2504.07934 , year=

Sota with less: Mcts-guided sample selection for data-efficient visual reasoning self-improvement , author=. arXiv preprint arXiv:2504.07934 , year=

work page arXiv
[47]

Proceedings of the AAAI Conference on Artificial Intelligence , pages=

Divide, conquer and combine: A training-free framework for high-resolution image perception in multimodal large language models , author=. Proceedings of the AAAI Conference on Artificial Intelligence , pages=

work page
[48]

Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition , pages=

Coco-stuff: Thing and stuff classes in context , author=. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition , pages=

work page
[49]

Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing , pages=

Referitgame: Referring to objects in photographs of natural scenes , author=. Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing , pages=

work page 2014
[50]

Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition , pages=

Lisa: Reasoning segmentation via large language model , author=. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition , pages=

work page
[51]

Evaluating Object Hallucination in Large Vision-Language Models

Evaluating object hallucination in large vision-language models , author=. arXiv preprint arXiv:2305.10355 , year=

work page internal anchor Pith review Pith/arXiv arXiv
[52]

European Conference on Computer Vision , pages=

Mathverse: Does your multi-modal llm truly see the diagrams in visual math problems? , author=. European Conference on Computer Vision , pages=. 2024 , organization=

work page 2024
[53]

Advances in Neural Information Processing Systems , volume=

Measuring multimodal mathematical reasoning with math-vision dataset , author=. Advances in Neural Information Processing Systems , volume=

work page
[54]

We-Math: Does Your Large Multimodal Model Achieve Human-like Mathematical Reasoning?

We-math: Does your large multimodal model achieve human-like mathematical reasoning? , author=. arXiv preprint arXiv:2407.01284 , year=

work page internal anchor Pith review Pith/arXiv arXiv
[55]

Dynamath: A dynamic visual benchmark for evaluating mathematical reasoning robustness of vision language models.arXiv preprint arXiv:2411.00836, 2024

Dynamath: A dynamic visual benchmark for evaluating mathematical reasoning robustness of vision language models , author=. arXiv preprint arXiv:2411.00836 , year=

work page arXiv
[56]

LogicVista: Multimodal LLM Logical Reasoning Benchmark in Visual Contexts

Logicvista: Multimodal llm logical reasoning benchmark in visual contexts , author=. arXiv preprint arXiv:2407.04973 , year=

work page internal anchor Pith review Pith/arXiv arXiv
[57]

Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition , pages=

Dyfo: A training-free dynamic focus visual search for enhancing lmms in fine-grained visual understanding , author=. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition , pages=

work page
[58]

arXiv preprint arXiv:2411.16044 , year=

Zoomeye: Enhancing multimodal llms with human-like zooming capabilities through tree-based image exploration , author=. arXiv preprint arXiv:2411.16044 , year=

work page arXiv
[59]

Pixel Reasoner: Incentivizing Pixel-Space Reasoning with Curiosity-Driven Reinforcement Learning

Pixel reasoner: Incentivizing pixel-space reasoning with curiosity-driven reinforcement learning , author=. arXiv preprint arXiv:2505.15966 , year=

work page internal anchor Pith review Pith/arXiv arXiv
[60]

LLaVA-OneVision: Easy Visual Task Transfer

Llava-onevision: Easy visual task transfer , author=. arXiv preprint arXiv:2408.03326 , year=

work page internal anchor Pith review Pith/arXiv arXiv
[61]

InternVL3: Exploring Advanced Training and Test-Time Recipes for Open-Source Multimodal Models

Internvl3: Exploring advanced training and test-time recipes for open-source multimodal models , author=. arXiv preprint arXiv:2504.10479 , year=

work page internal anchor Pith review Pith/arXiv arXiv
[62]

Mini-o3: Scaling Up Reasoning Patterns and Interaction Turns for Visual Search

Mini-o3: Scaling up reasoning patterns and interaction turns for visual search , author=. arXiv preprint arXiv:2509.07969 , year=

work page internal anchor Pith review Pith/arXiv arXiv
[63]

Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition , pages=

Improved baselines with visual instruction tuning , author=. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition , pages=

work page
[64]

International Conference on Machine Learning , pages=

Blip: Bootstrapping language-image pre-training for unified vision-language understanding and generation , author=. International Conference on Machine Learning , pages=. 2022 , organization=

work page 2022
[65]

International Conference on Machine Learning , pages=

Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models , author=. International Conference on Machine Learning , pages=. 2023 , organization=

work page 2023
[66]

Qwen-VL: A Versatile Vision-Language Model for Understanding, Localization, Text Reading, and Beyond

Qwen-VL: A Versatile Vision-Language Model for Understanding, Localization, Text Reading, and Beyond , author=. arXiv preprint arXiv:2308.12966 , year=

work page internal anchor Pith review Pith/arXiv arXiv
[67]

GPT-4o System Card

Gpt-4o system card , author=. arXiv preprint arXiv:2410.21276 , year=

work page internal anchor Pith review Pith/arXiv arXiv
[68]

Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition , pages=

Reproducible scaling laws for contrastive language-image learning , author=. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition , pages=

work page
[69]

International Conference on Machine Learning , pages=

Learning transferable visual models from natural language supervision , author=. International Conference on Machine Learning , pages=. 2021 , organization=

work page 2021
[70]

Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition , pages=

Internvl: Scaling up vision foundation models and aligning for generic visual-linguistic tasks , author=. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition , pages=

work page
[71]

Advances in Neural Information Processing Systems , volume=

Flamingo: a visual language model for few-shot learning , author=. Advances in Neural Information Processing Systems , volume=

work page
[72]

Advances in Neural Information Processing Systems , volume=

Cambrian-1: A fully open, vision-centric exploration of multimodal llms , author=. Advances in Neural Information Processing Systems , volume=

work page
[73]

Advances in Neural Information Processing Systems , volume=

Unveiling encoder-free vision-language models , author=. Advances in Neural Information Processing Systems , volume=

work page
[74]

Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition , pages=

Mono-internvl: Pushing the boundaries of monolithic multimodal large language models with endogenous visual pre-training , author=. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition , pages=

work page
[75]

arXiv preprint arXiv:2504.10462 , year=

The scalability of simplicity: Empirical analysis of vision-language learning with a single transformer , author=. arXiv preprint arXiv:2504.10462 , year=

work page arXiv
[76]

arXiv preprint arXiv:2505.17018 , year=

SophiaVL-R1: Reinforcing MLLMs Reasoning with Thinking Reward , author=. arXiv preprint arXiv:2505.17018 , year=

work page arXiv
[77]

arXiv preprint arXiv:2505.16673 , year=

R1-ShareVL: Incentivizing Reasoning Capability of Multimodal Large Language Models via Share-GRPO , author=. arXiv preprint arXiv:2505.16673 , year=

work page arXiv
[78]

Advances in Neural Information Processing Systems , volume=

Visual sketchpad: Sketching as a visual chain of thought for multimodal language models , author=. Advances in Neural Information Processing Systems , volume=

work page
[79]

Proceedings of the IEEE International Conference on Computer Vision , pages=

Segment anything , author=. Proceedings of the IEEE International Conference on Computer Vision , pages=

work page
[80]

SAM 2: Segment Anything in Images and Videos

Sam 2: Segment anything in images and videos , author=. arXiv preprint arXiv:2408.00714 , year=

work page internal anchor Pith review Pith/arXiv arXiv

Showing first 80 references.

[1] [1]

Langley , title =

P. Langley , title =. Proceedings of the 17th International Conference on Machine Learning (ICML 2000) , address =. 2000 , pages =

work page 2000

[2] [2]

T. M. Mitchell. The Need for Biases in Learning Generalizations. 1980

work page 1980

[3] [3]

M. J. Kearns , title =

work page

[4] [4]

Machine Learning: An Artificial Intelligence Approach, Vol. I. 1983

work page 1983

[5] [5]

R. O. Duda and P. E. Hart and D. G. Stork. Pattern Classification. 2000

work page 2000

[6] [6]

Suppressed for Anonymity , author=

work page

[7] [7]

Newell and P

A. Newell and P. S. Rosenbloom. Mechanisms of Skill Acquisition and the Law of Practice. Cognitive Skills and Their Acquisition. 1981

work page 1981

[8] [8]

A. L. Samuel. Some Studies in Machine Learning Using the Game of Checkers. IBM Journal of Research and Development. 1959

work page 1959

[9] [9]

arXiv preprint arXiv:2503.10639 , year=

Got: Unleashing reasoning capability of multimodal large language model for visual generation and editing , author=. arXiv preprint arXiv:2503.10639 , year=

work page arXiv

[10] [10]

Advances in Neural Information Processing Systems , volume=

Chain-of-thought prompting elicits reasoning in large language models , author=. Advances in Neural Information Processing Systems , volume=

work page

[11] [11]

Advances in Neural Information Processing Systems , volume=

Large language models are zero-shot reasoners , author=. Advances in Neural Information Processing Systems , volume=

work page

[12] [12]

Gemini: A Family of Highly Capable Multimodal Models

Gemini: a family of highly capable multimodal models , author=. arXiv preprint arXiv:2312.11805 , year=

work page internal anchor Pith review Pith/arXiv arXiv

[13] [13]

Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution

Qwen2-vl: Enhancing vision-language model's perception of the world at any resolution , author=. arXiv preprint arXiv:2409.12191 , year=

work page internal anchor Pith review Pith/arXiv arXiv

[14] [14]

Advances in Neural Information Processing Systems , volume=

Visual instruction tuning , author=. Advances in Neural Information Processing Systems , volume=

work page

[15] [15]

MathVista: Evaluating Mathematical Reasoning of Foundation Models in Visual Contexts

Mathvista: Evaluating mathematical reasoning of foundation models in visual contexts , author=. arXiv preprint arXiv:2310.02255 , year=

work page internal anchor Pith review Pith/arXiv arXiv

[16] [16]

arXiv preprint arXiv:2312.11370 , year=

G-llava: Solving geometric problem with multi-modal large language model , author=. arXiv preprint arXiv:2312.11370 , year=

work page arXiv

[17] [17]

arXiv preprint arXiv:2407.08739 , year=

Mavis: Mathematical visual instruction tuning with an automatic data engine , author=. arXiv preprint arXiv:2407.08739 , year=

work page arXiv

[18] [18]

The Twelfth International Conference on Learning Representations , year=

Grounding multimodal large language models to the world , author=. The Twelfth International Conference on Learning Representations , year=

work page

[19] [19]

Proceedings of the IEEE International Conference on Computer Vision , pages=

Vqa: Visual question answering , author=. Proceedings of the IEEE International Conference on Computer Vision , pages=

work page

[20] [20]

Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) , pages=

Conceptual captions: A cleaned, hypernymed, image alt-text dataset for automatic image captioning , author=. Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) , pages=

work page

[21] [21]

DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning

Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning , author=. arXiv preprint arXiv:2501.12948 , year=

work page internal anchor Pith review Pith/arXiv arXiv

[22] [22]

arXiv preprint arXiv:2502.09621 , year=

Mme-cot: Benchmarking chain-of-thought in large multimodal models for reasoning quality, robustness, and efficiency , author=. arXiv preprint arXiv:2502.09621 , year=

work page arXiv

[23] [23]

OpenAI o1 System Card

Openai o1 system card , author=. arXiv preprint arXiv:2412.16720 , year=

work page internal anchor Pith review Pith/arXiv arXiv

[24] [24]

European Conference on Computer Vision , pages=

Modeling context in referring expressions , author=. European Conference on Computer Vision , pages=. 2016 , organization=

work page 2016

[25] [25]

Qwen2.5-VL Technical Report

Qwen2. 5-vl technical report , author=. arXiv preprint arXiv:2502.13923 , year=

work page internal anchor Pith review Pith/arXiv arXiv

[26] [26]

DeepSeek-VL2: Mixture-of-Experts Vision-Language Models for Advanced Multimodal Understanding

Deepseek-vl2: Mixture-of-experts vision-language models for advanced multimodal understanding , author=. arXiv preprint arXiv:2412.10302 , year=

work page internal anchor Pith review Pith/arXiv arXiv

[27] [27]

arXiv e-prints , pages=

The llama 3 herd of models , author=. arXiv e-prints , pages=

work page

[28] [28]

GPT-4 Technical Report

Gpt-4 technical report , author=. arXiv preprint arXiv:2303.08774 , year=

work page internal anchor Pith review Pith/arXiv arXiv

[29] [29]

2025 , howpublished =

OpenAI , title =. 2025 , howpublished =

work page 2025

[30] [30]

Thinking with Images for Multimodal Reasoning: Foundations, Methods, and Future Frontiers

Thinking with images for multimodal reasoning: Foundations, methods, and future frontiers , author=. arXiv preprint arXiv:2506.23918 , year=

work page internal anchor Pith review Pith/arXiv arXiv

[31] [31]

Thyme: Think Beyond Images

Thyme: Think beyond images , author=. arXiv preprint arXiv:2508.11630 , year=

work page internal anchor Pith review Pith/arXiv arXiv

[32] [32]

Traceable evidence enhanced visual grounded reasoning: Evaluation and methodology.arXiv preprint arXiv:2507.07999, 2025

Traceable evidence enhanced visual grounded reasoning: Evaluation and methodology , author=. arXiv preprint arXiv:2507.07999 , year=

work page arXiv

[33] [33]

DeepEyes: Incentivizing "Thinking with Images" via Reinforcement Learning

DeepEyes: Incentivizing" Thinking with Images" via Reinforcement Learning , author=. arXiv preprint arXiv:2505.14362 , year=

work page internal anchor Pith review Pith/arXiv arXiv

[34] [34]

arXiv preprint arXiv:2412.18072 , year=

MMFactory: A Universal Solution Search Engine for Vision-Language Tasks , author=. arXiv preprint arXiv:2412.18072 , year=

work page arXiv

[35] [35]

arXiv preprint arXiv:2107.07566 , year=

Internet-augmented dialogue generation , author=. arXiv preprint arXiv:2107.07566 , year=

work page arXiv

[36] [36]

Workshop on Reasoning and Planning for Large Language Models , year=

TACO: Learning Multi-modal Models to Reason and Act with Synthetic Chains-of-Thought-and-Action , author=. Workshop on Reasoning and Planning for Large Language Models , year=

work page

[37] [37]

Cogcom: Train large vision-language models diving into details through chain of manipulations , author=

work page

[38] [38]

OpenThinkIMG: Learning to Think with Images via Visual Tool Reinforcement Learning

Openthinkimg: Learning to think with images via visual tool reinforcement learning , author=. arXiv preprint arXiv:2505.08617 , year=

work page internal anchor Pith review Pith/arXiv arXiv

[39] [39]

DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models

Deepseekmath: Pushing the limits of mathematical reasoning in open language models , author=. arXiv preprint arXiv:2402.03300 , year=

work page internal anchor Pith review Pith/arXiv arXiv

[40] [40]

SFT or RL? An Early Investigation into Training R1-Like Reasoning Large Vision-Language Models

Sft or rl? an early investigation into training r1-like reasoning large vision-language models , author=. arXiv preprint arXiv:2504.11468 , year=

work page internal anchor Pith review Pith/arXiv arXiv

[41] [41]

2024 , journal =

HybridFlow: A Flexible and Efficient RLHF Framework , author =. 2024 , journal =

work page 2024

[42] [42]

Proceedings of the 29th Symposium on Operating Systems Principles , pages=

Efficient memory management for large language model serving with pagedattention , author=. Proceedings of the 29th Symposium on Operating Systems Principles , pages=

work page

[43] [43]

2024 , howpublished =

Qwen2.5 Technical Report , author =. 2024 , howpublished =

work page 2024

[44] [44]

Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition , pages=

V?: Guided visual search as a core mechanism in multimodal llms , author=. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition , pages=

work page

[45] [45]

arXiv preprint arXiv:2403.00231 , year=

Multimodal arxiv: A dataset for improving scientific comprehension of large vision-language models , author=. arXiv preprint arXiv:2403.00231 , year=

work page arXiv

[46] [46]

arXiv preprint arXiv:2504.07934 , year=

Sota with less: Mcts-guided sample selection for data-efficient visual reasoning self-improvement , author=. arXiv preprint arXiv:2504.07934 , year=

work page arXiv

[47] [47]

Proceedings of the AAAI Conference on Artificial Intelligence , pages=

Divide, conquer and combine: A training-free framework for high-resolution image perception in multimodal large language models , author=. Proceedings of the AAAI Conference on Artificial Intelligence , pages=

work page

[48] [48]

Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition , pages=

Coco-stuff: Thing and stuff classes in context , author=. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition , pages=

work page

[49] [49]

Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing , pages=

Referitgame: Referring to objects in photographs of natural scenes , author=. Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing , pages=

work page 2014

[50] [50]

Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition , pages=

Lisa: Reasoning segmentation via large language model , author=. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition , pages=

work page

[51] [51]

Evaluating Object Hallucination in Large Vision-Language Models

Evaluating object hallucination in large vision-language models , author=. arXiv preprint arXiv:2305.10355 , year=

work page internal anchor Pith review Pith/arXiv arXiv

[52] [52]

European Conference on Computer Vision , pages=

Mathverse: Does your multi-modal llm truly see the diagrams in visual math problems? , author=. European Conference on Computer Vision , pages=. 2024 , organization=

work page 2024

[53] [53]

Advances in Neural Information Processing Systems , volume=

Measuring multimodal mathematical reasoning with math-vision dataset , author=. Advances in Neural Information Processing Systems , volume=

work page

[54] [54]

We-Math: Does Your Large Multimodal Model Achieve Human-like Mathematical Reasoning?

We-math: Does your large multimodal model achieve human-like mathematical reasoning? , author=. arXiv preprint arXiv:2407.01284 , year=

work page internal anchor Pith review Pith/arXiv arXiv

[55] [55]

Dynamath: A dynamic visual benchmark for evaluating mathematical reasoning robustness of vision language models.arXiv preprint arXiv:2411.00836, 2024

Dynamath: A dynamic visual benchmark for evaluating mathematical reasoning robustness of vision language models , author=. arXiv preprint arXiv:2411.00836 , year=

work page arXiv

[56] [56]

LogicVista: Multimodal LLM Logical Reasoning Benchmark in Visual Contexts

Logicvista: Multimodal llm logical reasoning benchmark in visual contexts , author=. arXiv preprint arXiv:2407.04973 , year=

work page internal anchor Pith review Pith/arXiv arXiv

[57] [57]

Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition , pages=

Dyfo: A training-free dynamic focus visual search for enhancing lmms in fine-grained visual understanding , author=. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition , pages=

work page

[58] [58]

arXiv preprint arXiv:2411.16044 , year=

Zoomeye: Enhancing multimodal llms with human-like zooming capabilities through tree-based image exploration , author=. arXiv preprint arXiv:2411.16044 , year=

work page arXiv

[59] [59]

Pixel Reasoner: Incentivizing Pixel-Space Reasoning with Curiosity-Driven Reinforcement Learning

Pixel reasoner: Incentivizing pixel-space reasoning with curiosity-driven reinforcement learning , author=. arXiv preprint arXiv:2505.15966 , year=

work page internal anchor Pith review Pith/arXiv arXiv

[60] [60]

LLaVA-OneVision: Easy Visual Task Transfer

Llava-onevision: Easy visual task transfer , author=. arXiv preprint arXiv:2408.03326 , year=

work page internal anchor Pith review Pith/arXiv arXiv

[61] [61]

InternVL3: Exploring Advanced Training and Test-Time Recipes for Open-Source Multimodal Models

Internvl3: Exploring advanced training and test-time recipes for open-source multimodal models , author=. arXiv preprint arXiv:2504.10479 , year=

work page internal anchor Pith review Pith/arXiv arXiv

[62] [62]

Mini-o3: Scaling Up Reasoning Patterns and Interaction Turns for Visual Search

Mini-o3: Scaling up reasoning patterns and interaction turns for visual search , author=. arXiv preprint arXiv:2509.07969 , year=

work page internal anchor Pith review Pith/arXiv arXiv

[63] [63]

Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition , pages=

Improved baselines with visual instruction tuning , author=. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition , pages=

work page

[64] [64]

International Conference on Machine Learning , pages=

Blip: Bootstrapping language-image pre-training for unified vision-language understanding and generation , author=. International Conference on Machine Learning , pages=. 2022 , organization=

work page 2022

[65] [65]

International Conference on Machine Learning , pages=

Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models , author=. International Conference on Machine Learning , pages=. 2023 , organization=

work page 2023

[66] [66]

Qwen-VL: A Versatile Vision-Language Model for Understanding, Localization, Text Reading, and Beyond

Qwen-VL: A Versatile Vision-Language Model for Understanding, Localization, Text Reading, and Beyond , author=. arXiv preprint arXiv:2308.12966 , year=

work page internal anchor Pith review Pith/arXiv arXiv

[67] [67]

GPT-4o System Card

Gpt-4o system card , author=. arXiv preprint arXiv:2410.21276 , year=

work page internal anchor Pith review Pith/arXiv arXiv

[68] [68]

Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition , pages=

Reproducible scaling laws for contrastive language-image learning , author=. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition , pages=

work page

[69] [69]

International Conference on Machine Learning , pages=

Learning transferable visual models from natural language supervision , author=. International Conference on Machine Learning , pages=. 2021 , organization=

work page 2021

[70] [70]

Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition , pages=

Internvl: Scaling up vision foundation models and aligning for generic visual-linguistic tasks , author=. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition , pages=

work page

[71] [71]

Advances in Neural Information Processing Systems , volume=

Flamingo: a visual language model for few-shot learning , author=. Advances in Neural Information Processing Systems , volume=

work page

[72] [72]

Advances in Neural Information Processing Systems , volume=

Cambrian-1: A fully open, vision-centric exploration of multimodal llms , author=. Advances in Neural Information Processing Systems , volume=

work page

[73] [73]

Advances in Neural Information Processing Systems , volume=

Unveiling encoder-free vision-language models , author=. Advances in Neural Information Processing Systems , volume=

work page

[74] [74]

Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition , pages=

Mono-internvl: Pushing the boundaries of monolithic multimodal large language models with endogenous visual pre-training , author=. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition , pages=

work page

[75] [75]

arXiv preprint arXiv:2504.10462 , year=

The scalability of simplicity: Empirical analysis of vision-language learning with a single transformer , author=. arXiv preprint arXiv:2504.10462 , year=

work page arXiv

[76] [76]

arXiv preprint arXiv:2505.17018 , year=

SophiaVL-R1: Reinforcing MLLMs Reasoning with Thinking Reward , author=. arXiv preprint arXiv:2505.17018 , year=

work page arXiv

[77] [77]

arXiv preprint arXiv:2505.16673 , year=

R1-ShareVL: Incentivizing Reasoning Capability of Multimodal Large Language Models via Share-GRPO , author=. arXiv preprint arXiv:2505.16673 , year=

work page arXiv

[78] [78]

Advances in Neural Information Processing Systems , volume=

Visual sketchpad: Sketching as a visual chain of thought for multimodal language models , author=. Advances in Neural Information Processing Systems , volume=

work page

[79] [79]

Proceedings of the IEEE International Conference on Computer Vision , pages=

Segment anything , author=. Proceedings of the IEEE International Conference on Computer Vision , pages=

work page

[80] [80]

SAM 2: Segment Anything in Images and Videos

Sam 2: Segment anything in images and videos , author=. arXiv preprint arXiv:2408.00714 , year=

work page internal anchor Pith review Pith/arXiv arXiv