pith. machine review for the scientific record.

arxiv: 2605.11931 · v1 · submitted 2026-05-12 · 💻 cs.CV

Recognition: no theorem link

Learn to Think: Improving Multimodal Reasoning through Vision-Aware Self-Improvement Training

Authors on Pith: no claims yet

Pith reviewed 2026-05-13 06:04 UTC · model grok-4.3

classification 💻 cs.CV
keywords multimodal reasoning · self-improvement training · MLLMs · vision-aware attention · prefix resampling · language prior bias · reasoning traces · post-training

The pith

VISTA corrects data imbalance and language prior bias in self-improvement training to boost multimodal reasoning in MLLMs.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper aims to establish that self-improvement training for multimodal large language models suffers from two problems: data imbalance, where simple samples are overtrained while hard but crucial ones are undertrained, and language prior bias, where text patterns are favored over visual evidence. VISTA counters both with a prefix resampling method that reuses the useful segments of partially correct reasoning traces and a vision-aware attention score that measures, and encourages, focus on image content during generation. If these changes hold, models can produce higher-quality training data internally, raising reasoning accuracy on tasks that require integrating sight and language. A sympathetic reader would care because expensive external reasoning traces become less necessary for building capable multimodal systems.
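
To make the prefix resampling idea concrete, here is a minimal sketch of the loop it implies: for a hard query, take a self-generated trace whose final answer was wrong, keep a usable prefix, and resample continuations from that prefix until a verified-correct solution appears. The helpers `truncate_to_usable_prefix`, `generate_continuation`, and `is_correct` are illustrative placeholders, not the paper's actual interfaces, and the paper's rule for choosing the prefix may differ.

```python
import random

def prefix_resample(query, failed_traces, truncate_to_usable_prefix,
                    generate_continuation, is_correct, budget=8):
    """Collect verified solutions for a hard query by reusing partial traces.

    failed_traces: earlier self-generated traces whose final answers were wrong
    but whose early reasoning steps may still be sound. All callables are
    hypothetical stand-ins for the model, the prefix-selection rule, and the
    answer verifier; this is a sketch of the idea, not the paper's procedure.
    """
    collected = []
    for _ in range(budget):
        if not failed_traces:
            break
        trace = random.choice(failed_traces)
        prefix = truncate_to_usable_prefix(trace)             # keep the reusable segment
        if not prefix:
            continue
        continuation = generate_continuation(query, prefix)   # resample from the prefix
        candidate = prefix + continuation
        if is_correct(query, candidate):                      # check against the gold answer
            collected.append(candidate)
    return collected
```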

Core claim

By introducing a prefix resampling strategy that reuses partially correct reasoning traces for efficient data collection, and a vision-aware attention score that quantifies the model's focus on visual information, VISTA mitigates data imbalance and language prior bias in self-generated reasoning data. The result is improved multimodal reasoning when the framework is applied to supervised fine-tuning and preference learning across various MLLMs and tasks.

What carries the argument

VISTA's prefix resampling strategy paired with its vision-aware attention score, where the score calculates attention directed to visual tokens to promote image-grounded reasoning over linguistic shortcuts.
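
As a concrete reading of that score, the sketch below computes the fraction of attention mass that generated response tokens place on visual tokens, averaged over heads at one decoder layer, and thresholds it to filter candidate traces. The normalization, layer choice, and threshold value are assumptions for illustration; the paper's exact definition of the vision-aware attention score may differ.

```python
import torch

def vision_aware_attention_score(attn, visual_mask, response_mask):
    """Share of attention that response tokens direct at visual tokens.

    attn: (num_heads, seq_len, seq_len) attention weights from one decoder layer.
    visual_mask: (seq_len,) bool, True at image-token positions.
    response_mask: (seq_len,) bool, True at self-generated response positions.
    One plausible reading of the paper's VAS, not its exact formula.
    """
    attn = attn.float().mean(dim=0)                     # average over heads
    rows = attn[response_mask]                          # attention rows of response tokens
    mass_on_visual = rows[:, visual_mask].sum(dim=-1)   # mass landing on image tokens
    total_mass = rows.sum(dim=-1).clamp_min(1e-8)
    return (mass_on_visual / total_mass).mean().item()

def keep_trace(attn, visual_mask, response_mask, tau=0.2):
    # Filter out candidate traces whose visual focus falls below a threshold;
    # tau here is a hypothetical value (the paper analyzes this threshold, see Figure 4c).
    return vision_aware_attention_score(attn, visual_mask, response_mask) >= tau
```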

Load-bearing premise

That the prefix resampling strategy reuses partial traces without introducing new biases and that the vision-aware attention score reliably quantifies and corrects language prior bias in a way that directly causes performance gains.

What would settle it

Ablating the vision-aware attention score during training and checking whether performance gains disappear on tasks that require strong visual grounding, such as diagram-based problem solving or counting objects in complex scenes.
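
A minimal harness for that test, holding the training recipe fixed and toggling only the VAS-based filter, might look like the sketch below; `train_and_eval`, the benchmark handles, and the threshold are placeholders rather than anything specified in the paper.

```python
def vas_ablation(candidates, train_and_eval, benchmarks, tau=0.2):
    """Compare training on VAS-filtered vs. unfiltered self-generated data.

    candidates: list of (sample, vas_score) pairs from the same sampling run.
    train_and_eval: hypothetical callable that fine-tunes a model on the given
    samples and returns {benchmark: accuracy}; tau is an assumed threshold.
    """
    filtered = [s for s, vas in candidates if vas >= tau]
    unfiltered = [s for s, _ in candidates]
    acc_with = train_and_eval(filtered, benchmarks)
    acc_without = train_and_eval(unfiltered, benchmarks)
    # If the gains vanish without the VAS filter on visually grounded tasks
    # (diagram problems, object counting), the score is doing real causal work.
    return {b: acc_with[b] - acc_without[b] for b in benchmarks}
```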

Figures

Figures reproduced from arXiv: 2605.11931 by Bo Du, Dacheng Tao, Juhua Liu, Liang Ding, Qihuang Zhong, Wenjie Xuan.

Figure 1
Figure 1: Comparison of two self-generated reasoning traces. As seen, although predicting the correct answer, MLLMs may still exhibit visual hallucinations during intermediate reasoning processes, due to over-reliance on language priors and neglect of visual cues. Encouragingly, our proposed vision-aware attention score (VAS) can accurately identify these hallucinated solutions.
Figure 2
Figure 2: (a) Distribution of the number of correct solutions in a single query. (b) Distribution of self-generated training samples for different difficulty levels, where level-1 denotes the simplest and level-4 denotes the hardest. (c) Attention allocation between system prompts, visual, and instruction tokens across different model layers. Here, we use the Qwen2.5-VL-3B-Instruct as the base model.
Figure 3
Figure 3: Overview of our VISTA framework, which consists of two simple-yet-effective strategies: 1) prefix resampling, aiming to collect more accurate solutions for difficult queries; 2) vision-aware attention score, aiming to filter out undesired hallucinated solutions.
Figure 4
Figure 4: (a) Performance comparison of tuned Qwen2.5-VL-7B models using different data collection methods. (b) Performance comparison of tuned Qwen2.5-VL-3B models using different data selection metrics on SLAKE. (c) Parameter analysis of threshold τ in VISTA on Qwen2.5-VL-3B models. Notably, in these experiments, we perform the self-improvement SFT training for one iteration.
Figure 6
Figure 6: Comparison of OOD results between tuned Qwen2.5-VL-3B models using different self-improvement SFT methods. The x-axis denotes the index of self-improvement iteration.
Figure 7
Figure 7: Examples of self-generated long-CoT data in various multimodal reasoning tasks.
Figure 8
Figure 8: (a) Sampling success rate of our prefix resampling on the hardest queries (i.e., without any prior correct solutions). The x-axis denotes the number of resampled correct solutions in a query. (b) Distribution of the number of correct tokens in a query before and after using our prefix resampling strategy. Here, we show the results on the challenging Geometry3K task. (c) Distribution of our VAS scores in co…
Figure 9
Figure 9: Examples of two types of self-generated hallucinated solutions, where the hallucinated contents are highlighted in red.
Figure 10
Figure 10: (a) Analysis of layer depth for calculating VAS scores. (b) Performance comparison between with and without the self-consistency method. Notably, we report the results of Qwen2.5-VL-3B-Instruct models self-trained for one iteration.
Figure 11
Figure 11: Illustration of our prefix resampling strategy. Notably, we use examples from the SLAKE (Left) and ChartQA (Right).
Figure 12
Figure 12: Examples of self-generated solutions with our VAS scores in the VQA-Rad (Left) and ChartQA (Right) tasks.
Figure 13
Figure 13: Examples of solutions predicted by Qwen2.5-VL-3B models tuned with different self-improvement training methods.
read the original abstract

Post-training with explicit reasoning traces is common to improve the reasoning capabilities of Multimodal Large Language Models (MLLMs). However, acquiring high-quality reasoning traces is often costly and time-consuming. Hence, the self-improvement paradigm has emerged, enabling MLLMs to self-generate reasoning traces for training without external supervision. Despite its effectiveness, we reveal two shortcomings in the self-improvement training of MLLMs: 1) data imbalance, where simple samples are over-trained, but the challenging yet crucial samples are under-trained; 2) language prior bias, where MLLMs overly rely on linguistic priors while neglecting the visual cues. To this end, we propose VISTA, a vision-aware self-improvement training framework for enhancing the multimodal reasoning of MLLMs. Specifically, VISTA first introduces a prefix resampling strategy to reuse the partial correct reasoning traces for efficient data collection, and then designs a vision-aware attention score to quantify the model's focus on visual information. Extensive experiments show that VISTA can be applied to various post-training scenarios, i.e., supervised fine-tuning and preference learning, and effectively enhances the multimodal reasoning performance across various MLLMs and tasks, e.g., bringing up to +13.66% average performance gains for Qwen2.5-VL-3B-Instruct.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Circularity Check

0 steps flagged

No circularity: empirical framework with independent experimental validation

full rationale

The paper presents VISTA as an empirical self-improvement method using prefix resampling and a vision-aware attention score, with all performance claims (+13.66% on Qwen2.5-VL-3B-Instruct) reported strictly as outcomes of experiments across MLLMs and tasks. No equations, derivations, fitted parameters renamed as predictions, or load-bearing self-citations appear in the provided text. The central claims rest on external benchmarks and ablations rather than reducing to inputs by construction, making the work self-contained with no circular steps.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

This is an empirical machine-learning paper whose central claim rests on experimental results rather than formal axioms or derivations. No free parameters beyond standard training hyperparameters are mentioned. The vision-aware attention score is a new metric introduced by the authors.

axioms (1)
  • standard math: Standard assumptions of gradient-based optimization and attention mechanisms in transformer models
    Implicit in any fine-tuning or attention-based training paper.
invented entities (1)
  • Vision-aware attention score: no independent evidence
    purpose: Quantify and encourage model focus on visual information during self-improvement
    Newly defined metric in the proposed framework; no independent evidence outside the paper's experiments.

pith-pipeline@v0.9.0 · 5550 in / 1335 out tokens · 87696 ms · 2026-05-13T06:04:27.587077+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

32 extracted references · 32 canonical work pages · 6 internal anchors

  1. [1]

    The internal state of an LLM knows when it's lying

    Azaria, A. and Mitchell, T. The internal state of an LLM knows when it's lying. In Findings of the Association for Computational Linguistics: EMNLP 2023,

  2. [2]

    Qwen3-VL Technical Report

    Bai, S., Cai, Y., Chen, R., Chen, K., Chen, X., Cheng, Z., Deng, L., Ding, W., Gao, C., Ge, C., et al. Qwen3-VL technical report. arXiv preprint arXiv:2511.21631, 2025a. Bai, S., Chen, K., Liu, X., Wang, J., Ge, W., Song, S., Dang, K., Wang, P., Wang, S., Tang, J., et al. Qwen2.5-VL technical report. arXiv preprint arXiv:2502.13923, 2025b. Bai, Z., Wang,...

  3. [3]

    Huatuogpt-o1, towards medical complex reasoning with LLMs

    Chen, J., Cai, Z., Ji, K., Wang, X., Liu, W., Wang, R., Hou, J., and Wang, B. Huatuogpt-o1, towards medical complex reasoning with LLMs. arXiv preprint arXiv:2412.18925,

  4. [4]

    Vision-language models can self-improve reasoning via reflection

    Cheng, K., YanTao, L., Xu, F., Zhang, J., Zhou, H., and Liu, Y. Vision-language models can self-improve reasoning via reflection. In Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers),

  5. [5]

    Self-improvement in multimodal large language models: A survey

    Deng, S., Wang, K., Yang, T., Singh, H., and Tian, Y. Self-improvement in multimodal large language models: A survey. In Findings of the Association for Computational Linguistics: EMNLP 2025,

  6. [6]

    Mitigating tail narrowing in llm self-improvement via socratic-guided sampling

    Ding, Y., Xi, Z., He, W., Lizhuoyuan, L., Zhai, Y., Xiaowei, S., Cai, X., Gui, T., Zhang, Q., and Huang, X.-J. Mitigating tail narrowing in llm self-improvement via socratic-guided sampling. In Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies,

  7. [7]

    Reinforced Self-Training (ReST) for Language Modeling

    Gulcehre, C., Paine, T. L., Srinivasan, S., Konyushkova, K., Weerts, L., Sharma, A., Siddhant, A., Ahern, A., Wang, M., Gu, C., et al. Reinforced self-training (ReST) for language modeling. arXiv preprint arXiv:2308.08998,

  8. [8]

    DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning

    Guo, D., Yang, D., Zhang, H., Song, J., Zhang, R., Xu, R., Zhu, Q., Ma, S., Wang, P., Bi, X., et al. Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning. arXiv preprint arXiv:2501.12948,

  9. [9]

    PathVQA: 30000+ Questions for Medical Visual Question Answering

    He, X., Zhang, Y., Mou, L., Xing, E., and Xie, P. Pathvqa: 30000+ questions for medical visual question answering. arXiv preprint arXiv:2003.10286,

  10. [10]

    Self-improvement in language models: The sharpening mechanism

    Huang, A., Block, A., Foster, D. J., Rohatgi, D., Zhang, C., Simchowitz, M., Ash, J. T., and Krishnamurthy, A. Self-improvement in language models: The sharpening mechanism. In The Thirteenth International Conference on Learning Representations, 2025a. Huang, J., Gu, S., Hou, L., Wu, Y., Wang, X., Yu, H., and Han, J. Large language models can self-improve...

  11. [11]

    Visual hallucinations of multi-modal large language models

    Huang, W., Liu, H., Guo, M., and Gong, N. Visual hallucinations of multi-modal large language models. In Findings of the Association for Computational Linguistics: ACL 2024,

  12. [12]

    Medvl-thinker: Simple baselines for multimodal medical reasoning

    Huang, X., Wu, J., Liu, H., Tang, X., and Zhou, Y. Medvl-thinker: Simple baselines for multimodal medical reasoning. arXiv preprint arXiv:2508.02669, 2025b. Jaech, A., Kalai, A., Lerer, A., Richardson, A., El-Kishky, A., Low, A., Helyar, A., Madry, A., Beutel, A., Carney, A., et al. Openai o1 system card. arXiv preprint arXiv:2412.16720,

  13. [13]

    The first few tokens are all you need: An efficient and effective unsupervised prefix fine-tuning method for reasoning models

    Ji, K., Xu, J., Liang, T., Liu, Q., He, Z., Chen, X., Liu, X., Wang, Z., Chen, J., Wang, B., et al. The first few tokens are all you need: An efficient and effective unsupervised prefix fine-tuning method for reasoning models. arXiv preprint arXiv:2503.02875,

  14. [14]

    More thinking, less seeing? Assessing amplified hallucination in multimodal reasoning models

    Liu, C., Xu, Z., Wei, Q., Wu, J., Zou, J., Wang, X. E., Zhou, Y., and Liu, S. More thinking, less seeing? Assessing amplified hallucination in multimodal reasoning models. arXiv preprint arXiv:2505.21523, 2025a. Liu, N. F., Gardner, M., Belinkov, Y., Peters, M. E., and Smith, N. A. Linguistic knowledge and transferability of contextual representations. ...

  15. [15]

    ChartQA: A benchmark for question answering about charts with visual and logical reasoning

    Masry, A., Do, X. L., Tan, J. Q., Joty, S., and Hoque, E. Chartqa: A benchmark for question answering about charts with visual and logical reasoning. In Findings of the Association for Computational Linguistics: ACL 2022,

  16. [16]

    DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models

    Shao, Z., Wang, P., Zhu, Q., Xu, R., Song, J., Bi, X., Zhang, H., Zhang, M., Li, Y., Wu, Y., et al. Deepseekmath: Pushing the limits of mathematical reasoning in open language models. arXiv preprint arXiv:2402.03300,

  17. [17]

    Wang, Y., Chen, W., Han, X., Lin, X., Zhao, H., Liu, Y., Zhai, B., Yuan, J., You, Q., and Yang, H. Exploring the reasoning abilities of multimodal large language models (mllms): A comprehensive survey on emerging trends in multimodal reasoning. arXiv preprint arXiv:2401.06805, 2024b. Wang, Y., Wu, S., Zhang, Y., Yan, S., Liu, Z., Luo, J., and Fei, H. M...

  18. [18]

    Reinforcement Learning with Verifiable Rewards Implicitly Incentivizes Correct Reasoning in Base LLMs

    Wen, X., Liu, Z., Zheng, S., Ye, S., Wu, Z., Wang, Y., Xu, Z., Liang, X., Li, J., Miao, Z., et al. Reinforcement learning with verifiable rewards implicitly incentivizes correct reasoning in base llms. arXiv preprint arXiv:2506.14245,

  19. [19]

    Lingshu: A generalist foundation model for unified multimodal medical understanding and reasoning

    Xu, W., Chan, H. P., Li, L., Aljunied, M., Yuan, R., Wang, J., Xiao, C., Chen, G., Liu, C., Li, Z., et al. Lingshu: A generalist foundation model for unified multimodal medical understanding and reasoning. arXiv preprint arXiv:2506.07044,

  20. [20]

    Dynamic early exit in reasoning models

    Yang, C., Si, Q., Duan, Y., Zhu, Z., Zhu, C., Li, Q., Lin, Z., Cao, L., and Wang, W. Dynamic early exit in reasoning models. arXiv preprint arXiv:2504.15895, 2025a. Yang, S., Tong, Y., Niu, X., Neubig, G., and Yue, X. Demystifying long chain-of-thought reasoning. In Forty-second International Conference on Machine Learning, 2025b. Yuan, Z., Yuan, H., Li...

  21. [21]

    Illusionbench: A large-scale and comprehensive benchmark for visual illusion understanding in vision-language models

    Zhang, Y., Zhang, Z., Wei, X., Liu, X., Zhai, G., and Min, X. Illusionbench: A large-scale and comprehensive benchmark for visual illusion understanding in vision-language models. arXiv preprint arXiv:2501.00848,

  22. [22]

    Thinking before looking: Improving multimodal LLM reasoning via mitigating visual hallucination

    Zheng, H., Xu, T., Sun, H., Pu, S., Chen, R., and Sun, L. Thinking before looking: Improving multimodal LLM reasoning via mitigating visual hallucination. arXiv preprint arXiv:2411.12591,

  23. [23]

    Kaft: Knowledge-aware fine-tuning for boosting llms’ domain-specific question-answering performance

    Zhong, Q., Ding, L., Cai, X., Liu, J., Du, B., and Tao, D. Kaft: Knowledge-aware fine-tuning for boosting llms’ domain-specific question-answering performance. In Findings of the Association for Computational Linguistics: ACL 2025,

  24. [24]

    InternVL3: Exploring Advanced Training and Test-Time Recipes for Open-Source Multimodal Models

    Zhong, Q., Ding, L., Liu, J., Du, B., Rutkowski, L., and Tao, D. Better, faster: Harnessing self-improvement in large reasoning models. arXiv preprint, 2026a. Zhong, Q., Wang, K., Xu, Z., Ding, L., Liu, J., and Du, B. Achieving >97% on gsm8k: Deeply understanding the problems makes llms better solvers for math word problems. Frontiers of Computer Science, 2...

  25. [25]

    and OpenAI o1 (Jaech et al., 2024), in a diversity of natural language processing tasks (Shao et al., 2024; Chen et al., 2024; Zhong et al., 2026b). Motivated by this, extending the advantage of long-CoT reasoning to multimodal context has attracted significant interest (Wang et al., 2024b; 2025; Zhu et al., 2025; Bai et al., 2025b;a). To achieve this goa...

  26. [26]

    self-improvement

    does not strictly require the explicit reasoning trajectories, cold-start training with these trajectories can effectively improve the performance and training efficiency (Yang et al., 2025b), which also underscores the importance of these trajectories. Self-improvement Training for MLLMs. To reduce the reliance on explicit reasoning trajectories, a “self-...

  27. [27]

    In the preference learning setting, Pang et al

    extend STaR by sampling multiple responses for each question. In the preference learning setting, Pang et al. (2024) and Wang et al. (2024a) propose to construct preference pairs by using the self-generated correct responses as the pair winners and the incorrect responses as the pair losers. Recent advances attempt to extend self-improvement training to m...

  28. [28]

    However, these efforts either fail to exploit prior failed solutions fully or rely on external models and additional computational overhead to estimate the language prior bias

    and designing metrics to measure the language priors (He et al., 2025; Liu et al., 2025a). However, these efforts either fail to exploit prior failed solutions fully or rely on external models and additional computational overhead to estimate the language prior bias. Different from them, we propose two simple-yet-effective approaches to address these prob...

  29. [29]

    is a widely-used medical visual question-answering (VQA) task, which contains both … Does the picture contain lung? A. Yes B. No SLAKE ✔ <think>To determine if the image contains lung, I analyzed the content. The image appears to be a cross-sectional scan of a thoracic region, typically us...

  30. [30]

    In the DPO training phase, the batch size is set to 16, and the peak learning rate is set to 1e-5

    The max image pixels are set to 512×512. In the DPO training phase, the batch size is set to 16, and the peak learning rate is set to 1e-5. All models are trained for 3 epochs. Both SFT and DPO training are performed using the popular LLaMA-Factory4 toolkit, following prior work (Zhong et al., 2025). As for GRPO training, the batch size is set to 32, and...

  31. [31]

    We use the EasyR15 as the training framework of GRPO

    Each model is trained for 3 epochs. We use the EasyR15 as the training framework of GRPO. Notably, for the model optimizer of all settings, we keep the vision encoder and multimodal projector fixed, and only update the parameters of the LLM backbone. All experiments are conducted on 8 NVIDIA A800 (80GB) GPUs. For the model evaluation, we use the greedy de...

  32. [32]

    Description Hallucination

    We observe that when k is too small (k=1), the model suffers from over-calibration, truncating at very early prefix tokens and degrading sampling efficiency. Conversely, when k is too large (k=50), the model overlooks many critical tokens, reducing the effectiveness of prefix resampling. With k=5, the model achieves the best overall performance, and we th...