pith. machine review for the scientific record.

arxiv: 2605.11931 · v1 · submitted 2026-05-12 · 💻 cs.CV

Recognition: no theorem link

Learn to Think: Improving Multimodal Reasoning through Vision-Aware Self-Improvement Training

Authors on Pith: no claims yet

Pith reviewed 2026-05-13 06:04 UTC · model grok-4.3

classification 💻 cs.CV
keywords multimodal reasoning · self-improvement training · MLLMs · vision-aware attention · prefix resampling · language prior bias · reasoning traces · post-training

The pith

VISTA corrects data imbalance and language prior bias in self-improvement training to boost multimodal reasoning in MLLMs.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper aims to establish that self-improvement training for multimodal large language models suffers from two problems: data imbalance, where simple samples are overtrained while hard but crucial ones are undertrained, and language prior bias, where text patterns are favored over visual evidence. VISTA counters both with a prefix resampling method that reuses the useful segments of partially correct reasoning traces and a vision-aware attention score that measures, and encourages, focus on image content during generation. If these changes hold, models can produce higher-quality training data internally, raising reasoning accuracy on tasks that require integrating sight and language. A sympathetic reader would care because expensive external reasoning traces become less necessary for building capable multimodal systems.
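
To make the prefix resampling idea concrete, here is a minimal sketch of the loop it implies: for a hard query, take a self-generated trace whose final answer was wrong, keep a usable prefix, and resample continuations from that prefix until a verified-correct solution appears. The helpers `truncate_to_usable_prefix`, `generate_continuation`, and `is_correct` are illustrative placeholders, not the paper's actual interfaces, and the paper's rule for choosing the prefix may differ.

```python
import random

def prefix_resample(query, failed_traces, truncate_to_usable_prefix,
                    generate_continuation, is_correct, budget=8):
    """Collect verified solutions for a hard query by reusing partial traces.

    failed_traces: earlier self-generated traces whose final answers were wrong
    but whose early reasoning steps may still be sound. All callables are
    hypothetical stand-ins for the model, the prefix-selection rule, and the
    answer verifier; this is a sketch of the idea, not the paper's procedure.
    """
    collected = []
    for _ in range(budget):
        if not failed_traces:
            break
        trace = random.choice(failed_traces)
        prefix = truncate_to_usable_prefix(trace)             # keep the reusable segment
        if not prefix:
            continue
        continuation = generate_continuation(query, prefix)   # resample from the prefix
        candidate = prefix + continuation
        if is_correct(query, candidate):                      # check against the gold answer
            collected.append(candidate)
    return collected
```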

Core claim

By introducing a prefix resampling strategy that reuses partially correct reasoning traces for efficient data collection, and a vision-aware attention score that quantifies the model's focus on visual information, VISTA mitigates data imbalance and language prior bias in self-generated reasoning data. The result is improved multimodal reasoning when the framework is applied to supervised fine-tuning and preference learning across various MLLMs and tasks.

What carries the argument

VISTA's prefix resampling strategy paired with its vision-aware attention score, where the score calculates attention directed to visual tokens to promote image-grounded reasoning over linguistic shortcuts.
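
As a concrete reading of that score, the sketch below computes the fraction of attention mass that generated response tokens place on visual tokens, averaged over heads at one decoder layer, and thresholds it to filter candidate traces. The normalization, layer choice, and threshold value are assumptions for illustration; the paper's exact definition of the vision-aware attention score may differ.

```python
import torch

def vision_aware_attention_score(attn, visual_mask, response_mask):
    """Share of attention that response tokens direct at visual tokens.

    attn: (num_heads, seq_len, seq_len) attention weights from one decoder layer.
    visual_mask: (seq_len,) bool, True at image-token positions.
    response_mask: (seq_len,) bool, True at self-generated response positions.
    One plausible reading of the paper's VAS, not its exact formula.
    """
    attn = attn.float().mean(dim=0)                     # average over heads
    rows = attn[response_mask]                          # attention rows of response tokens
    mass_on_visual = rows[:, visual_mask].sum(dim=-1)   # mass landing on image tokens
    total_mass = rows.sum(dim=-1).clamp_min(1e-8)
    return (mass_on_visual / total_mass).mean().item()

def keep_trace(attn, visual_mask, response_mask, tau=0.2):
    # Filter out candidate traces whose visual focus falls below a threshold;
    # tau here is a hypothetical value (the paper analyzes this threshold, see Figure 4c).
    return vision_aware_attention_score(attn, visual_mask, response_mask) >= tau
```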

Load-bearing premise

That the prefix resampling strategy reuses partial traces without introducing new biases and that the vision-aware attention score reliably quantifies and corrects language prior bias in a way that directly causes performance gains.

What would settle it

Ablating the vision-aware attention score during training and checking whether performance gains disappear on tasks that require strong visual grounding, such as diagram-based problem solving or counting objects in complex scenes.
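
A minimal harness for that test, holding the training recipe fixed and toggling only the VAS-based filter, might look like the sketch below; `train_and_eval`, the benchmark handles, and the threshold are placeholders rather than anything specified in the paper.

```python
def vas_ablation(candidates, train_and_eval, benchmarks, tau=0.2):
    """Compare training on VAS-filtered vs. unfiltered self-generated data.

    candidates: list of (sample, vas_score) pairs from the same sampling run.
    train_and_eval: hypothetical callable that fine-tunes a model on the given
    samples and returns {benchmark: accuracy}; tau is an assumed threshold.
    """
    filtered = [s for s, vas in candidates if vas >= tau]
    unfiltered = [s for s, _ in candidates]
    acc_with = train_and_eval(filtered, benchmarks)
    acc_without = train_and_eval(unfiltered, benchmarks)
    # If the gains vanish without the VAS filter on visually grounded tasks
    # (diagram problems, object counting), the score is doing real causal work.
    return {b: acc_with[b] - acc_without[b] for b in benchmarks}
```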

Figures

Figures reproduced from arXiv: 2605.11931 by Bo Du, Dacheng Tao, Juhua Liu, Liang Ding, Qihuang Zhong, Wenjie Xuan.

Figure 1
Figure 1: Comparison of two self-generated reasoning traces. As seen, although predicting the correct answer, MLLMs may still exhibit visual hallucinations during intermediate reasoning processes, due to over-reliance on language priors and neglect of visual cues. Encouragingly, our proposed vision-aware attention score (VAS) can accurately identify these hallucinated solutions.
Figure 2
Figure 2: (a) Distribution of the number of correct solutions in a single query. (b) Distribution of self-generated training samples for different difficulty levels, where level-1 denotes the simplest and level-4 denotes the hardest. (c) Attention allocation between system prompts, visual, and instruction tokens across different model layers. Here, we use the Qwen2.5-VL-3B-Instruct as the base model.
Figure 3
Figure 3: Overview of our VISTA framework, which consists of two simple-yet-effective strategies: 1) prefix resampling, aiming to collect more accurate solutions for difficult queries; 2) vision-aware attention score, aiming to filter out undesired hallucinated solutions.
Figure 4
Figure 4: (a) Performance comparison of tuned Qwen2.5-VL-7B models using different data collection methods. (b) Performance comparison of tuned Qwen2.5-VL-3B models using different data selection metrics on SLAKE. (c) Parameter analysis of threshold τ in VISTA on Qwen2.5-VL-3B models. Notably, in these experiments, we perform the self-improvement SFT training for one iteration.
Figure 6
Figure 6: Comparison of OOD results between tuned Qwen2.5-VL-3B models using different self-improvement SFT methods. The x-axis denotes the index of self-improvement iteration.
Figure 7
Figure 7: Examples of self-generated long-CoT data in various multimodal reasoning tasks.
Figure 8
Figure 8: (a) Sampling success rate of our prefix resampling on the hardest queries (i.e., without any prior correct solutions). The x-axis denotes the number of resampled correct solutions in a query. (b) Distribution of the number of correct tokens in a query before and after using our prefix resampling strategy. Here, we show the results on the challenging Geometry3K task. (c) Distribution of our VAS scores in co…
Figure 9
Figure 9: Examples of two types of self-generated hallucinated solutions, where the hallucinated contents are highlighted in red.
Figure 10
Figure 10: (a) Analysis of layer depth for calculating VAS scores. (b) Performance comparison between with and without the self-consistency method. Notably, we report the results of Qwen2.5-VL-3B-Instruct models self-trained for one iteration.
Figure 11
Figure 11: Illustration of our prefix resampling strategy. Notably, we use examples from the SLAKE (Left) and ChartQA (Right).
Figure 12
Figure 12: Examples of self-generated solutions with our VAS scores in the VQA-Rad (Left) and ChartQA (Right) tasks.
Figure 13
Figure 13: Examples of solutions predicted by Qwen2.5-VL-3B models tuned with different self-improvement training methods.
read the original abstract

Post-training with explicit reasoning traces is common to improve the reasoning capabilities of Multimodal Large Language Models (MLLMs). However, acquiring high-quality reasoning traces is often costly and time-consuming. Hence, the self-improvement paradigm has emerged, enabling MLLMs to self-generate reasoning traces for training without external supervision. Despite its effectiveness, we reveal two shortcomings in the self-improvement training of MLLMs: 1) data imbalance, where simple samples are over-trained, but the challenging yet crucial samples are under-trained; 2) language prior bias, where MLLMs overly rely on linguistic priors while neglecting the visual cues. To this end, we propose VISTA, a vision-aware self-improvement training framework for enhancing the multimodal reasoning of MLLMs. Specifically, VISTA first introduces a prefix resampling strategy to reuse the partial correct reasoning traces for efficient data collection, and then designs a vision-aware attention score to quantify the model's focus on visual information. Extensive experiments show that VISTA can be applied to various post-training scenarios, i.e., supervised fine-tuning and preference learning, and effectively enhances the multimodal reasoning performance across various MLLMs and tasks, e.g., bringing up to +13.66% average performance gains for Qwen2.5-VL-3B-Instruct.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Circularity Check

0 steps flagged

No circularity: empirical framework with independent experimental validation

full rationale

The paper presents VISTA as an empirical self-improvement method using prefix resampling and a vision-aware attention score, with all performance claims (+13.66% on Qwen2.5-VL-3B-Instruct) reported strictly as outcomes of experiments across MLLMs and tasks. No equations, derivations, fitted parameters renamed as predictions, or load-bearing self-citations appear in the provided text. The central claims rest on external benchmarks and ablations rather than reducing to inputs by construction, making the work self-contained with no circular steps.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

This is an empirical machine-learning paper whose central claim rests on experimental results rather than formal axioms or derivations. No free parameters beyond standard training hyperparameters are mentioned. The vision-aware attention score is a new metric introduced by the authors.

axioms (1)
  • standard math: Standard assumptions of gradient-based optimization and attention mechanisms in transformer models
    Implicit in any fine-tuning or attention-based training paper.
invented entities (1)
  • Vision-aware attention score: no independent evidence
    purpose: Quantify and encourage model focus on visual information during self-improvement
    Newly defined metric in the proposed framework; no independent evidence outside the paper's experiments.

pith-pipeline@v0.9.0 · 5550 in / 1335 out tokens · 87696 ms · 2026-05-13T06:04:27.587077+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

32 extracted references · 32 canonical work pages · 6 internal anchors

  1. [1]

    The internal state of an LLM knows when it's lying

    Azaria, A. and Mitchell, T. The internal state of an LLM knows when it's lying. In Findings of the Association for Computational Linguistics: EMNLP 2023,

  2. [2]

    Qwen3-VL Technical Report

    Bai, S., Cai, Y., Chen, R., Chen, K., Chen, X., Cheng, Z., Deng, L., Ding, W., Gao, C., Ge, C., et al. Qwen3-VL technical report. arXiv preprint arXiv:2511.21631, 2025a. Bai, S., Chen, K., Liu, X., Wang, J., Ge, W., Song, S., Dang, K., Wang, P., Wang, S., Tang, J., et al. Qwen2.5-VL technical report. arXiv preprint arXiv:2502.13923, 2025b. Bai, Z., Wang,...

  3. [3]

    Huatuogpt-o1, towards medical complex reasoning with LLMs

    Chen, J., Cai, Z., Ji, K., Wang, X., Liu, W., Wang, R., Hou, J., and Wang, B. Huatuogpt-o1, towards medical complex reasoning with LLMs. arXiv preprint arXiv:2412.18925,

  4. [4]

    Vision-language models can self-improve reasoning via reflection

    Cheng, K., YanTao, L., Xu, F., Zhang, J., Zhou, H., and Liu, Y. Vision-language models can self-improve reasoning via reflection. In Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers),

  5. [5]

    Self-improvement in multimodal large language models: A survey

    Deng, S., Wang, K., Yang, T., Singh, H., and Tian, Y. Self-improvement in multimodal large language models: A survey. In Findings of the Association for Computational Linguistics: EMNLP 2025,

  6. [6]

    Mitigating tail narrowing in llm self-improvement via socratic-guided sampling

    Ding, Y., Xi, Z., He, W., Lizhuoyuan, L., Zhai, Y., Xiaowei, S., Cai, X., Gui, T., Zhang, Q., and Huang, X.-J. Mitigating tail narrowing in llm self-improvement via socratic-guided sampling. In Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies,

  7. [7]

    Reinforced Self-Training (ReST) for Language Modeling

    Gulcehre, C., Paine, T. L., Srinivasan, S., Konyushkova, K., Weerts, L., Sharma, A., Siddhant, A., Ahern, A., Wang, M., Gu, C., et al. Reinforced self-training (ReST) for language modeling. arXiv preprint arXiv:2308.08998,

  8. [8]

    DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning

    Guo, D., Yang, D., Zhang, H., Song, J., Zhang, R., Xu, R., Zhu, Q., Ma, S., Wang, P., Bi, X., et al. Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning. arXiv preprint arXiv:2501.12948,

  9. [9]

    PathVQA: 30000+ Questions for Medical Visual Question Answering

    He, X., Zhang, Y., Mou, L., Xing, E., and Xie, P. Pathvqa: 30000+ questions for medical visual question answering. arXiv preprint arXiv:2003.10286,

  10. [10]

    Self-improvement in language models: The sharpening mechanism

    Huang, A., Block, A., Foster, D. J., Rohatgi, D., Zhang, C., Simchowitz, M., Ash, J. T., and Krishnamurthy, A. Self-improvement in language models: The sharpening mechanism. In The Thirteenth International Conference on Learning Representations, 2025a. Huang, J., Gu, S., Hou, L., Wu, Y., Wang, X., Yu, H., and Han, J. Large language models can self-improve...

  11. [11]

    Visual hallucinations of multi-modal large language models

    Huang, W., Liu, H., Guo, M., and Gong, N. Visual hallucinations of multi-modal large language models. In Findings of the Association for Computational Linguistics: ACL 2024,

  12. [12]

    Medvl-thinker: Simple baselines for multimodal medical reasoning

    Huang, X., Wu, J., Liu, H., Tang, X., and Zhou, Y. Medvl-thinker: Simple baselines for multimodal medical reasoning. arXiv preprint arXiv:2508.02669, 2025b. Jaech, A., Kalai, A., Lerer, A., Richardson, A., El-Kishky, A., Low, A., Helyar, A., Madry, A., Beutel, A., Carney, A., et al. Openai o1 system card. arXiv preprint arXiv:2412.16720,

  13. [13]

    The first few tokens are all you need: An efficient and effective unsupervised prefix fine-tuning method for reasoning models

    Ji, K., Xu, J., Liang, T., Liu, Q., He, Z., Chen, X., Liu, X., Wang, Z., Chen, J., Wang, B., et al. The first few tokens are all you need: An efficient and effective unsupervised prefix fine-tuning method for reasoning models. arXiv preprint arXiv:2503.02875,

  14. [14]

    More thinking, less seeing? Assessing amplified hallucination in multimodal reasoning models

    Liu, C., Xu, Z., Wei, Q., Wu, J., Zou, J., Wang, X. E., Zhou, Y., and Liu, S. More thinking, less seeing? Assessing amplified hallucination in multimodal reasoning models. arXiv preprint arXiv:2505.21523, 2025a. Liu, N. F., Gardner, M., Belinkov, Y., Peters, M. E., and Smith, N. A. Linguistic knowledge and transferability of contextual representations. ...

  15. [15]

    ChartQA: A benchmark for question answering about charts with visual and logical reasoning

    Masry, A., Do, X. L., Tan, J. Q., Joty, S., and Hoque, E. Chartqa: A benchmark for question answering about charts with visual and logical reasoning. In Findings of the Association for Computational Linguistics: ACL 2022,

  16. [16]

    DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models

    Shao, Z., Wang, P., Zhu, Q., Xu, R., Song, J., Bi, X., Zhang, H., Zhang, M., Li, Y., Wu, Y., et al. Deepseekmath: Pushing the limits of mathematical reasoning in open language models. arXiv preprint arXiv:2402.03300,

  17. [17]

    Wang, Y., Chen, W., Han, X., Lin, X., Zhao, H., Liu, Y., Zhai, B., Yuan, J., You, Q., and Yang, H. Exploring the reasoning abilities of multimodal large language models (mllms): A comprehensive survey on emerging trends in multimodal reasoning. arXiv preprint arXiv:2401.06805, 2024b. Wang, Y., Wu, S., Zhang, Y., Yan, S., Liu, Z., Luo, J., and Fei, H. M...

  18. [18]

    Reinforcement Learning with Verifiable Rewards Implicitly Incentivizes Correct Reasoning in Base LLMs

    Wen, X., Liu, Z., Zheng, S., Ye, S., Wu, Z., Wang, Y., Xu, Z., Liang, X., Li, J., Miao, Z., et al. Reinforcement learning with verifiable rewards implicitly incentivizes correct reasoning in base llms. arXiv preprint arXiv:2506.14245,

  19. [19]

    Lingshu: A generalist foundation model for unified multimodal medical understanding and reasoning

    Xu, W., Chan, H. P., Li, L., Aljunied, M., Yuan, R., Wang, J., Xiao, C., Chen, G., Liu, C., Li, Z., et al. Lingshu: A generalist foundation model for unified multimodal medical understanding and reasoning. arXiv preprint arXiv:2506.07044,

  20. [20]

    Dynamic early exit in reasoning models

    Yang, C., Si, Q., Duan, Y., Zhu, Z., Zhu, C., Li, Q., Lin, Z., Cao, L., and Wang, W. Dynamic early exit in reasoning models. arXiv preprint arXiv:2504.15895, 2025a. Yang, S., Tong, Y., Niu, X., Neubig, G., and Yue, X. Demystifying long chain-of-thought reasoning. In Forty-second International Conference on Machine Learning, 2025b. Yuan, Z., Yuan, H., Li...

  21. [21]

    Illusionbench: A large-scale and comprehensive benchmark for visual illusion understanding in vision-language models

    Zhang, Y., Zhang, Z., Wei, X., Liu, X., Zhai, G., and Min, X. Illusionbench: A large-scale and comprehensive benchmark for visual illusion understanding in vision-language models. arXiv preprint arXiv:2501.00848,

  22. [22]

    Thinking before looking: Improving multimodal LLM reasoning via mitigating visual hallucination

    Zheng, H., Xu, T., Sun, H., Pu, S., Chen, R., and Sun, L. Thinking before looking: Improving multimodal LLM reasoning via mitigating visual hallucination. arXiv preprint arXiv:2411.12591,

  23. [23]

    Kaft: Knowledge-aware fine-tuning for boosting llms’ domain-specific question-answering performance

    Zhong, Q., Ding, L., Cai, X., Liu, J., Du, B., and Tao, D. Kaft: Knowledge-aware fine-tuning for boosting llms’ domain-specific question-answering performance. In Findings of the Association for Computational Linguistics: ACL 2025,

  24. [24]

    InternVL3: Exploring Advanced Training and Test-Time Recipes for Open-Source Multimodal Models

    Zhong, Q., Ding, L., Liu, J., Du, B., Rutkowski, L., and Tao, D. Better, faster: Harnessing self-improvement in large reasoning models. arXiv preprint, 2026a. Zhong, Q., Wang, K., Xu, Z., Ding, L., Liu, J., and Du, B. Achieving >97% on gsm8k: Deeply understanding the problems makes llms better solvers for math word problems. Frontiers of Computer Science, 2...

  25. [25]

    and OpenAI o1 (Jaech et al., 2024), in a diversity of natural language processing tasks (Shao et al., 2024; Chen et al., 2024; Zhong et al., 2026b). Motivated by this, extending the advantage of long-CoT reasoning to multimodal context has attracted significant interest (Wang et al., 2024b; 2025; Zhu et al., 2025; Bai et al., 2025b;a). To achieve this goa...

  26. [26]

    self-improvement

    does not strictly require the explicit reasoning trajectories, cold-start training with these trajectories can effectively improve the performance and training efficiency (Yang et al., 2025b), which also underscores the importance of these trajectories. Self-improvement Training for MLLMs. To reduce the reliance on explicit reasoning trajectories, a “self-...

  27. [27]

    In the preference learning setting, Pang et al

    extend STaR by sampling multiple responses for each question. In the preference learning setting, Pang et al. (2024) and Wang et al. (2024a) propose to construct preference pairs by using the self-generated correct responses as the pair winners and the incorrect responses as the pair losers. Recent advances attempt to extend self-improvement training to m...

  28. [28]

    However, these efforts either fail to exploit prior failed solutions fully or rely on external models and additional computational overhead to estimate the language prior bias

    and designing metrics to measure the language priors (He et al., 2025; Liu et al., 2025a). However, these efforts either fail to exploit prior failed solutions fully or rely on external models and additional computational overhead to estimate the language prior bias. Different from them, we propose two simple-yet-effective approaches to address these prob...

  29. [29]

    is a widely-used medical visual question-answering (VQA) task, which contains both … Does the picture contain lung? A. Yes B. No SLAKE ✔ <think>To determine if the image contains lung, I analyzed the content. The image appears to be a cross-sectional scan of a thoracic region, typically us...

  30. [30]

    In the DPO training phase, the batch size is set to 16, and the peak learning rate is set to 1e-5

    The max image pixels are set to 512×512. In the DPO training phase, the batch size is set to 16, and the peak learning rate is set to 1e-5. All models are trained for 3 epochs. Both SFT and DPO training are performed using the popular LLaMA-Factory4 toolkit, following prior work (Zhong et al., 2025). As for GRPO training, the batch size is set to 32, and...

  31. [31]

    We use the EasyR15 as the training framework of GRPO

    Each model is trained for 3 epochs. We use the EasyR15 as the training framework of GRPO. Notably, for the model optimizer of all settings, we keep the vision encoder and multimodal projector fixed, and only update the parameters of the LLM backbone. All experiments are conducted on 8 NVIDIA A800 (80GB) GPUs. For the model evaluation, we use the greedy de...

  32. [32]

    Description Hallucination

    We observe that when k is too small (k=1), the model suffers from over-calibration, truncating at very early prefix tokens and degrading sampling efficiency. Conversely, when k is too large (k=50), the model overlooks many critical tokens, reducing the effectiveness of prefix resampling. With k=5, the model achieves the best overall performance, and we th...