arxiv: 2606.27373 · v1 · pith:BLQD2IGJnew · submitted 2026-06-25 · 💻 cs.CV

Paying More Attention to Visual Tokens in Self-Evolving Large Multimodal Models

Pith reviewed 2026-06-26 04:59 UTC · model grok-4.3

classification 💻 cs.CV
keywords visual under-conditioning, self-evolving LMMs, VISE, invariance rewards, geometric invariance, semantic invariance, visual conditioning, unsupervised multimodal training

The pith

VISE uses geometric and semantic invariance rewards to make self-evolving multimodal models attend to visual tokens rather than language priors.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Self-evolving large multimodal models often underuse image content because their self-consistency rewards can be satisfied from language priors alone. VISE addresses this by training a single model on unlabeled images with two invariance rewards. The geometric reward requires outputs to stay spatially consistent under known transformations, while the semantic reward requires the model to recognize that evidence is missing once the image regions it relied on are perturbed. The paper reports that this unsupervised approach improves captioning and visual question answering results across four model families.

Core claim

VISE is a purely unsupervised self-evolving framework that directly regularizes the model's visual conditioning policy through two complementary invariance-based rewards: a geometric invariance reward that enforces spatial consistency under known transformations, and a semantic invariance reward that penalizes evidence-agnostic generation by requiring the model to recognize the absence of evidence when predicted regions are perturbed. It operates within a single model without specialist roles, external reward models, or annotations, and is trained on raw unlabeled images, leading to gains on 18 benchmarks.
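
The geometric invariance reward described above can be illustrated with a minimal sketch. The abstract does not give the reward's formula, so everything here is an assumption: the choice of a horizontal flip as the known transformation, the use of plain IoU as the agreement score, and the names `hflip_box`, `iou`, and `geometric_invariance_reward` are all hypothetical. The idea is that a box predicted on the original image, mapped through the flip, should match the box predicted on the flipped image.

```python
from typing import Tuple

# Box format is an assumption: (x_min, y_min, x_max, y_max) in pixels.
Box = Tuple[float, float, float, float]


def hflip_box(box: Box, image_width: float) -> Box:
    """Map a box through a horizontal flip of an image with the given width."""
    x_min, y_min, x_max, y_max = box
    return (image_width - x_max, y_min, image_width - x_min, y_max)


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    area_a = max(0.0, a[2] - a[0]) * max(0.0, a[3] - a[1])
    area_b = max(0.0, b[2] - b[0]) * max(0.0, b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def geometric_invariance_reward(
    box_on_original: Box, box_on_flipped: Box, image_width: float
) -> float:
    """Hypothetical reward: agreement between the transformed original
    prediction and the prediction made on the transformed image."""
    expected = hflip_box(box_on_original, image_width)
    return iou(expected, box_on_flipped)
```

A model that ignores the image and emits the same box regardless of the flip would score low under this sketch whenever the object is off-center, which is the behavior such a reward is meant to discourage.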

What carries the argument

The VISE framework, which applies a geometric invariance reward for spatial consistency and a semantic invariance reward for evidence recognition to regularize visual conditioning in self-evolving LMMs.
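
As a rough illustration of the semantic invariance idea, the sketch below masks a predicted region and rewards the model only if its answer on the masked image acknowledges that the evidence is gone. This is not the paper's formulation, which the abstract does not spell out. The zero-fill masking, the keyword-based abstention check, the marker phrases, and the names `mask_region` and `semantic_invariance_reward` are all assumptions made for the example.

```python
from typing import List, Sequence, Tuple


def mask_region(
    image: List[List[int]], box: Tuple[int, int, int, int], fill: int = 0
) -> List[List[int]]:
    """Return a copy of a 2D image with the box (x0, y0, x1, y1) filled in.

    The box is half-open: x0 <= x < x1 and y0 <= y < y1.
    """
    x0, y0, x1, y1 = box
    return [
        [fill if (x0 <= x < x1 and y0 <= y < y1) else px for x, px in enumerate(row)]
        for y, row in enumerate(image)
    ]


def semantic_invariance_reward(
    answer_on_perturbed: str,
    abstain_markers: Sequence[str] = ("not visible", "cannot be determined"),
) -> float:
    """Hypothetical reward: 1.0 if the answer on the perturbed image signals
    missing evidence, 0.0 if the model answers as though nothing changed."""
    text = answer_on_perturbed.strip().lower()
    return 1.0 if any(marker in text for marker in abstain_markers) else 0.0
```

A real system would need a more robust abstention check than substring matching; the point of the sketch is only that an evidence-agnostic answer earns no reward once the supporting region has been removed.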

If this is right

  • With Qwen3-VL-2B as the base model, VISE gains +16.85 CIDEr on COCO and +19.66 CIDEr on TextCaps.
  • Object hallucination drops by 5.0 CHAIR-I points.
  • The gains generalize across four model families and scales.
  • Training works on raw unlabeled images without any annotations.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The rewards could be adapted to other self-training setups to improve grounding in generated outputs.
  • Measuring attention weights on visual tokens before and after training would test if the mechanism works as intended.
  • Similar invariance ideas might help in text-only or other modality self-evolving systems.
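
The attention measurement suggested in the list above could be sketched as follows. This is an assumed diagnostic, not one the paper reports: it takes one row of attention weights (one generated token attending over the context) and a hypothetical boolean mask marking which context positions are visual tokens, and returns the share of attention mass that lands on the visual positions.

```python
from typing import Sequence


def visual_attention_fraction(
    attn_row: Sequence[float], is_visual: Sequence[bool]
) -> float:
    """Fraction of one attention row's total mass placed on visual tokens."""
    if len(attn_row) != len(is_visual):
        raise ValueError("attention row and mask must have the same length")
    total = sum(attn_row)
    if total <= 0:
        return 0.0
    visual = sum(w for w, v in zip(attn_row, is_visual) if v)
    return visual / total
```

Averaging this value over generated tokens, heads, and layers for the base model and for the trained model would give a before/after comparison of the kind the bullet describes.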

Load-bearing premise

The geometric and semantic invariance rewards specifically cause the decoder to increase attention to visual tokens during generation.

What would settle it

Two observations would settle it. Attention to visual tokens remaining unchanged or decreasing after VISE training while metrics still improve would count against the premise, whereas gains that disappear when images are withheld would support it.

Original abstract

Recently, self-evolving large multimodal models (LMMs) have received attention for improving visual reasoning in a purely unsupervised setting. However, multi-role self-play and self-consistency reward schemes in existing self-evolving LMMs optimize answer agreement without ensuring the decoder attends to visual content, relying instead on statistical language priors to produce self-consistent outputs. This leads to a persistent failure mode we term visual under-conditioning, where the decoder relies on language priors rather than the image during generation, manifesting as insufficient attention to visual tokens. As a result, current self-evolving LMMs struggle on vision-language understanding tasks such as image captioning and visual question answering. To address this, we propose VISE (Visual Invariance Self-Evolution), a purely unsupervised self-evolving framework that directly regularizes the model's visual conditioning policy through two complementary invariance-based rewards: a geometric invariance reward that enforces spatial consistency under known transformations, and a semantic invariance reward that penalizes evidence-agnostic generation by requiring the model to recognize the absence of evidence when predicted regions are perturbed. VISE operates within a single model without specialist roles, external reward models, or annotations, and is trained on raw unlabeled images. Experiments on 18 benchmarks demonstrate the efficacy of our approach. Using Qwen3-VL-2B as the base model, VISE achieves gains of +16.85 CIDEr on COCO and +19.66 CIDEr on TextCaps, reduces object hallucination by 5.0 CHAIR-I points, and generalizes across four model families and scales. Our code and models are available at https://mbzuai-oryx.github.io/VISE

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper claims that self-evolving LMMs suffer from visual under-conditioning because self-consistency rewards allow reliance on language priors rather than visual content. It introduces VISE, a single-model unsupervised framework using a geometric invariance reward (enforcing spatial consistency under transformations) and a semantic invariance reward (penalizing evidence-agnostic outputs on perturbed regions) to directly regularize the decoder's visual conditioning policy. Experiments with Qwen3-VL-2B and other models report gains of +16.85 CIDEr on COCO, +19.66 CIDEr on TextCaps, and -5.0 Chair-I points on object hallucination, with generalization across four model families on 18 benchmarks.

Significance. If the invariance rewards are shown to causally increase decoder attention to visual tokens (rather than acting as generic regularization), the result would be significant for unsupervised self-evolution of multimodal models. The purely single-model, annotation-free setting and cross-scale generalization are strengths; the magnitude of the reported CIDEr and hallucination reductions would be notable if the mechanism is isolated.

major comments (2)
  1. [Abstract and §3] Abstract and §3 (method description): the central claim that the geometric and semantic invariance rewards increase decoder attention to visual tokens (addressing 'visual under-conditioning') lacks supporting evidence such as quantitative attention-weight statistics, before/after attention-map comparisons, or ablations that isolate attention change from self-consistency or regularization effects; without this, the reported metric gains cannot be attributed to the stated policy change.
  2. [Experiments] Experiments section (results on COCO/TextCaps/Chair-I): the +16.85 CIDEr, +19.66 CIDEr, and -5.0 Chair-I improvements are presented as evidence of better visual conditioning, yet no ablation or diagnostic (e.g., attention entropy on visual vs. text tokens, or controlled perturbation tests) rules out alternative explanations such as improved language-model consistency; this is load-bearing for the paper's interpretation.
minor comments (2)
  1. The abstract states results on '18 benchmarks' but provides no explicit list or summary table; adding one would improve clarity.
  2. [§3] Notation for the two rewards (geometric and semantic) should be introduced with explicit equations early in §3 to avoid ambiguity when describing the combined objective.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We address each major comment below, acknowledging where the current manuscript provides only indirect support for the mechanism and committing to revisions that add the requested diagnostics.

  1. Referee: [Abstract and §3] Abstract and §3 (method description): the central claim that the geometric and semantic invariance rewards increase decoder attention to visual tokens (addressing 'visual under-conditioning') lacks supporting evidence such as quantitative attention-weight statistics, before/after attention-map comparisons, or ablations that isolate attention change from self-consistency or regularization effects; without this, the reported metric gains cannot be attributed to the stated policy change.

    Authors: We agree that the manuscript currently relies on indirect evidence: the reward formulations explicitly target visual content (spatial consistency under transformations and penalization of evidence-agnostic outputs on perturbed regions), together with large gains on captioning and hallucination benchmarks that require visual grounding. Direct attention statistics are absent. In the revised version we will add quantitative attention-entropy measurements on visual versus text tokens before and after training, plus controlled perturbation ablations, to isolate the claimed policy change from generic regularization. revision: yes

  2. Referee: [Experiments] Experiments section (results on COCO/TextCaps/Chair-I): the +16.85 CIDEr, +19.66 CIDEr, and -5.0 Chair-I improvements are presented as evidence of better visual conditioning, yet no ablation or diagnostic (e.g., attention entropy on visual vs. text tokens, or controlled perturbation tests) rules out alternative explanations such as improved language-model consistency; this is load-bearing for the paper's interpretation.

    Authors: We concur that alternative explanations must be ruled out for the interpretation to hold. The semantic invariance reward is intended to enforce visual dependence rather than mere consistency, but the manuscript does not yet contain the requested controlled comparisons. We will add (i) an ablation that removes the visual-perturbation component and (ii) a direct comparison against a pure self-consistency baseline, together with the attention-entropy diagnostics mentioned above, in the revised manuscript. revision: yes
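
The attention-entropy diagnostic named in the responses above is not defined in the material available here, so the following is only one plausible reading. It computes the Shannon entropy (in nats) of a single attention row restricted to one token group and renormalized within that group. The function name `group_attention_entropy` and the mask argument are hypothetical.

```python
import math
from typing import Sequence


def group_attention_entropy(
    attn_row: Sequence[float], in_group: Sequence[bool]
) -> float:
    """Shannon entropy (nats) of attention renormalized over one token group."""
    if len(attn_row) != len(in_group):
        raise ValueError("attention row and mask must have the same length")
    weights = [w for w, g in zip(attn_row, in_group) if g and w > 0]
    total = sum(weights)
    if total <= 0:
        return 0.0
    return -sum((w / total) * math.log(w / total) for w in weights)
```

Calling it once with the visual-token mask and once with the text-token mask gives the visual versus text comparison. Entropy alone describes how spread out the attention is within a group, not how much mass the group receives, so it is best read alongside a mass-based measure.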

Circularity Check

0 steps flagged

No significant circularity in derivation chain

The paper proposes VISE as a new unsupervised self-evolving framework that defines geometric and semantic invariance rewards to regularize visual conditioning. These rewards are constructed as part of the method itself rather than derived from prior results, and the paper reports empirical performance on 18 external benchmarks using base models like Qwen3-VL-2B. No equations or claims reduce by construction to fitted inputs, self-citations, or renamed known results; the central claims rest on the explicit reward formulations and observed metric gains rather than self-referential loops. The method is self-contained against external validation.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review provides no explicit free parameters, axioms, or invented entities; full text required for any ledger entries.

