pith. sign in

arxiv: 2607.01707 · v1 · pith:R23N5BQ5new · submitted 2026-07-02 · 💻 cs.CV

LASER: A Corrective Lens for LVLMs via Visual Attention Preservation and Sink Suppression

Pith reviewed 2026-07-03 16:51 UTC · model grok-4.3

classification 💻 cs.CV
keywords visual forgettinglarge vision-language modelsattention preservationsink suppressionvisual grounding rewardpost-trainingLVLMsattention trajectory
0
0 comments X

The pith

LASER uses two rewards to keep large vision-language models focused on visual evidence instead of drifting away during long reasoning.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Large vision-language models lose access to visual evidence as decoding proceeds because attention moves away from images early on and piles up on a few unhelpful visual tokens. The paper identifies these two patterns through attention analysis and introduces LASER, a post-training method that adds a Visual Grounding Reward to sustain attention on salient image tokens and a Sink Suppression Reward to reduce focus on irrelevant ones. The approach regulates both the overall attention path and the distribution among visual tokens. Experiments show consistent gains over baselines on eight datasets. A sympathetic reader would see this as a direct way to embed visual grounding into the training signal rather than patching it at inference time.

Core claim

The central claim is that visual forgetting in LVLMs stems mainly from early-stage attention decay that blocks evidence acquisition and from attention concentrating on task-irrelevant visual sink tokens, and that applying a Visual Grounding Reward to maintain attention on salient visual tokens together with a Sink Suppression Reward to penalize excessive concentration on sinks during post-training corrects both issues, preserves the attention trajectory, and raises performance on downstream tasks.

What carries the argument

The pair of Visual Grounding Reward and Sink Suppression Reward that together enforce sustained attention on salient visual tokens and limit collapse onto sink tokens.

If this is right

  • The model keeps attention on semantically salient visual tokens across the full decoding sequence.
  • Attention no longer collapses onto a small set of task-irrelevant visual sink tokens.
  • Performance rises above strong baselines on all eight evaluated benchmark datasets.
  • Attention-aware post-training becomes a practical remedy for visual forgetting without altering model architecture.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same reward structure could be tested on models that process longer image sequences or video to check whether early decay is equally dominant.
  • Embedding these rewards might reduce reliance on inference-time attention interventions in other multimodal settings.
  • The method opens a route to compare attention regularization against data curation as ways to improve grounding.
  • If the sink tokens prove consistent across models, a shared suppression term could be developed once rather than per model.

Load-bearing premise

That early attention decay and concentration on sink tokens are the primary drivers of performance loss when visual forgetting occurs.

What would settle it

Training an LVLM with the two rewards yet finding that early-stage attention still decays away from visual tokens or that performance gains disappear on the same eight benchmarks.

Figures

Figures reproduced from arXiv: 2607.01707 by Bowen Yuan, Shijie Wang, Yadan Luo, Zi Huang, Zijian Wang.

Figure 1
Figure 1. Figure 1: Illustration of visual forgetting during long-horizon reasoning in LVLMs. Left: an input image and the corresponding question. Middle: the Visual Attention Proportion (VAP) over generation steps shows that attention to visual tokens peaks early and progressively decays during reasoning. Right: difference in attention maps reveals that later decoding stages could attend less to task relevant visual regions.… view at source ↗
Figure 2
Figure 2. Figure 2: Accuracy on MMStar under visual KV￾cache scaling factor at various reasoning stages. decays [26, 49, 53]. However, exist￾ing characterizations primarily doc￾ument the overall decay of visual at￾tention, without examining its tem￾poral sensitivity or internal alloca￾tion patterns. To uncover these finer￾grained dynamics, we re-examine attention trajectories throughout the reasoning process. Through case stu… view at source ↗
Figure 3
Figure 3. Figure 3: Steps Visualizations of attention maps over reasoning steps. We identify image regions that receive high attention score despite lacking semantic relevance to the task. grounding errors happen at this stage, errors in evidence acquisition propagate autoregressively throughout the remaining reasoning process. At the same time, we observe that the visual attention that persists is not always directed toward … view at source ↗
Figure 4
Figure 4. Figure 4: Analysis of visual sink token attention. Left. Redistributing attention away from sink tokens yields consistent accuracy gains over uniform scaling. Right. Visual sink attention ratio across generation steps for GRPO vanilla vs. GRPO + Rvis. 4.1 Preliminaries We build on Group Relative Policy Optimization (GRPO) [46], which eliminates the value function and estimates advantages by comparing rewards within … view at source ↗
Figure 5
Figure 5. Figure 5: Overview of LASER. Visual token attention is extracted during generation to compute two complementary rewards: Rvis sustains overall VAP across reasoning steps, while Rsupp identifies and suppresses sink tokens to rectify visual attention from visual sink tokens. process consists of two stages. First, we isolate a subset of massive activation dimensions D∗ ⊆ [D], where [D] denotes the full set of hidden st… view at source ↗
Figure 6
Figure 6. Figure 6: Analysis of VAP during generation on various datasets. As reported in Tab. 2, both Rvis and Rsupp yield consistent performance gains across all benchmarks. Specifically, integrating Rvis improves the average score from 51.1 to 52.2 by reducing visual forgetting and maintaining attention on task￾relevant regions. The subsequent addition of Rsupp further improves performance, particularly in general reasonin… view at source ↗
Figure 7
Figure 7. Figure 7: Left. Visual sink token attention across generation steps for model trained by LASER. Right. Response length during training process. 5.4 Further Analysis LASER preserves visual attention throughout generation [PITH_FULL_IMAGE:figures/full_fig_p013_7.png] view at source ↗
Figure 8
Figure 8. Figure 8: A qualitative comparison between Qwen2.5-VL-Instruct and LASER on a visual perception task. Premature visual disengagement in the base model leads to an erroneous inference, while LASER maintains visual attention and answers correctly. example, the base Qwen2.5-VL-Instruct model makes an erroneous summarization in its early-stage reasoning chain, resulting in an incorrect answer after long￾horizon reasonin… view at source ↗
Figure 9
Figure 9. Figure 9: Qualitative results of LASER [PITH_FULL_IMAGE:figures/full_fig_p028_9.png] view at source ↗
Figure 10
Figure 10. Figure 10: Qualitative results of LASER [PITH_FULL_IMAGE:figures/full_fig_p029_10.png] view at source ↗
Figure 11
Figure 11. Figure 11: Qualitative results of LASER [PITH_FULL_IMAGE:figures/full_fig_p030_11.png] view at source ↗
Figure 12
Figure 12. Figure 12: Qualitative results of LASER [PITH_FULL_IMAGE:figures/full_fig_p031_12.png] view at source ↗
read the original abstract

Large vision-language models (LVLMs) exhibit strong reasoning ability but suffer from visual forgetting during long-horizon decoding, where attention progressively drifts away from visual evidence. Existing methods largely treat this issue as a late-stage attention decay problem or attempt to mitigate it through heuristic reminders or post-hoc attention lifting. Through systematic empirical analysis, we find that performance degradation under visual forgetting is largely driven by two overlooked factors: early-stage attention decay disrupts evidence acquisition, and attention concentration on a subset of task-irrelevant visual sink tokens. Motivated by these insights, we propose LASER, a post-training framework that regulates both the visual attention trajectory and intra-visual token attention distribution during reasoning. Technically, LASER introduces two complementary rewards: a Visual Grounding Reward, which encourages the model to maintain attention on semantically salient visual tokens throughout decoding, and a Sink Suppression Reward, which penalizes excessive attention concentration on visual sink tokens. Together, these rewards preserve early-stage grounding while preventing attention collapse onto uninformative regions. Extensive experiments on eight benchmark datasets demonstrate that LASER consistently outperforms strong baselines, validating attention-aware training as an effective remedy for visual forgetting.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper claims that visual forgetting in LVLMs during long-horizon decoding is largely driven by two factors identified via systematic empirical analysis: early-stage attention decay that disrupts evidence acquisition, and attention concentration on task-irrelevant visual sink tokens. Motivated by this, it introduces LASER, a post-training framework using two rewards—a Visual Grounding Reward to maintain attention on semantically salient visual tokens and a Sink Suppression Reward to penalize excessive concentration on sinks—resulting in consistent outperformance over strong baselines on eight benchmark datasets.

Significance. If the empirical analysis and results hold under scrutiny, the work would demonstrate that attention-aware post-training can effectively mitigate visual forgetting in LVLMs, offering a practical remedy beyond heuristic reminders or post-hoc fixes. The dual-reward design targeting both trajectory and distribution is a targeted contribution if the causal drivers are isolated.

major comments (2)
  1. [Abstract] Abstract: the central motivation rests on the claim that performance degradation 'is largely driven' by early-stage attention decay and sink concentration, identified through 'systematic empirical analysis.' However, the abstract supplies no description of the analysis (e.g., whether it includes controlled interventions that isolate these factors while holding model, prompt, and decoding fixed, versus purely observational attention-map vs. accuracy correlations). This is load-bearing because the two rewards are directly derived from these factors; if the analysis is only correlational, the rewards may target symptoms rather than primary causes.
  2. [Abstract] Abstract: the claim of 'consistent outperformance on eight benchmark datasets' is presented without any quantitative results, baseline specifications, ablation studies, dataset sizes, or error analysis. This prevents assessment of effect sizes or whether the gains are robust, directly undermining evaluation of the central claim that the proposed rewards are an 'effective remedy.'
minor comments (1)
  1. [Abstract] Abstract: 'visual sink tokens' is used without a definition or citation to prior work on attention sinks; this notation should be clarified on first use.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed feedback on our manuscript. We address each major comment point by point below, focusing on the abstract's presentation of our empirical analysis and results. Where revisions to the abstract can improve clarity without exceeding length constraints, we indicate this explicitly.

read point-by-point responses
  1. Referee: [Abstract] Abstract: the central motivation rests on the claim that performance degradation 'is largely driven' by early-stage attention decay and sink concentration, identified through 'systematic empirical analysis.' However, the abstract supplies no description of the analysis (e.g., whether it includes controlled interventions that isolate these factors while holding model, prompt, and decoding fixed, versus purely observational attention-map vs. accuracy correlations). This is load-bearing because the two rewards are directly derived from these factors; if the analysis is only correlational, the rewards may target symptoms rather than primary causes.

    Authors: The abstract is a high-level summary, but the full manuscript details the systematic empirical analysis in Section 3. There, we conduct controlled interventions that hold the model, prompts, and decoding strategy fixed while isolating the effects of early-stage attention decay (via attention trajectory interventions) and sink token concentration (via targeted masking experiments), directly measuring causal impacts on accuracy. This establishes the factors as primary drivers rather than mere correlations, motivating the dual-reward design. We will revise the abstract to briefly note that the analysis involves such controlled interventions. revision: yes

  2. Referee: [Abstract] Abstract: the claim of 'consistent outperformance on eight benchmark datasets' is presented without any quantitative results, baseline specifications, ablation studies, dataset sizes, or error analysis. This prevents assessment of effect sizes or whether the gains are robust, directly undermining evaluation of the central claim that the proposed rewards are an 'effective remedy.'

    Authors: Abstracts are space-constrained and conventionally omit specific numbers, tables, and detailed ablations, which are instead provided in the main text (Section 4, Table 1, and ablation studies in Section 4.3) along with dataset details and error analysis. These elements demonstrate the consistent gains and robustness. We can partially revise the abstract to include one or two key quantitative highlights (e.g., average improvement margins) if space permits, while preserving its summary nature. revision: partial

Circularity Check

0 steps flagged

No circularity; empirical post-training method with independent motivation

full rationale

The provided abstract and description contain no equations, fitted parameters, or derivations. The two rewards are explicitly introduced as new constructs motivated by (but not defined in terms of) the empirical observations. No self-citation chains, uniqueness theorems, or ansatzes are referenced. The central claim rests on experimental validation rather than any reduction of outputs to inputs by construction. This is the most common honest non-finding for applied post-training papers.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Review performed on abstract only; no free parameters, axioms, or invented entities are described in sufficient detail to populate the ledger.

pith-pipeline@v0.9.1-grok · 5742 in / 1158 out tokens · 19651 ms · 2026-07-03T16:51:33.010940+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

87 extracted references · 34 canonical work pages · 19 internal anchors

  1. [1]

    In: NeurIPS (2022)

    Alayrac, J., Donahue, J., Luc, P., Miech, A., Barr, I., Hasson, Y., Lenc, K., Mensch, A., Millican, K., Reynolds, M., Ring, R., Rutherford, E., Cabi, S., Han, T., Gong, Z., Samangooei, S., Monteiro, M., Menick, J.L., Borgeaud, S., Brock, A., Nematzadeh, A., Sharifzadeh, S., Binkowski, M., Barreira, R., Vinyals, O., Zisserman, A., Si- monyan, K.: Flamingo:...

  2. [2]

    Qwen2.5-VL Technical Report

    Bai, S., Chen, K., Liu, X., Wang, J., Ge, W., Song, S., Dang, K., Wang, P., Wang, S., Tang, J., Zhong, H., Zhu, Y., Yang, M., Li, Z., Wan, J., Wang, P., Ding, W., Fu, Z., Xu, Y., Ye, J., Zhang, X., Xie, T., Cheng, Z., Zhang, H., Yang, Z., Xu, H., Lin, J.: Qwen2.5-vl technical report. CoRR (2025).https://doi.org/10.48550/ ARXIV.2502.13923,https://doi.org/1...

  3. [3]

    PaliGemma: A versatile 3B VLM for transfer

    Beyer, L., Steiner, A., Pinto, A.S., Kolesnikov, A., Wang, X., Salz, D., Neumann, M., Alabdulmohsin, I., Tschannen, M., Bugliarello, E., Unterthiner, T., Keysers, D., Koppula, S., Liu, F., Grycner, A., Gritsenko, A.A., Houlsby, N., Kumar, M., Rong, K., Eisenschlos, J., Kabra, R., Bauer, M., Bosnjak, M., Chen, X., Minderer, M., Voigtlaender, P., Bica, I., ...

  4. [4]

    arXiv preprint arXiv:2502.00698 (2025)

    Cai, H., Yang, Y., Hu, W.: Mm-iq: Benchmarking human-like abstraction and reasoning in multimodal models. arXiv preprint arXiv:2502.00698 (2025)

  5. [5]

    InternLM2 Technical Report

    Cai, Z., Cao, M., Chen, H., Chen, K., Chen, K., Chen, X., et al.: Internlm2 technical report. CoRR (2024),https://doi.org/10.48550/arXiv.2403.17297

  6. [6]

    In: Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021

    Chen, J., Tang, J., Qin, J., Liang, X., Liu, L., Xing, E., Lin, L.: Geoqa: A geo- metric question answering benchmark towards multimodal numerical reasoning. In: Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021. pp. 513–523 (2021)

  7. [7]

    In: ECCV

    Chen, L., Zhao, H., Liu, T., Bai, S., Lin, J., Zhou, C., Chang, B.: An image is worth 1/2 tokens after layer 2: Plug-and-play inference acceleration for large vision-language models. In: ECCV. pp. 19–35. Springer (2024)

  8. [8]

    Chen, L., Li, J., Dong, X., Zhang, P., Zang, Y., Chen, Z., Duan, H., Wang, J., Qiao, Y., Lin, D., Zhao, F.: Are we on the right way for evaluating large vision-language models? In: NeurIPS (2024)

  9. [9]

    ReVisual-R1: An open-source 7B multimodal large language model for deep reasoning

    Chen, S., Guo, Y., Su, Z., Li, Y., Wu, Y., Chen, J., Chen, J., Wang, W., Qu, X., Cheng, Y.: Advancing multimodal reasoning: From optimized cold start to staged reinforcement learning. arXiv preprint arXiv:2506.04207 (2025)

  10. [10]

    SFT Memorizes, RL Generalizes: A Comparative Study of Foundation Model Post-training

    Chu, T., Zhai, Y., Yang, J., Tong, S., Xie, S., Schuurmans, D., Le, Q.V., Levine, S., Ma, Y.: Sft memorizes, rl generalizes: A comparative study of foundation model post-training. arXiv preprint arXiv:2501.17161 (2025) 16 B. Yuan et al

  11. [12]

    OpenVLThinker: Complex Vision-Language Reasoning via Iterative SFT-RL Cycles

    Deng, Y., Bansal, H., Yin, F., Peng, N., Wang, W., Chang, K.: Openvlthinker: An early exploration to complex vision-language reasoning via iterative self- improvement. CoRR (2025),https://doi.org/10.48550/arXiv.2503.17352

  12. [13]

    Frontiers of Computer Science (2026)

    Du, J., Li, Y., Li, Y., Liao, L., Zhao, Z., Ye, G.: Medfuse: a multi-source data fusion framework for diabetic retinopathy lesion segmentation. Frontiers of Computer Science (2026)

  13. [14]

    CoRR (2025),https://doi.org/10.48550/arXiv.2501.01904

    Du, Y., Liu, Z., Li, Y., Zhao, W.X., Huo, Y., Wang, B., Chen, W., Liu, Z., Wang, Z., Wen, J.: Virgo: A preliminary exploration on reproducing o1-like MLLM. CoRR (2025),https://doi.org/10.48550/arXiv.2501.01904

  14. [15]

    In: Proceedings of the 32nd ACM International Conference on Multimedia

    Duan, H., Yang, J., Qiao, Y., Fang, X., Chen, L., Liu, Y., Dong, X., Zang, Y., Zhang, P., Wang, J., et al.: Vlmevalkit: An open-source toolkit for evaluating large multi-modality models. In: Proceedings of the 32nd ACM International Conference on Multimedia. pp. 11198–11201 (2024)

  15. [16]

    com/verl-project/verl(2025)

    Engine, V.: verl: Volcano engine reinforcement learning for llms.https://github. com/verl-project/verl(2025)

  16. [17]

    In: CVPR

    Favero, A., Zancato, L., Trager, M., Choudhary, S., Perera, P., Achille, A., Swami- nathan, A., Soatto, S.: Multi-modal hallucination control by visual information grounding. In: CVPR. pp. 14303–14312. IEEE (2024)

  17. [18]

    In: CVPR (2026)

    Fu, Y., Zhang, Z., Zhang, Y., Wang, Z., Huang, Z., Luo, Y.: Mergevla: Cross-skill model merging toward a generalist vision-language-action agent. In: CVPR (2026)

  18. [19]

    In: EMNLP

    Gong, X., Ming, T., Wang, X., Wei, Z.: DAMRO: dive into the attention mechanism of LVLM to reduce object hallucination. In: EMNLP. pp. 7696–7712. Association for Computational Linguistics (2024)

  19. [20]

    In: CVPR

    Guan, T., Liu, F., Wu, X., Xian, R., Li, Z., Liu, X., Wang, X., Chen, L., Huang, F., Yacoob, Y., Manocha, D., Zhou, T.: Hallusionbench: An advanced diagnostic suite for entangled language hallucination and visual illusion in large vision-language models. In: CVPR. IEEE (2024)

  20. [21]

    In: The Fourteenth International Conference on Learning Representations (ICLR) (2026)

    Hao, Z., Li, Z., Li, W., Liu, F., Zhang, M., Li, J.: Echoes as anchors: Probabilistic costs and attention refocusing in llm reasoning. In: The Fourteenth International Conference on Learning Representations (ICLR) (2026)

  21. [22]

    In: Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

    He, C., Luo, R., Bai, Y., Hu, S., Thai, Z., Shen, J., Hu, J., Han, X., Huang, Y., Zhang, Y., et al.: Olympiadbench: A challenging benchmark for promoting agi with olympiad-level bilingual multimodal scientific problems. In: Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). pp. 3828–3850 (2024)

  22. [23]

    In: NeurIPS (2024)

    Hu, Y., Shi, W., Fu, X., Roth, D., Ostendorf, M., Zettlemoyer, L., Smith, N.A., Krishna, R.: Visual sketchpad: Sketching as a visual chain of thought for multimodal language models. In: NeurIPS (2024)

  23. [24]

    Vision-R1: Incentivizing Reasoning Capability in Multimodal Large Language Models

    Huang, W., Jia, B., Zhai, Z., Cao, S., Ye, Z., Zhao, F., Xu, Z., Hu, Y., Lin, S.: Vision-r1: Incentivizing reasoning capability in multimodal large language models. CoRR (2025),https://doi.org/10.48550/arXiv.2503.06749

  24. [25]

    OpenAI o1 System Card

    Jaech, A., Kalai, A., Lerer, A., Richardson, A., El-Kishky, A., Low, A., Helyar, A., Madry, A., Beutel, A., Carney, A., et al.: Openai o1 system card. arXiv preprint arXiv:2412.16720 (2024)

  25. [26]

    CoRR (2025),https://doi

    Jian, P., Wu, J., Sun, W., Wang, C., Ren, S., Zhang, J.: Look again, think slowly: Enhancing visual reflection in vision-language models. CoRR (2025),https://doi. org/10.48550/arXiv.2509.12132 LASER 17

  26. [27]

    In: ICLR (2025)

    Kang, S., Kim, J., Kim, J., Hwang, S.J.: See what you are told: Visual attention sink in large multimodal models. In: ICLR (2025)

  27. [28]

    Mmr1: Enhancing multimodal reasoning with variance-aware sampling and open resources.ArXiv, abs/2509.21268, 2025

    Leng, S., Wang, J., Li, J., Zhang, H., Hu, Z., Zhang, B., Jiang, Y., Zhang, H., Li, X., Bing, L., et al.: Mmr1: Enhancing multimodal reasoning with variance-aware sampling and open resources. arXiv preprint arXiv:2509.21268 (2025)

  28. [29]

    In: CVPR

    Leng, S., Zhang, H., Chen, G., Li, X., Lu, S., Miao, C., Bing, L.: Mitigating object hallucinations in large vision-language models through visual contrastive decoding. In: CVPR. pp. 13872–13882. IEEE (2024)

  29. [30]

    Advances in Neural Information Processing Systems (2026)

    Li, Q., Zhang, Y., Mai, Z., Chen, Y., Huang, H., Zhang, J., Zhang, Z., Wen, Y., Li, W., Fu, H., et al.: Can large multimodal models understand agricultural scenes? benchmarking with agromind. Advances in Neural Information Processing Systems (2026)

  30. [31]

    In: ACL (2026)

    Li, X., et al.: Llm inductive reasoning through multi-agent enhanced monte carlo tree search. In: ACL (2026)

  31. [32]

    Self-Rewarding Vision-Language Model via Reasoning Decomposition

    Li, Z., Yu, W., Huang, C., Liu, R., Liang, Z., Liu, F., Che, J., Yu, D., Boyd- Graber, J.L., Mi, H., Yu, D.: Self-rewarding vision-language model via reasoning decomposition. CoRR (2025),https://doi.org/10.48550/arXiv.2508.19652

  32. [33]

    In: Proceedings of the AAAI Conference on Artificial Intelligence

    Lin, Z., Lin, M., Lin, L., Ji, R.: Boosting multimodal large language models with visual tokens withdrawal for rapid inference. In: Proceedings of the AAAI Conference on Artificial Intelligence. pp. 5334–5342 (2025)

  33. [34]

    In: CVPR

    Liu, H., Li, C., Li, Y., Lee, Y.J.: Improved baselines with visual instruction tuning. In: CVPR. IEEE (2024)

  34. [35]

    In: Oh, A., Naumann, T., Globerson, A., Saenko, K., Hardt, M., Levine, S

    Liu, H., Li, C., Wu, Q., Lee, Y.J.: Visual instruction tuning. In: Oh, A., Naumann, T., Globerson, A., Saenko, K., Hardt, M., Levine, S. (eds.) NeurIPS (2023)

  35. [36]

    In: Proceedings of the 33rd ACM International Conference on Multimedia

    Liu, W., Pan, Q., Zhang, Y., Liu, Z., Wu, J., Zhou, J., Zhou, A., Chen, Q., Jiang, B., He, L.: Cmm-math: A chinese multimodal math dataset to evaluate and enhance the mathematics reasoning of large multimodal models. In: Proceedings of the 33rd ACM International Conference on Multimedia. pp. 12585–12591 (2025)

  36. [37]

    Infi-MMR: Curriculum-based unlocking multimodal reasoning via phased reinforcement learning in multimodal small language models

    Liu, Z., Liu, Y., Zhu, G., Xie, C., Li, Z., Yuan, J., Wang, X., Li, Q., Cheung, S.C., Zhang, S., et al.: Infi-mmr: Curriculum-based unlocking multimodal reasoning via phased reinforcement learning in multimodal small language models. arXiv preprint arXiv:2505.23091 (2025)

  37. [38]

    In: ICLR

    Lu, P., Bansal, H., Xia, T., Liu, J., Li, C., Hajishirzi, H., Cheng, H., Chang, K., Galley, M., Gao, J.: Mathvista: Evaluating mathematical reasoning of foundation models in visual contexts. In: ICLR. OpenReview.net (2024)

  38. [39]

    Lu, P., Gong, R., Jiang, S., Qiu, L., Huang, S., Liang, X., Zhu, S.C.: Inter-gps: In- terpretable geometry problem solving with formal language and symbolic reasoning. In: Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers...

  39. [40]

    IconQA: A new benchmark for abstract diagram understanding and visual language reasoning.arXiv preprint arXiv:2110.13214, 2021

    Lu, P., Qiu, L., Chen, J., Xia, T., Zhao, Y., Zhang, W., Yu, Z., Liang, X., Zhu, S.C.: Iconqa: A new benchmark for abstract diagram understanding and visual language reasoning. arXiv preprint arXiv:2110.13214 (2021)

  40. [41]

    MM-Eureka: Exploring the Frontiers of Multimodal Reasoning with Rule-based Reinforcement Learning

    Meng, F., Du, L., Liu, Z., Zhou, Z., Lu, Q., Fu, D., Han, T., Shi, B., Wang, W., He, J., et al.: Mm-eureka: Exploring the frontiers of multimodal reasoning with rule-based reinforcement learning. arXiv preprint arXiv:2503.07365 (2025)

  41. [42]

    In: AAAI

    Mondal, D., Modi, S., Panda, S., Singh, R., Rao, G.S.: Kam-cot: Knowledge augmented multimodal chain-of-thoughts reasoning. In: AAAI. pp. 18798–18806. AAAI Press (2024)

  42. [43]

    OpenAI GPT-5 System Card

    OpenAI: Openai gpt-5 system card (2025),https://arxiv.org/abs/2601.03267 18 B. Yuan et al

  43. [44]

    Qiao, R., Tan, Q., Dong, G., Wu, M., Sun, C., Song, X., Wang, J., Gongque, Z., Lei, S., Zhang, Y., Wei, Z., Zhang, M., Qiao, R., Zong, X., Xu, Y., Yang, P., Bao, Z., Diao, M., Li, C., Zhang, H.: We-math: Does your large multimodal model achieve human-like mathematical reasoning? In: ACL. pp. 20023–20070. Association for Computational Linguistics (2025)

  44. [45]

    In: Riloff, E., Chiang, D., Hockenmaier, J., Tsujii, J

    Rohrbach, A., Hendricks, L.A., Burns, K., Darrell, T., Saenko, K.: Object halluci- nation in image captioning. In: Riloff, E., Chiang, D., Hockenmaier, J., Tsujii, J. (eds.) EMNLP. pp. 4035–4045. Association for Computational Linguistics (2018)

  45. [46]

    DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models

    Shao, Z., Wang, P., Zhu, Q., Xu, R., Song, J., Zhang, M., Li, Y.K., Wu, Y., Guo, D.: Deepseekmath: Pushing the limits of mathematical reasoning in open language models. CoRR (2024),https://doi.org/10.48550/arXiv.2402.03300

  46. [47]

    CoRR (2025), https://doi.org/10.48550/arXiv.2504

    Shen, H., Liu, P., Li, J., Fang, C., Ma, Y., Liao, J., Shen, Q., Zhang, Z., Zhao, K., Zhang, Q., Xu, R., Zhao, T.: VLM-R1: A stable and generalizable r1-style large vision-language model. CoRR (2025), https://doi.org/10.48550/arXiv.2504. 07615

  47. [48]

    In: Findings of the Association for Computational Linguistics: EMNLP 2024

    Shi, W., Hu, Z., Bin, Y., Liu, J., Yang, Y., Ng, S.K., Bing, L., Lee, R.K.W.: Math- llava: Bootstrapping mathematical reasoning for multimodal large language models. In: Findings of the Association for Computational Linguistics: EMNLP 2024. pp. 4663–4680 (2024)

  48. [49]

    Sun, H., Sun, Z., Peng, H., Ye, H.: Mitigating visual forgetting via take-along visual conditioning for multi-modal long cot reasoning. In: ACL. pp. 5158–5171. Association for Computational Linguistics (2025)

  49. [50]

    arXiv (2024)

    Sun, M., et al.: Massive activations in large language models. arXiv (2024)

  50. [51]

    Gemini 2.5: Pushing the Frontier with Advanced Reasoning, Multimodality, Long Context, and Next Generation Agentic Capabilities

    Team, G.: Gemini 2.5: Pushing the frontier with advanced reasoning, multimodality, long context, and next generation agentic capabilities. CoRRabs/2507.06261 (2025),https://doi.org/10.48550/arXiv.2507.06261

  51. [52]

    Thawakar, O., Dissanayake, D., More, K.P., Thawkar, R., Heakl, A., Ahsan, N., Li, Y., Zumri, M., Lahoud, J., Anwer, R.M., Cholakkal, H., Laptev, I., Shah, M., Khan, F.S., Khan, S.H.: Llamav-o1: Rethinking step-by-step visual reasoning in llms. In: ACL. pp. 24290–24315. Association for Computational Linguistics (2025)

  52. [53]

    CoRR (2025),https://doi.org/10.48550/arXiv.2509.25848

    Tian,X.,Zou,S.,Yang,Z.,He,M.,Waschkowski,F.,Wesemann,L.,Tu,P.H.,Zhang, J.: More thought, less accuracy? on the dual nature of reasoning in vision-language models. CoRR (2025),https://doi.org/10.48550/arXiv.2509.25848

  53. [54]

    CoRR (2025),https://doi.org/10.48550/arXiv.2503.08342

    Tu, C., Ye, P., Zhou, D., Bai, L., Yu, G., Chen, T., Ouyang, W.: Attention reallocation: Towards zero-cost and controllable hallucination mitigation of mllms. CoRR (2025),https://doi.org/10.48550/arXiv.2503.08342

  54. [55]

    arXiv preprint arXiv:2506.11595 (2025)

    Unsal, M., Akkus, A.: Easyarc: Evaluating vision language models on true visual reasoning. arXiv preprint arXiv:2506.11595 (2025)

  55. [56]

    When to Commit? Towards Variable-Size Self-Contained Blocks for Discrete Diffusion Language Models

    Wang, D., Qiu, R., Huang, Z.: When to commit? towards variable-size self-contained blocks for discrete diffusion language models. arXiv preprint arXiv:2604.23994 (2026)

  56. [57]

    VL-Rethinker: Incentivizing Self-Reflection of Vision-Language Models with Reinforcement Learning

    Wang, H., Qu, C., Huang, Z., Chu, W., Lin, F., Chen, W.: Vl-rethinker: Incentivizing self-reflection of vision-language models with reinforcement learning. CoRR (2025), https://doi.org/10.48550/arXiv.2504.08837

  57. [58]

    In: NeurIPS (2024)

    Wang, K., Pan, J., Shi, W., Lu, Z., Ren, H., Zhou, A., Zhan, M., Li, H.: Measuring multimodal mathematical reasoning with math-vision dataset. In: NeurIPS (2024)

  58. [59]

    In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition

    Wang, S., Yu, X., Luo, Y., Wang, Z., Zhang, P., Huang, Z.: Language-driven fine-grained retrieval. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 2682–2692 (2026)

  59. [60]

    In: Proceedings of LASER 19 the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

    Wang, Y., Wang, S., Cheng, Q., Fei, Z., Ding, L., Guo, Q., Tao, D., Qiu, X.: Visuo- think: Empowering lvlm reasoning with multimodal tree search. In: Proceedings of LASER 19 the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). pp. 21707–21719 (2025)

  60. [61]

    NeurIPS35, 24824–24837 (2022)

    Wei, J., Wang, X., Schuurmans, D., Bosma, M., Xia, F., Chi, E., Le, Q.V., Zhou, D., et al.: Chain-of-thought prompting elicits reasoning in large language models. NeurIPS35, 24824–24837 (2022)

  61. [62]

    CoRR (2025), https://doi.org/10.48550/arXiv.2502.11740

    Wu, J., Xiong, Y., Li, X., Xia, Y., Wang, R., Wang, Y., Yu, T., Kim, S., Rossi, R.A., Yao, L., Shang, J., McAuley, J.J.: Mitigating visual knowledge forgetting in MLLM instruction-tuning via modality-decoupled gradient descent. CoRR (2025), https://doi.org/10.48550/arXiv.2502.11740

  62. [63]

    Visionary-R1: Mitigating shortcuts in visual reasoning with reinforcement learning.arXiv preprint arXiv:2505.14677, 2025

    Xia, J., Zang, Y., Gao, P., Li, Y., Zhou, K.: Visionary-r1: Mitigating shortcuts in visual reasoning with reinforcement learning. CoRR (2025),https://doi.org/10. 48550/arXiv.2505.14677

  63. [64]

    Efficient Streaming Language Models with Attention Sinks

    Xiao, G., Tian, Y., Chen, B., Han, S., Lewis, M.: Efficient streaming language models with attention sinks. arXiv preprint arXiv:2309.17453 (2023)

  64. [65]

    LogicVista: Multimodal LLM Logical Reasoning Benchmark in Visual Contexts

    Xiao, Y., Sun, E., Liu, T., Wang, W.: Logicvista: Multimodal LLM logical reasoning benchmark in visual contexts. CoRRabs/2407.04973(2024), https://doi.org/ 10.48550/arXiv.2407.04973

  65. [66]

    LLaVA-CoT: Let Vision Language Models Reason Step-by-Step

    Xu, G., Jin, P., Li, H., Song, Y., Sun, L., Yuan, L.: Llava-cot: Let vision language models reason step-by-step. CoRR (2024), https://doi.org/10.48550/arXiv. 2411.10440

  66. [67]

    arXiv preprint arXiv:2507.03019 (2025)

    Yang, S., Niu, Y., Liu, Y., Ye, Y., Lin, B., Yuan, L.: Look-back: Implicit visual re-focusing in mllm reasoning. arXiv preprint arXiv:2507.03019 (2025)

  67. [68]

    R1-Onevision: Advancing Generalized Multimodal Reasoning through Cross-Modal Formalization

    Yang, Y., He, X., Pan, H., Jiang, X., Deng, Y., Yang, X., Lu, H., Yin, D., Rao, F., Zhu, M., Zhang, B., Chen, W.: R1-onevision: Advancing generalized multimodal reasoning through cross-modal formalization. CoRR (2025),https://doi.org/10. 48550/arXiv.2503.10615

  68. [69]

    arXiv preprint arXiv:2505.16673 (2025)

    Yao, H., Yin, Q., Zhang, J., Yang, M., Wang, Y., Wu, W., Su, F., Shen, L., Qiu, M., Tao, D., et al.: R1-sharevl: Incentivizing reasoning capability of multimodal large language models via share-grpo. arXiv preprint arXiv:2505.16673 (2025)

  69. [70]

    In: Proceedings of the 33rd ACM International Conference on Multimedia (2025)

    Yuan,B.,Song,S.,Fernandez,J.,Luo,Y.,Baktashmotlagh,M.,Wang,Z.:Wiswheat: A three-tiered vision-language dataset for wheat management. In: Proceedings of the 33rd ACM International Conference on Multimedia (2025)

  70. [71]

    In: CVPR

    Yue, X., Ni, Y., Zheng, T., Zhang, K., Liu, R., Zhang, G., Stevens, S., Jiang, D., Ren, W., Sun, Y., Wei, C., Yu, B., Yuan, R., Sun, R., Yin, M., Zheng, B., Yang, Z., Liu, Y., Huang, W., Sun, H., Su, Y., Chen, W.: MMMU: A massive multi-discipline multimodal understanding and reasoning benchmark for expert AGI. In: CVPR. pp. 9556–9567. IEEE (2024)

  71. [72]

    In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition

    Zhang, C., Gao, F., Jia, B., Zhu, Y., Zhu, S.C.: Raven: A dataset for relational and analogical visual reasoning. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 5317–5327 (2019)

  72. [73]

    Zhang, R., Jiang, D., Zhang, Y., Lin, H., Guo, Z., Qiu, P., Zhou, A., Lu, P., Chang, K., Qiao, Y., Gao, P., Li, H.: MATHVERSE: does your multi-modal LLM truly see the diagrams in visual math problems? In: ECCV. pp. 169–186. Springer (2024)

  73. [74]

    In: Proceedings of the 48th International ACM SIGIR Conference on Research and Development in Information Retrieval (2025)

    Zhao, Z., Chen, Z., Huang, Z., Sadiq, S., Chen, T.: Continual text-to-video retrieval with frame fusion and task-aware routing. In: Proceedings of the 48th International ACM SIGIR Conference on Research and Development in Information Retrieval (2025)

  74. [75]

    In: NeurIPS (2023) 20 B

    Zheng, G., Yang, B., Tang, J., Zhou, H., Yang, S.: Ddcot: Duty-distinct chain- of-thought prompting for multimodal reasoning in language models. In: NeurIPS (2023) 20 B. Yuan et al

  75. [76]

    Least-to-Most Prompting Enables Complex Reasoning in Large Language Models

    Zhou, D., Schärli, N., Hou, L., Wei, J., Scales, N., Wang, X., Schuurmans, D., Cui, C., Bousquet, O., Le, Q., et al.: Least-to-most prompting enables complex reasoning in large language models. arXiv preprint arXiv:2205.10625 (2022)

  76. [77]

    Zhou, Y., Li, W., Lu, Y., Li, J., Liu, F., Zhang, M., Wang, Y., He, D., LIU, H., Zhang, M.: Reflection on knowledge graph for large language models reasoning. In: Findings of the 63rd Annual Meeting of the Association for Computational Linguistics (ACL) (2025) LASER 21 This supplementary material provides additional descriptions of the main paper. It is o...

  77. [78]

    The image prominently features buses, including a pink double-decker bus in the foreground, which is a clear indicator of public transportation

  78. [79]

    There are other buses visible in the background, suggesting that the image is related to a scene involving buses

  79. [80]

    The presence of a car does not overshadow the buses in the image, as the buses are more central and occupy a larger portion of the frame

  80. [81]

    Considering the options provided: A

    The question asks for the main topic, and given the prominence and number of buses in the image, the focus is clearly on buses rather than cars. Considering the options provided: A. Driving cars - This is not the main topic as cars are not the primary focus. B. Driving buses - This could be a strong contender, but the image shows stationary buses rather t...

Showing first 80 references.