pith. sign in

arxiv: 2606.29847 · v1 · pith:AVV73EN2new · submitted 2026-06-29 · 💻 cs.CV

See Only When Needed: Context-Aware Attention Intervention for Mitigating Hallucinations in LVLMs

Pith reviewed 2026-06-30 06:44 UTC · model grok-4.3

classification 💻 cs.CV
keywords hallucination mitigationlarge vision-language modelsattention interventioncontext-aware selectivitytraining-free inferenceobject hallucinationsvisual grounding
0
0 comments X

The pith

Context-aware attention intervention reduces object hallucinations in vision-language models by intervening selectively on uncertain tokens.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper proposes Context-aware Attention Intervention (CAI) as a training-free method to mitigate hallucinations in large vision-language models. It enforces a see only when needed principle through two-axis selectivity that determines both where to focus visual attention and when to apply changes during decoding. Token-specific relevance drawn from early layers identifies aligned image regions, and an entropy- and depth-gated tilt adjusts attention only for tokens showing uncertainty spikes in deeper layers. This leaves confident tokens and irrelevant regions untouched to avoid spurious evidence or fluency loss. Experiments across backbones and benchmarks report state-of-the-art gains even without optional contrastive decoding.

Core claim

CAI derives token-specific visual relevance from early-layer representations to localize semantically aligned regions and applies a conservative entropy- and depth-gated attention tilt only to uncertainty-spiking tokens in deeper layers, enforcing see-only-when-needed selectivity that strengthens grounding while preserving fluency and achieving state-of-the-art hallucination mitigation as a KL-minimal reweighting with bounded interference under inactive gates or small tilts.

What carries the argument

Context-aware Attention Intervention (CAI), the two-axis selectivity mechanism that uses early-layer visual relevance for localization and applies gated attention reweighting only on uncertain deeper-layer tokens.

If this is right

  • The method yields consistent hallucination reductions across multiple LVLM backbones without any training.
  • Linguistic fluency is maintained because confident tokens and irrelevant regions receive no change.
  • Contrastive decoding remains optional rather than required for the gains.
  • The intervention is characterized as KL-minimal reweighting with interference bounded by inactive gates or small tilts.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The approach implies that visual grounding errors concentrate in deeper layers specifically around high-entropy tokens rather than uniformly.
  • It could extend to other generation flaws in multimodal models where selective correction might avoid side effects.
  • The early-to-late layer distinction suggests a natural division of labor that future architectures might exploit directly.

Load-bearing premise

Early-layer representations accurately identify which image regions matter for each specific token, and selectively tilting attention on uncertain tokens strengthens correct grounding without adding new errors or reducing output quality.

What would settle it

A controlled test on standard benchmarks where the attention tilt is applied only to tokens flagged as uncertain shows no reduction in hallucination rates relative to the unmodified model.

Figures

Figures reproduced from arXiv: 2606.29847 by Cees G.M. Snoek, Ling Shao, Wenbo Lyu, Xiantong Zhen, Yingjun Du, Yuqing Lei.

Figure 1
Figure 1. Figure 1: Visualization of token-image similarity. [PITH_FULL_IMAGE:figures/full_fig_p002_1.png] view at source ↗
Figure 2
Figure 2. Figure 2: Given visual input (a) and the query “Please describe the image in detail”, (b) shows the evolution of token entropy across decoding layers. In deeper layers, hallucination-prone tokens (e.g., “traffic light”) exhibit markedly higher entropy than grounded tokens (e.g., “man”), whereas tokens dominated by language priors (e.g., “The”, “a”, “at”) remain low-entropy. failure mode critically undermines reliabi… view at source ↗
Figure 3
Figure 3. Figure 3: Overview of CAI. At each decoding step, the lowest layer evaluates the similarity between the generated token and each patch token (Eq. 4), yielding a visual grounding signal. Attention interventions are applied to deep decoding layers with high entropy (Eq. 6), redirecting attention toward patch tokens in proportion to their similarity (Eq. 5). Integrated with contrastive decoding (Eq. 7), CAI mitigates r… view at source ↗
Figure 5
Figure 5. Figure 5: Case study: failure mode of confident hallucination. Given the prompt “Please describe this image in detail”, the model generates the hal￾lucinated token “plane”, which is not present in (a). Note that its predictive entropy across multiple lay￾ers remains below threshold, illustrating a failure mode where confident hallucinations evade entropy￾based intervention [PITH_FULL_IMAGE:figures/full_fig_p012_5.png] view at source ↗
Figure 6
Figure 6. Figure 6: Ablation study on the starting in￾tervention layer ls and the entropy thresh￾old γ under the random setting of A￾OKVQA in POPE [PITH_FULL_IMAGE:figures/full_fig_p013_6.png] view at source ↗
Figure 8
Figure 8. Figure 8: Visualization of token-image similarity across decoding layers. [PITH_FULL_IMAGE:figures/full_fig_p013_8.png] view at source ↗
Figure 9
Figure 9. Figure 9: Case study of visual input (a) with the query “Is there only one zipper in the picture?”. The intervention layers are shown in (b), and token-image similarity in (c) establishes grounding between the response token and the visual input. The attention maps from the 0-th head in layer 31, before and after the similarity-guided intervention, are visualized in (d) and (e). ("dog"), successfully preserving fine… view at source ↗
Figure 10
Figure 10. Figure 10: Number of tokens requiring intervention at each layer [PITH_FULL_IMAGE:figures/full_fig_p022_10.png] view at source ↗
Figure 11
Figure 11. Figure 11: Case study of CHAIR in long text generation. [PITH_FULL_IMAGE:figures/full_fig_p024_11.png] view at source ↗
read the original abstract

Large Vision-Language Models (LVLMs) excel at multimodal tasks but remain prone to object hallucinations. Prior training-free remedies often uniformly strengthen visual signals, which may also amplify irrelevant regions and introduce spurious evidence, harming fluency. We propose Context-aware Attention Intervention (CAI), a training-free inference-time mechanism that enforces a see only when needed principle via two-axis selectivity: where to look and when to intervene. At each decoding step, CAI derives token-specific visual relevance from early-layer representations to localize semantically aligned regions, and applies a conservative, entropy- and depth-gated attention tilt only for uncertainty-spiking tokens in deeper layers where visual grounding degrades, leaving confident tokens and irrelevant regions largely unchanged. This targeted intervention strengthens visual grounding while preserving linguistic fluency, and it yields consistent improvements even without contrastive decoding, which remains optional as an auxiliary bias-suppression module. Extensive experiments across multiple LVLM backbones and benchmarks show that CAI achieves state-of-the-art hallucination mitigation, and our analysis characterizes CAI as a KL-minimal attention reweighting with bounded interference under inactive gates or small tilts. Code is available at https://github.com/Iris1946/CAI.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The manuscript proposes Context-aware Attention Intervention (CAI), a training-free inference-time mechanism for mitigating object hallucinations in LVLMs. It enforces a 'see only when needed' principle through two-axis selectivity: deriving token-specific visual relevance from early-layer representations to localize semantically aligned regions, then applying entropy- and depth-gated attention tilts only to uncertainty-spiking tokens in deeper layers. The method claims consistent SOTA improvements across multiple LVLM backbones and benchmarks, even without contrastive decoding (which is optional), and characterizes the intervention as KL-minimal reweighting with bounded interference under inactive gates or small tilts. Code is released publicly.

Significance. If the empirical results and the KL-minimal characterization hold under the stated assumptions, this would represent a meaningful advance by providing a targeted, training-free intervention that avoids the fluency degradation and spurious evidence risks of uniform visual strengthening approaches. The public code release supports reproducibility and enables direct follow-up work on LVLM reliability.

major comments (2)
  1. [Experiments] The central claim that two-axis selectivity strengthens grounding without amplifying irrelevant regions or harming fluency rests on the accuracy of token-specific visual relevance derived from early-layer representations. No quantitative validation of this localization (e.g., overlap metrics with ground-truth object regions or ablation isolating localization quality from the gating) appears in the experiments, leaving the bounded-interference guarantee unsupported.
  2. [Analysis] The analysis section characterizes CAI as KL-minimal attention reweighting, but the abstract provides no derivation, closed-form expression, or controlled measurement showing that the reweighting is minimal or that interference remains bounded when gates are inactive; this property is load-bearing for the 'conservative' and 'parameter-free' framing of the intervention.
minor comments (1)
  1. [Abstract] The abstract states 'extensive experiments across multiple LVLM backbones and benchmarks' but does not name the specific models or datasets in the opening paragraph; moving this detail forward would improve readability.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our work. We address each major comment below with clarifications and indicate planned revisions to strengthen the empirical and analytical support.

read point-by-point responses
  1. Referee: [Experiments] The central claim that two-axis selectivity strengthens grounding without amplifying irrelevant regions or harming fluency rests on the accuracy of token-specific visual relevance derived from early-layer representations. No quantitative validation of this localization (e.g., overlap metrics with ground-truth object regions or ablation isolating localization quality from the gating) appears in the experiments, leaving the bounded-interference guarantee unsupported.

    Authors: We acknowledge the value of quantitative validation for the early-layer localization. The current manuscript provides qualitative attention visualizations demonstrating semantic alignment, but we agree that overlap metrics and isolating ablations would offer stronger evidence. In the revised version, we will add mean IoU overlap with ground-truth object regions on a benchmark subset and an ablation separating localization quality from the entropy/depth gates to better support the bounded-interference property. revision: yes

  2. Referee: [Analysis] The analysis section characterizes CAI as KL-minimal attention reweighting, but the abstract provides no derivation, closed-form expression, or controlled measurement showing that the reweighting is minimal or that interference remains bounded when gates are inactive; this property is load-bearing for the 'conservative' and 'parameter-free' framing of the intervention.

    Authors: The analysis section derives the KL-minimal characterization and states the bounded-interference property under inactive gates or small tilts. To make this more explicit as requested, the revision will include a dedicated subsection with the closed-form KL divergence expression between original and intervened distributions, plus controlled empirical measurements of KL values across gate states and tilt sizes. This will reinforce the conservative framing without altering the core method. revision: yes

Circularity Check

0 steps flagged

No significant circularity; derivation is self-contained empirical intervention

full rationale

The paper introduces CAI as a training-free inference-time mechanism that derives token-specific visual relevance from early-layer representations and applies entropy- and depth-gated attention tilt. The central result (SOTA hallucination mitigation with KL-minimal reweighting characterization) is supported by experiments across multiple LVLM backbones and benchmarks, with the characterization presented as an analysis of the proposed method rather than a reduction to fitted inputs or self-citations. No load-bearing steps reduce by construction to the paper's own definitions, predictions, or prior self-citations. The method remains independent of its inputs and externally falsifiable via the reported benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central claim rests on standard transformer attention behavior plus the unverified effectiveness of the proposed early-layer relevance extraction and gated tilt; no free parameters, new entities, or ad-hoc axioms are explicitly introduced in the abstract.

axioms (1)
  • standard math Standard multi-head attention and layer-wise representation properties in transformer-based LVLMs
    The method assumes early layers encode usable visual relevance and deeper layers suffer grounding degradation, which are background properties of the architecture.

pith-pipeline@v0.9.1-grok · 5764 in / 1270 out tokens · 37240 ms · 2026-06-30T06:44:33.260201+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

53 extracted references · 19 canonical work pages · 14 internal anchors

  1. [1]

    In: Proceedings of the Computer Vision and Pattern Recognition Conference

    An, W., Tian, F., Leng, S., Nie, J., Lin, H., Wang, Q., Chen, P., Zhang, X., Lu, S.: Mitigating object hallucinations in large vision-language models with assembly of global and local attention. In: Proceedings of the Computer Vision and Pattern Recognition Conference. pp. 29915–29926 (2025)

  2. [2]

    Qwen Technical Report

    Bai, J., Bai, S., Chu, Y., Cui, Z., Dang, K., Deng, X., Fan, Y., Ge, W., Han, Y., Huang, F., et al.: Qwen technical report. arXiv preprint arXiv:2309.16609 (2023)

  3. [3]

    Hallucination of Multimodal Large Language Models: A Survey

    Bai, Z., Wang, P., Xiao, T., He, T., Han, Z., Zhang, Z., Shou, M.Z.: Hallucination of multimodal large language models: A survey. arXiv preprint arXiv:2404.18930 (2024)

  4. [4]

    Advances in Neural Information Processing Systems37, 7400–7426 (2024)

    Basu, S., Grayson, M., Morrison, C., Nushi, B., Feizi, S., Massiceti, D.: Under- standing information storage and transfer in multi-modal large language models. Advances in Neural Information Processing Systems37, 7400–7426 (2024)

  5. [5]

    Advances in neural information processing systems33, 1877–1901 (2020)

    Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J.D., Dhariwal, P., Nee- lakantan, A., Shyam, P., Sastry, G., Askell, A., et al.: Language models are few-shot learners. Advances in neural information processing systems33, 1877–1901 (2020)

  6. [6]

    In: Proceedings of the Computer Vision and Pattern Recognition Conference

    Chen, J., Zhang, T., Huang, S., Niu, Y., Zhang, L., Wen, L., Hu, X.: Ict: Image- object cross-level trusted intervention for mitigating object hallucination in large vision-language models. In: Proceedings of the Computer Vision and Pattern Recognition Conference. pp. 4209–4221 (2025)

  7. [7]

    arXiv preprint arXiv:2311.16479 (2023) 16 Lei et al

    Chen, Z., Zhu, Y., Zhan, Y., Li, Z., Zhao, C., Wang, J., Tang, M.: Mitigating hallucination in visual language models with visual supervision. arXiv preprint arXiv:2311.16479 (2023) 16 Lei et al

  8. [8]

    Journal of Machine Learning Research24(240), 1–113 (2023)

    Chowdhery, A., Narang, S., Devlin, J., Bosma, M., Mishra, G., Roberts, A., Barham, P., Chung, H.W., Sutton, C., Gehrmann, S., et al.: Palm: Scaling lan- guage modeling with pathways. Journal of Machine Learning Research24(240), 1–113 (2023)

  9. [9]

    In: International Conference on Learning Representations

    Chuang, Y.S., Xie, Y., Luo, H., Kim, Y., Glass, J.R., He, P.: Dola: Decoding by contrasting layers improves factuality in large language models. In: International Conference on Learning Representations. vol. 2024, pp. 54158–54183 (2024)

  10. [10]

    Advances in neural information processing systems36, 49250–49267 (2023)

    Dai, W., Li, J., Li, D., Tiong, A., Zhao, J., Wang, W., Li, B., Fung, P.N., Hoi, S.: Instructblip: Towards general-purpose vision-language models with instruction tuning. Advances in neural information processing systems36, 49250–49267 (2023)

  11. [11]

    In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition

    Favero, A., Zancato, L., Trager, M., Choudhary, S., Perera, P., Achille, A., Swami- nathan, A., Soatto, S.: Multi-modal hallucination control by visual information grounding. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 14303–14312 (2024)

  12. [12]

    In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition

    Guan, T., Liu, F., Wu, X., Xian, R., Li, Z., Liu, X., Wang, X., Chen, L., Huang, F., Yacoob, Y., et al.: Hallusionbench: an advanced diagnostic suite for entan- gled language hallucination and visual illusion in large vision-language models. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 14375–14385 (2024)

  13. [13]

    Vision-R1: Incentivizing Reasoning Capability in Multimodal Large Language Models

    Huang, W., Jia, B., Zhai, Z., Cao, S., Ye, Z., Zhao, F., Xu, Z., Hu, Y., Lin, S.: Vision-r1: Incentivizing reasoning capability in multimodal large language models. arXiv preprint arXiv:2503.06749 (2025)

  14. [14]

    In: Proceedings of the IEEE/CVF confer- ence on computer vision and pattern recognition

    Hudson, D.A., Manning, C.D.: Gqa: A new dataset for real-world visual reasoning and compositional question answering. In: Proceedings of the IEEE/CVF confer- ence on computer vision and pattern recognition. pp. 6700–6709 (2019)

  15. [15]

    arXiv preprint arXiv:2408.02032 (2024)

    Huo, F., Xu, W., Zhang, Z., Wang, H., Chen, Z., Zhao, P.: Self-introspective de- coding: Alleviating hallucinations for large vision-language models. arXiv preprint arXiv:2408.02032 (2024)

  16. [16]

    Jiang, C., Xu, H., Dong, M., Chen, J., Ye, W., Yan, M., Ye, Q., Zhang, J., Huang, F., Zhang, S.: Hallucination augmented contrastive learning for multimodal large languagemodel.In:ProceedingsoftheIEEE/CVFConferenceonComputerVision and Pattern Recognition. pp. 27036–27046 (2024)

  17. [17]

    Proceedings of the 2023 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics (2023)

    Lee, S., Park, S.H., Jo, Y., Seo, M.: Volcano: mitigating multimodal hallucination through self-feedback guided revision. Proceedings of the 2023 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics (2023)

  18. [18]

    In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition

    Leng, S., Zhang, H., Chen, G., Li, X., Lu, S., Miao, C., Bing, L.: Mitigating object hallucinations in large vision-language models through visual contrastive decoding. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 13872–13882 (2024)

  19. [19]

    In: International conference on machine learning

    Li, J., Li, D., Savarese, S., Hoi, S.: Blip-2: Bootstrapping language-image pre- training with frozen image encoders and large language models. In: International conference on machine learning. pp. 19730–19742. PMLR (2023)

  20. [20]

    VisualBERT: A Simple and Performant Baseline for Vision and Language

    Li, L.H., Yatskar, M., Yin, D., Hsieh, C.J., Chang, K.W.: Visualbert: A simple and performant baseline for vision and language. arXiv preprint arXiv:1908.03557 (2019)

  21. [21]

    Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing (2023) Context-Aware Attention Intervention for Mitigating Hallucinations 17

    Li, Y., Du, Y., Zhou, K., Wang, J., Zhao, W.X., Wen, J.R.: Evaluating object hallucination in large vision-language models. Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing (2023) Context-Aware Attention Intervention for Mitigating Hallucinations 17

  22. [22]

    In: European conference on computer vision

    Lin, T.Y., Maire, M., Belongie, S., Hays, J., Perona, P., Ramanan, D., Dollár, P., Zitnick, C.L.: Microsoft coco: Common objects in context. In: European conference on computer vision. pp. 740–755. Springer (2014)

  23. [23]

    Mitigating Hallucination in Large Multi-Modal Models via Robust Instruction Tuning

    Liu, F., Lin, K., Li, L., Wang, J., Yacoob, Y., Wang, L.: Mitigating hallucina- tion in large multi-modal models via robust instruction tuning. arXiv preprint arXiv:2306.14565 (2023)

  24. [24]

    A Survey on Hallucination in Large Vision-Language Models

    Liu, H., Xue, W., Chen, Y., Chen, D., Zhao, X., Wang, K., Hou, L., Li, R., Peng, W.: A survey on hallucination in large vision-language models. arXiv preprint arXiv:2402.00253 (2024)

  25. [25]

    In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition

    Liu, H., Li, C., Li, Y., Lee, Y.J.: Improved baselines with visual instruction tun- ing. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 26296–26306 (2024)

  26. [26]

    Advances in neural information processing systems36, 34892–34916 (2023)

    Liu, H., Li, C., Wu, Q., Lee, Y.J.: Visual instruction tuning. Advances in neural information processing systems36, 34892–34916 (2023)

  27. [27]

    In: The Thirteenth International Conference on Learning Representations (2025)

    Liu, S., Ye, H., Zou, J.: Reducing hallucinations in large vision-language models via latent space steering. In: The Thirteenth International Conference on Learning Representations (2025)

  28. [28]

    In: European Conference on Com- puter Vision

    Liu, S., Zheng, K., Chen, W.: Paying more attention to image: A training-free method for alleviating hallucination in lvlms. In: European Conference on Com- puter Vision. pp. 125–140. Springer (2024)

  29. [29]

    Visual-RFT: Visual Reinforcement Fine-Tuning

    Liu, Z., Sun, Z., Zang, Y., Dong, X., Cao, Y., Duan, H., Lin, D., Wang, J.: Visual- rft: Visual reinforcement fine-tuning. arXiv preprint arXiv:2503.01785 (2025)

  30. [30]

    In: The Thirteenth International Conference on Learning Representations (2025)

    Neo, C., Ong, L., Torr, P., Geva, M., Krueger, D., Barez, F.: Towards interpret- ing visual information processing in vision-language models. In: The Thirteenth International Conference on Learning Representations (2025)

  31. [31]

    Advances in neural information processing sys- tems35, 27730–27744 (2022)

    Ouyang, L., Wu, J., Jiang, X., Almeida, D., Wainwright, C., Mishkin, P., Zhang, C., Agarwal, S., Slama, K., Ray, A., et al.: Training language models to follow instructions with human feedback. Advances in neural information processing sys- tems35, 27730–27744 (2022)

  32. [32]

    Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing p

    Rohrbach, A., Hendricks, L.A., Burns, K., Darrell, T., Saenko, K.: Object hallu- cination in image captioning. Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing p. 4035–4045 (2018)

  33. [33]

    In: European conference on computer vision

    Schwenk, D., Khandelwal, A., Clark, C., Marino, K., Mottaghi, R.: A-okvqa: A benchmark for visual question answering using world knowledge. In: European conference on computer vision. pp. 146–162. Springer (2022)

  34. [34]

    VLM-R1: A Stable and Generalizable R1-style Large Vision-Language Model

    Shen, H., Liu, P., Li, J., Fang, C., Ma, Y., Liao, J., Shen, Q., Zhang, Z., Zhao, K., Zhang, Q., et al.: Vlm-r1: A stable and generalizable r1-style large vision-language model. arXiv preprint arXiv:2504.07615 (2025)

  35. [35]

    In: Proceedings of the IEEE/CVF international conference on computer vision

    Sun,C.,Myers,A.,Vondrick,C.,Murphy,K.,Schmid,C.:Videobert:Ajointmodel for video and language representation learning. In: Proceedings of the IEEE/CVF international conference on computer vision. pp. 7464–7473 (2019)

  36. [36]

    Aligning Large Multimodal Models with Factually Augmented RLHF

    Sun, Z., Shen, S., Cao, S., Liu, H., Li, C., Shen, Y., Gan, C., Gui, L.Y., Wang, Y.X., Yang, Y., et al.: Aligning large multimodal models with factually augmented rlhf. arXiv preprint arXiv:2309.14525 (2023)

  37. [37]

    arXiv preprint arXiv:2503.20752 (2025)

    Tan, H., Ji, Y., Hao, X., Lin, M., Wang, P., Wang, Z., Zhang, S.: Reason-rft: Reinforcement fine-tuning for visual reasoning. arXiv preprint arXiv:2503.20752 (2025)

  38. [38]

    In: Proceedings of the Computer Vision and Pattern Recognition Conference

    Tang, F., Liu, C., Xu, Z., Hu, M., Huang, Z., Xue, H., Chen, Z., Peng, Z., Yang, Z., Zhou, S., et al.: Seeing far and clearly: Mitigating hallucinations in mllms with attention causal decoding. In: Proceedings of the Computer Vision and Pattern Recognition Conference. pp. 26147–26159 (2025) 18 Lei et al

  39. [39]

    Team, Q.: Qwen3.5: Accelerating productivity with native multimodal agents (February 2026),https://qwen.ai/blog?id=qwen3.5

  40. [40]

    LLaMA: Open and Efficient Foundation Language Models

    Touvron, H., Lavril, T., Izacard, G., Martinet, X., Lachaux, M.A., Lacroix, T., Rozière, B., Goyal, N., Hambro, E., Azhar, F., et al.: Llama: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971 (2023)

  41. [41]

    Proceedings of the International Confer- ence on Computer Vision (2025)

    Wan, Z., Zhang, C., Yong, S., Ma, M.Q., Stepputtis, S., Morency, L.P., Ramanan, D., Sycara, K., Xie, Y.: Only: One-layer intervention sufficiently mitigates halluci- nations in large vision-language models. Proceedings of the International Confer- ence on Computer Vision (2025)

  42. [42]

    Proceedings of the 29th International Conference on Computational Linguistics p

    Wu, Y., Zhao, Y., Zhao, S., Zhang, Y., Yuan, X., Zhao, G., Jiang, N.: Overcoming language priors in visual question answering via distinguishing superficially simi- lar instances. Proceedings of the 29th International Conference on Computational Linguistics p. 5721–5729 (2022)

  43. [43]

    mPLUG-Owl: Modularization Empowers Large Language Models with Multimodality

    Ye, Q., Xu, H., Xu, G., Ye, J., Yan, M., Zhou, Y., Wang, J., Hu, A., Shi, P., Shi, Y., et al.: mplug-owl: Modularization empowers large language models with multimodality. arXiv preprint arXiv:2304.14178 (2023)

  44. [44]

    National Science Review11(12), nwae403 (2024)

    Yin, S., Fu, C., Zhao, S., Li, K., Sun, X., Xu, T., Chen, E.: A survey on multimodal large language models. National Science Review11(12), nwae403 (2024)

  45. [45]

    In: ProceedingsoftheIEEE/CVFConferenceonComputerVisionandPatternRecog- nition

    Yu, Q., Li, J., Wei, L., Pang, L., Ye, W., Qin, B., Tang, S., Tian, Q., Zhuang, Y.: Hallucidoctor: Mitigating hallucinatory toxicity in visual instruction data. In: ProceedingsoftheIEEE/CVFConferenceonComputerVisionandPatternRecog- nition. pp. 12944–12953 (2024)

  46. [46]

    In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition

    Yu, T., Yao, Y., Zhang, H., He, T., Han, Y., Cui, G., Hu, J., Liu, Z., Zheng, H.T., Sun, M., et al.: Rlhf-v: Towards trustworthy mllms via behavior alignment from fine-grained correctional human feedback. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 13807–13816 (2024)

  47. [47]

    Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics pp

    Yue, Z., Zhang, L., Jin, Q.: Less is more: Mitigating multimodal hallucination from an eos decision perspective. Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics pp. 11766–11781 (2024)

  48. [48]

    arXiv preprint arXiv:2601.20279 (2026)

    Zhang, X., Zhu, Y., Gu, C., Yuan, X., Zhao, Q., Cao, J., Tang, F., Fan, S., Shen, Y., Shen, C., et al.: Hallucination begins where saliency drops. arXiv preprint arXiv:2601.20279 (2026)

  49. [49]

    Beyond Hallucinations: Enhancing LVLMs through Hallucination-Aware Direct Preference Optimization

    Zhao, Z., Wang, B., Ouyang, L., Dong, X., Wang, J., He, C.: Beyond hallucinations: Enhancing lvlms through hallucination-aware direct preference optimization. arXiv preprint arXiv:2311.16839 (2023)

  50. [50]

    R1-Zero's "Aha Moment" in Visual Reasoning on a 2B Non-SFT Model

    Zhou, H., Li, X., Wang, R., Cheng, M., Zhou, T., Hsieh, C.J.: R1-zero’s" aha mo- ment" in visual reasoning on a 2b non-sft model. arXiv preprint arXiv:2503.05132 (2025)

  51. [51]

    The Eleventh International Conference on Learning Representations (2023)

    Zhou, Y., Cui, C., Yoon, J., Zhang, L., Deng, Z., Finn, C., Bansal, M., Yao, H.: Analyzing and mitigating object hallucination in large vision-language models. The Eleventh International Conference on Learning Representations (2023)

  52. [52]

    MiniGPT-4: Enhancing Vision-Language Understanding with Advanced Large Language Models

    Zhu, D., Chen, J., Shen, X., Li, X., Elhoseiny, M.: Minigpt-4: Enhancing vision- language understanding with advanced large language models. arXiv preprint arXiv:2304.10592 (2023)

  53. [53]

    Please describe this image in detail

    Zou, X., Wang, Y., Yan, Y., Lyu, Y., Zheng, K., Huang, S., Chen, J., Jiang, P., Liu, J., Tang, C., Hu, X.: Look twice before you answer: Memory-space visual retracing for hallucination mitigation in multimodal large language models. The Forty-second International Conference on Machine Learning (ICML) (2025) Context-Aware Attention Intervention for Mitigat...