pith. sign in

arxiv: 2605.31429 · v1 · pith:4CPINE4Cnew · submitted 2026-05-29 · 💻 cs.CV

YARD: Y-Architecture Register Decoding for Efficient Hallucination Mitigation in Large Vision-Language Models

Pith reviewed 2026-06-28 23:17 UTC · model grok-4.3

classification 💻 cs.CV
keywords hallucination mitigationcontrastive decodinglarge vision-language modelsregister tokensy-architecturedecoder branchingtraining-free
0
0 comments X

The pith

Y-architecture branches decoder at middle layers and uses register tokens to create an efficient contrastive signal that reduces hallucinations in large vision-language models.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces YARD, a training-free method for contrastive decoding in LVLMs to mitigate hallucinations. It observes that text-to-vision grounding happens mainly in middle decoder layers and builds the degraded branch by branching there, replacing detailed visual tokens with register tokens that keep global meaning but lack local detail. This avoids the problems of previous methods like full image corruption or token dropping, which either cause language issues or require two slow passes. A sympathetic reader would care because it promises more reliable multimodal outputs at lower computational cost.

Core claim

YARD constructs the degraded branch internally by sharing shallow-layer computations and branching exactly at the critical middle stage. For the degraded branch, YARD replaces patch-level visual tokens with register tokens, which preserve global image semantics but lack fine-grained local evidence. This image-aware yet locally under-grounded design provides a faithful contrastive signal without extreme modality mismatch, while the Y-architecture strictly avoids a costly second forward pass.

What carries the argument

The Y-architecture that shares shallow decoder layers and branches at middle layers for a register-token degraded branch to generate the contrastive signal.

If this is right

  • YARD achieves state-of-the-art hallucination mitigation on generative and discriminative benchmarks across multiple LVLMs.
  • It reduces inference latency significantly by avoiding a second full forward pass.
  • The method prevents language hallucinations that arise from extreme visual degradation.
  • Register tokens provide a balanced contrastive signal that maintains global semantics without fine-grained evidence.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The approach suggests that internal branching in decoder layers can generalize to other efficiency techniques in multimodal models.
  • Future work could test if similar register-based degradation works in pure language models for other contrastive tasks.
  • Ablations on different layer branching points might reveal more about where grounding occurs in various architectures.

Load-bearing premise

Reliable text-to-vision grounding predominantly emerges in the middle decoder layers of LVLMs.

What would settle it

An experiment that applies YARD but branches at early or late layers instead of middle ones and measures if hallucination reduction and latency benefits disappear.

Figures

Figures reproduced from arXiv: 2605.31429 by Geng Li, Guan Huang, Guohao Chen, Jun Du, Langsheng Lei, Mai Chen, Ting Chen, Yu Hu.

Figure 1
Figure 1. Figure 1: Comparison between (a) pixel-level degrada [PITH_FULL_IMAGE:figures/full_fig_p001_1.png] view at source ↗
Figure 2
Figure 2. Figure 2: Motivating analysis of visual evidence flow. (a) Zero-out intervention shows that late-layer predictions [PITH_FULL_IMAGE:figures/full_fig_p003_2.png] view at source ↗
Figure 3
Figure 3. Figure 3: Overview of Y-Architecture Register Decoding. The results show that pixel-level degradation produces the largest increase in CHs/CHi under the w/o α setting, indicating that its degraded branch is contaminated by residual visual semantics and fails to precisely isolate hallucination components. Text-only degradation leads to a smaller increase, but this mainly comes from its image-agnostic na￾ture after co… view at source ↗
Figure 4
Figure 4. Figure 4: Qualitative comparison between baseline and Yard. Hallucinated sentences are highlighted in red. 6.3 Qualitative Analysis Are YARD responses more visually grounded? [PITH_FULL_IMAGE:figures/full_fig_p008_4.png] view at source ↗
Figure 5
Figure 5. Figure 5: Comparison of inference time and halluci [PITH_FULL_IMAGE:figures/full_fig_p012_5.png] view at source ↗
Figure 6
Figure 6. Figure 6: Next-token case study comparing clean and degraded branches. Pixel-level degradation stays close to the [PITH_FULL_IMAGE:figures/full_fig_p018_6.png] view at source ↗
Figure 7
Figure 7. Figure 7: Qualitative comparison between the baseline and YARD. The baseline repeatedly hallucinates [PITH_FULL_IMAGE:figures/full_fig_p020_7.png] view at source ↗
Figure 8
Figure 8. Figure 8: Qualitative comparison between the baseline and YARD. The baseline hallucinates unsupported [PITH_FULL_IMAGE:figures/full_fig_p020_8.png] view at source ↗
Figure 9
Figure 9. Figure 9: Qualitative comparison between the baseline and YARD. The baseline hallucinates an unsupported [PITH_FULL_IMAGE:figures/full_fig_p020_9.png] view at source ↗
Figure 10
Figure 10. Figure 10: Qualitative comparison between the baseline and YARD. The baseline hallucinates unsupported [PITH_FULL_IMAGE:figures/full_fig_p021_10.png] view at source ↗
Figure 11
Figure 11. Figure 11: Qualitative comparison between the baseline and YARD. The baseline hallucinates an unsupported [PITH_FULL_IMAGE:figures/full_fig_p021_11.png] view at source ↗
Figure 12
Figure 12. Figure 12: Qualitative comparison between the baseline and YARD. The baseline hallucinates unsupported [PITH_FULL_IMAGE:figures/full_fig_p021_12.png] view at source ↗
read the original abstract

Contrastive decoding (CD) seeks to mitigate hallucinations in Large Vision-Language Models (LVLMs) by contrasting the output distributions of a standard model and a visually degraded model. However, existing training-free CD methods suffer from sub-optimal degraded branches: completely dropping visual tokens is too extreme and induces language hallucinations, while corrupting input images offers coarse control over visual evidence and suffers from high inference latency due to requiring two full forward passes. To address these dilemmas, we propose YARD, a training-free Y-Architecture Register Decoding framework. Motivated by the observation that reliable text-to-vision grounding predominantly emerges in the middle decoder layers, YARD constructs the degraded branch internally by sharing shallow-layer computations and branching exactly at this critical stage. For the degraded branch, YARD replaces patch-level visual tokens with register tokens, which preserve global image semantics but lack fine-grained local evidence. This image-aware yet locally under-grounded design provides a faithful contrastive signal without extreme modality mismatch, while the Y-architecture strictly avoids a costly second forward pass. Extensive experiments on generative and discriminative hallucination benchmarks demonstrate that YARD consistently achieves state-of-the-art hallucination mitigation across multiple LVLMs, alongside a significant reduction in inference latency.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 0 minor

Summary. The paper proposes YARD, a training-free Y-Architecture Register Decoding method for hallucination mitigation in LVLMs. It constructs a degraded branch by sharing shallow decoder layers and branching at a middle layer, replacing patch tokens with register tokens (preserving global semantics but lacking fine-grained evidence) to generate a contrastive signal. This avoids the extremes of full visual dropout or image corruption and eliminates a second full forward pass, claiming SOTA performance on generative and discriminative hallucination benchmarks across multiple LVLMs plus reduced inference latency.

Significance. If the results hold, the Y-architecture provides a practical efficiency gain over prior contrastive decoding approaches by internalizing the degraded branch rather than requiring dual full passes. The design choice of register-token substitution for controlled local under-grounding is a targeted contribution to the contrastive signal quality.

major comments (1)
  1. [Abstract] Abstract: The central design decision to branch exactly at the middle decoder layer rests on the claim that 'reliable text-to-vision grounding predominantly emerges in the middle decoder layers,' yet the manuscript supplies no quantitative localization of grounding emergence, no layer-wise ablation of the branching point, and no cross-model consistency verification. This assumption is load-bearing for the validity of the contrastive signal; if the chosen layer does not mark the onset of stable grounding, the register substitution may either fail to provide sufficient contrast or introduce uncontrolled artifacts.

Simulated Author's Rebuttal

1 responses · 0 unresolved

We thank the referee for the careful reading and the substantive comment on the grounding-layer assumption. We address the point directly below.

read point-by-point responses
  1. Referee: [Abstract] Abstract: The central design decision to branch exactly at the middle decoder layer rests on the claim that 'reliable text-to-vision grounding predominantly emerges in the middle decoder layers,' yet the manuscript supplies no quantitative localization of grounding emergence, no layer-wise ablation of the branching point, and no cross-model consistency verification. This assumption is load-bearing for the validity of the contrastive signal; if the chosen layer does not mark the onset of stable grounding, the register substitution may either fail to provide sufficient contrast or introduce uncontrolled artifacts.

    Authors: We agree that the manuscript as submitted does not contain a dedicated quantitative analysis of when text-to-vision grounding emerges across layers. The middle-layer branching choice was selected on the basis of preliminary internal measurements that are not reported in the current version. In the revised manuscript we will add a new subsection (with accompanying figure) that (i) quantifies grounding emergence via layer-wise probing of attention to visual tokens, (ii) reports a full ablation of branching depth on the primary benchmarks, and (iii) repeats the ablation on all three LVLMs evaluated in the paper. These additions will directly address the requested localization, ablation, and cross-model verification. revision: yes

Circularity Check

0 steps flagged

No significant circularity; derivation is self-contained

full rationale

The paper motivates the Y-architecture from an observation about grounding emergence in middle decoder layers and introduces register-token substitution in the degraded branch as a structural choice to avoid a second forward pass. No load-bearing step reduces by construction to a fitted parameter, self-citation chain, or definitional equivalence; the central claims rest on empirical results across hallucination benchmarks rather than internal renaming or ansatz smuggling. The layer-choice premise is presented as motivation, not derived from the method itself.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 1 invented entities

The ledger is constructed from the abstract description; additional details may exist in the full paper.

axioms (1)
  • domain assumption Reliable text-to-vision grounding predominantly emerges in the middle decoder layers
    This observation motivates the choice of branching point in the Y-architecture.
invented entities (1)
  • Y-Architecture Register Decoding no independent evidence
    purpose: To enable efficient contrastive decoding by internal branching and register tokens
    New framework introduced in the paper.

pith-pipeline@v0.9.1-grok · 5768 in / 1116 out tokens · 39690 ms · 2026-06-28T23:17:28.106005+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

56 extracted references · 18 canonical work pages · 9 internal anchors

  1. [1]

    Shuai Bai, Yuxuan Cai, Ruizhe Chen, Keqin Chen, Xionghui Chen, Zesen Cheng, Lianghao Deng, Wei Ding, Chang Gao, Chunjiang Ge, and 1 others. 2025. Qwen3-vl technical report. arXiv preprint arXiv:2511.21631

  2. [2]

    Zechen Bai, Pichao Wang, Tianjun Xiao, Tong He, Zongbo Han, Zheng Zhang, and Mike Zheng Shou. 2024. Hallucination of multimodal large language models: A survey. arXiv preprint arXiv:2404.18930

  3. [3]

    Liang Chen, Haozhe Zhao, Tianyu Liu, Shuai Bai, Junyang Lin, Chang Zhou, and Baobao Chang. 2024 a . An image is worth 1/2 tokens after layer 2: Plug-and-play inference acceleration for large vision-language models. In European Conference on Computer Vision, pages 19--35. Springer

  4. [4]

    Zhaorun Chen, Zhuokai Zhao, Hongyin Luo, Huaxiu Yao, Bo Li, and Jiawei Zhou. 2024 b . Halc: Object hallucination reduction via adaptive focal-contrast decoding. arXiv preprint arXiv:2403.00425

  5. [5]

    Yung-Sung Chuang, Yujia Xie, Hongyin Luo, Yoon Kim, James R Glass, and Pengcheng He. 2024. Dola: Decoding by contrasting layers improves factuality in large language models. In International Conference on Learning Representations, volume 2024, pages 54158--54183

  6. [6]

    Wenliang Dai, Junnan Li, Dongxu Li, Anthony Tiong, Junqi Zhao, Weisheng Wang, Boyang Li, Pascale N Fung, and Steven Hoi. 2023. Instructblip: Towards general-purpose vision-language models with instruction tuning. Advances in neural information processing systems, 36:49250--49267

  7. [7]

    Timoth \'e e Darcet, Maxime Oquab, Julien Mairal, and Piotr Bojanowski. 2024. Vision transformers need registers. In International Conference on Learning Representations, volume 2024, pages 2632--2652

  8. [8]

    Jingyuan Deng and Yujiu Yang. 2025. Maskcd: Mitigating lvlm hallucinations by image head masked contrastive decoding. arXiv preprint arXiv:2510.02790

  9. [9]

    Yingqi Fan, Anhao Zhao, Jinlan Fu, Junlong Tong, Hui Su, Yijie Pan, Wei Zhang, and Xiaoyu Shen. 2025. Visipruner: Decoding discontinuous cross-modal dynamics for efficient multimodal llms. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pages 18896--18913

  10. [10]

    Alessandro Favero, Luca Zancato, Matthew Trager, Siddharth Choudhary, Pramuditha Perera, Alessandro Achille, Ashwin Swaminathan, and Stefano Soatto. 2024. Multi-modal hallucination control by visual information grounding. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 14303--14312

  11. [11]

    Laura Fieback, Nishilkumar Balar, Jakob Spiegelberg, and Hanno Gottschalk. 2025. Efficient contrastive decoding with probabilistic hallucination detection-mitigating hallucinations in large vision language models. arXiv preprint arXiv:2504.12137

  12. [12]

    Chaoyou Fu, Peixian Chen, Yunhang Shen, Yulei Qin, Mengdan Zhang, Xu Lin, Jinrui Yang, Xiawu Zheng, Ke Li, Xing Sun, and 1 others. 2025. Mme: A comprehensive evaluation benchmark for multimodal large language models. Advances in Neural Information Processing Systems, 38

  13. [13]

    Tianrui Guan, Fuxiao Liu, Xiyang Wu, Ruiqi Xian, Zongxia Li, Xiaoyu Liu, Xijun Wang, Lichang Chen, Furong Huang, Yaser Yacoob, and 1 others. 2024. Hallusionbench: an advanced diagnostic suite for entangled language hallucination and visual illusion in large vision-language models. In Proceedings of the IEEE/CVF conference on computer vision and pattern re...

  14. [14]

    Qidong Huang, Xiaoyi Dong, Pan Zhang, Bin Wang, Conghui He, Jiaqi Wang, Dahua Lin, Weiming Zhang, and Nenghai Yu. 2024. Opera: Alleviating hallucination in multi-modal large language models via over-trust penalty and retrospection-allocation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 13418--13427

  15. [15]

    Drew A Hudson and Christopher D Manning. 2019. Gqa: A new dataset for real-world visual reasoning and compositional question answering. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 6700--6709

  16. [16]

    Fushuo Huo, Wenchao Xu, Zhong Zhang, Haozhao Wang, Zhicheng Chen, and Peilin Zhao. 2024. Self-introspective decoding: Alleviating hallucinations for large vision-language models. arXiv preprint arXiv:2408.02032

  17. [17]

    Nicholas Jiang, Amil Dravid, Alexei Efros, and Yossi Gandelsman. 2025 a . Vision transformers don't need trained registers. Advances in neural information processing systems, 38:56557--56595

  18. [18]

    Zhangqi Jiang, Junkai Chen, Beier Zhu, Tingjin Luo, Yankun Shen, and Xu Yang. 2025 b . Devils in middle layers of large vision-language models: Interpreting, detecting and mitigating object hallucinations via attention lens. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 25004--25014

  19. [19]

    Jieun Kim, Jinmyeong Kim, Yoonji Kim, and Sung-Bae Cho. 2025. Fuzzy contrastive decoding to alleviate object hallucination in large vision-language models. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 20572--20581

  20. [20]

    Sihyeon Kim, Boryeong Cho, Sangmin Bae, Sumyeong Ahn, and Se-Young Yun. 2024. Vacode: Visual augmented contrastive decoding. arXiv preprint arXiv:2408.05337

  21. [21]

    Sicong Leng, Yun Xing, Zesen Cheng, Yang Zhou, Hang Zhang, Xin Li, Deli Zhao, Shijian Lu, Chunyan Miao, and Lidong Bing. 2025. The curse of multi-modalities: Evaluating hallucinations of large multimodal models across language, visual, and audio. Advances in Neural Information Processing Systems, 38

  22. [22]

    Sicong Leng, Hang Zhang, Guanzheng Chen, Xin Li, Shijian Lu, Chunyan Miao, and Lidong Bing. 2024. Mitigating object hallucinations in large vision-language models through visual contrastive decoding. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 13872--13882

  23. [23]

    Jiaming Li, Jiacheng Zhang, Zequn Jie, Lin Ma, and Guanbin Li. 2025 a . Mitigating hallucination for large vision language model by inter-modality correlation calibration decoding. arXiv preprint arXiv:2501.01926

  24. [24]

    Xiang Lisa Li, Ari Holtzman, Daniel Fried, Percy Liang, Jason Eisner, Tatsunori B Hashimoto, Luke Zettlemoyer, and Mike Lewis. 2023 a . Contrastive decoding: Open-ended text generation as optimization. In Proceedings of the 61st annual meeting of the association for computational linguistics (volume 1: Long papers), pages 12286--12312

  25. [25]

    Yanwei Li, Yuechen Zhang, Chengyao Wang, Zhisheng Zhong, Yixin Chen, Ruihang Chu, Shaoteng Liu, and Jiaya Jia. 2025 b . Mini-gemini: Mining the potential of multi-modality vision language models. IEEE Transactions on Pattern Analysis and Machine Intelligence

  26. [26]

    Yifan Li, Yifan Du, Kun Zhou, Jinpeng Wang, Xin Zhao, and Ji-Rong Wen. 2023 b . Evaluating object hallucination in large vision-language models. In Proceedings of the 2023 conference on empirical methods in natural language processing, pages 292--305

  27. [27]

    Hanchao Liu, Wenyuan Xue, Yifei Chen, Dapeng Chen, Xiutian Zhao, Ke Wang, Liping Hou, Rongjun Li, and Wei Peng. 2024 a . A survey on hallucination in large vision-language models. arXiv preprint arXiv:2402.00253

  28. [28]

    Haotian Liu, Chunyuan Li, Yuheng Li, and Yong Jae Lee. 2024 b . Improved baselines with visual instruction tuning. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 26296--26306

  29. [29]

    Haotian Liu, Chunyuan Li, Yuheng Li, Bo Li, Yuanhan Zhang, Sheng Shen, and Yong Jae Lee. 2024 c . Llavanext: Improved reasoning, ocr, and world knowledge

  30. [30]

    Yuan Liu, Haodong Duan, Yuanhan Zhang, Bo Li, Songyang Zhang, Wangbo Zhao, Yike Yuan, Jiaqi Wang, Conghui He, Ziwei Liu, and 1 others. 2024 d . Mmbench: Is your multi-modal model an all-around player? In European conference on computer vision, pages 216--233. Springer

  31. [31]

    Pan Lu, Swaroop Mishra, Tanglin Xia, Liang Qiu, Kai-Wei Chang, Song-Chun Zhu, Oyvind Tafjord, Peter Clark, and Ashwin Kalyan. 2022. Learn to explain: Multimodal reasoning via thought chains for science question answering. Advances in neural information processing systems, 35:2507--2521

  32. [32]

    Avshalom Manevich and Reut Tsarfaty. 2024. Mitigating hallucinations in large vision-language models (lvlms) via language-contrastive decoding (lcd). In Findings of the Association for Computational Linguistics: ACL 2024, pages 6008--6022

  33. [33]

    Sean O'Brien and Mike Lewis. 2023. Contrastive decoding improves reasoning in large language models. arXiv preprint arXiv:2309.09117

  34. [34]

    Maxime Oquab, Timoth \'e e Darcet, Th \'e o Moutakanni, Huy Vo, Marc Szafraniec, Vasil Khalidov, Pierre Fernandez, Daniel Haziza, Francisco Massa, Alaaeldin El-Nouby, and 1 others. 2023. Dinov2: Learning robust visual features without supervision. arXiv preprint arXiv:2304.07193

  35. [35]

    Nhi Pham and Michael Schott. 2024. H-pope: Hierarchical polling-based probing evaluation of hallucinations in large vision-language models. arXiv preprint arXiv:2411.04077

  36. [36]

    Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, and 1 others. 2021. Learning transferable visual models from natural language supervision. In International conference on machine learning, pages 8748--8763. PmLR

  37. [37]

    Anna Rohrbach, Lisa Anne Hendricks, Kaylee Burns, Trevor Darrell, and Kate Saenko. 2018. Object hallucination in image captioning. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 4035--4045

  38. [38]

    Amanpreet Singh, Vivek Natarajan, Meet Shah, Yu Jiang, Xinlei Chen, Dhruv Batra, Devi Parikh, and Marcus Rohrbach. 2019. Towards vqa models that can read. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 8317--8326

  39. [39]

    Mingjie Sun, Xinlei Chen, J Zico Kolter, and Zhuang Liu. 2024. Massive activations in large language models. arXiv preprint arXiv:2402.17762

  40. [40]

    Wei Suo, Lijun Zhang, Mengyang Sun, Lin Yuanbo Wu, Peng Wang, and Yanning Zhang. 2025. Octopus: Alleviating hallucination via dynamic contrastive decoding. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 29904--29914

  41. [41]

    Barrett Tang, Zile Huang, Chengzhi Liu, Qiang Sun, Harry Yang, and Ser-Nam Lim. 2025. Intervening anchor token: Decoding strategy in alleviating hallucinations for mllms. In International Conference on Learning Representations, volume 2025, pages 27745--27776

  42. [42]

    Bingkui Tong, Jiaer Xia, and Kaiyang Zhou. 2025. Mitigating hallucination in multimodal llms with layer contrastive decoding. arXiv preprint arXiv:2509.25177

  43. [43]

    Junyang Wang, Yuhang Wang, Guohai Xu, Jing Zhang, Yukai Gu, Haitao Jia, Jiaqi Wang, Haiyang Xu, Ming Yan, Ji Zhang, and 1 others. 2023. Amber: An llm-free multi-dimensional benchmark for mllms hallucination evaluation. arXiv preprint arXiv:2311.07397

  44. [44]

    Peng Wang, Shuai Bai, Sinan Tan, Shijie Wang, Zhihao Fan, Jinze Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, and 1 others. 2024 a . Qwen2-vl: Enhancing vision-language model's perception of the world at any resolution. arXiv preprint arXiv:2409.12191

  45. [45]

    Xintong Wang, Jingheng Pan, Liang Ding, and Chris Biemann. 2024 b . Mitigating hallucinations in large vision-language models with instruction contrastive decoding. In Findings of the Association for Computational Linguistics: ACL 2024, pages 15840--15853

  46. [46]

    Sangmin Woo, Donguk Kim, Jaehyuk Jang, Yubin Choi, and Changick Kim. 2025. Don’t miss the forest for the trees: Attentional vision calibration for large vision language models. In Findings of the Association for Computational Linguistics: ACL 2025, pages 1927--1951

  47. [47]

    Jiulong Wu, Yucheng Shen, Haixin Sun, and Min Cao. 2025. Mitigating hallucinations in large vision-language models via dual contrastive decoding. In Proceedings of the 7th ACM International Conference on Multimedia in Asia, pages 1--8

  48. [48]

    Guangxuan Xiao, Yuandong Tian, Beidi Chen, Song Han, and Mike Lewis. 2024. Efficient streaming language models with attention sinks. In International Conference on Learning Representations, volume 2024, pages 21875--21895

  49. [49]

    Dingchen Yang, Bowen Cao, Guang Chen, and Changjun Jiang. 2024. Pensieve: Retrospect-then-compare mitigates visual hallucination. arXiv preprint arXiv:2403.14401

  50. [50]

    Tianyu Yu, Yuan Yao, Haoye Zhang, Taiwen He, Yifeng Han, Ganqu Cui, Jinyi Hu, Zhiyuan Liu, Hai-Tao Zheng, Maosong Sun, and 1 others. 2024. Rlhf-v: Towards trustworthy mllms via behavior alignment from fine-grained correctional human feedback. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 13807--13816

  51. [51]

    Xiaofeng Zhang, Yihao Quan, Chen Shen, Chaochen Gu, Xiaosong Yuan, Shaotian Yan, Jiawei Cao, Hao Cheng, Kaijie Wu, and Jieping Ye. 2025 a . Shallow focus, deep fixes: Enhancing shallow layers vision attention sinks to alleviate hallucination in lvlms. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pages 3512--3534

  52. [52]

    Yuhui Zhang, Alyssa Unell, Xiaohan Wang, Dhruba Ghosh, Yuchang Su, Ludwig Schmidt, and Serena Yeung-Levy. 2024. Why are visually-grounded language models bad at image classification? Advances in Neural Information Processing Systems, 37:51727--51753

  53. [53]

    Zhi Zhang, Srishti Yadav, Fengze Han, and Ekaterina Shutova. 2025 b . Cross-modal information flow in multimodal large language models. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 19781--19791

  54. [54]

    Jianfei Zhao, Feng Zhang, Xin Sun, Lingxing Kong, Zhixing Tan, and Chong Feng. 2025. Cross-image contrastive decoding: Precise, lossless suppression of language priors in large vision-language models. arXiv preprint arXiv:2505.10634

  55. [55]

    Deyao Zhu, Xiaoqian Shen, Xiang Li, Mohamed Elhoseiny, and 1 others. 2024. Minigpt-4: Enhancing vision-language understanding with advanced large language models. In International Conference on Learning Representations, volume 2024, pages 18378--18394

  56. [56]

    Lanyun Zhu, Deyi Ji, Tianrun Chen, Peng Xu, Jieping Ye, and Jun Liu. 2025. Ibd: Alleviating hallucinations in large vision-language models via image-biased decoding. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 1624--1633