pith. machine review for the scientific record.

arxiv: 2605.14621 · v1 · submitted 2026-05-14 · 💻 cs.CV · cs.AI · cs.CL

Recognition: no theorem link

Do We Really Need External Tools to Mitigate Hallucinations? SIRA: Shared-Prefix Internal Reconstruction of Attribution

Authors on Pith: no claims yet

Pith reviewed 2026-05-15 05:52 UTC · model grok-4.3

classification 💻 cs.CV · cs.AI · cs.CL
keywords hallucinations · vision-language models · contrastive decoding · internal reconstruction · attention masking · multimodal transformers · large vision-language models

The pith

Masking attention to image tokens after a shared prefix in vision-language transformers reduces hallucinations by contrasting against an internal language-prior reference.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Large vision-language models often generate text that follows language priors even when visual evidence is weak or ambiguous. Existing contrastive methods create external references by perturbing inputs or running extra passes, but these can introduce off-manifold artifacts and raise compute cost. SIRA instead keeps everything inside one model: image and text tokens first interact through a shared prefix that preserves alignment, history, and early grounding, then a forked branch in later layers masks attention to image positions. The resulting reference stays language-prior dominated yet retains the original decoding context, so token-level contrast can down-weight predictions that do not rely on continued visual access. Experiments on POPE, CHAIR, and AMBER with Qwen2.5-VL and LLaVA-v1.5 show lower hallucination rates, preserved coverage, and reduced overhead compared with two-pass baselines.

Core claim

SIRA constructs a counterfactual reference inside the same LVLM by first allowing image and text tokens to interact through a shared prefix that forms an aligned multimodal state, then forking a branch in later transformer layers where attention to image-token positions is masked. This branch retains the shared context and decoding history but lacks continued fine-grained visual evidence, producing a language-prior-dominated reference that enables token-level contrast without external perturbations or additional forward passes.
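The fork can be pictured as a per-layer attention-mask schedule. The sketch below is illustrative only, assuming a standard causal decoder; the function name, arguments, and boundary convention are inventions for exposition, not the authors' implementation.

```python
from math import inf

def make_branch_masks(seq_len, image_positions, n_layers, boundary):
    """Per-layer additive attention masks for a SIRA-style fork (illustrative).

    Layers below `boundary` keep ordinary causal attention: the shared prefix,
    where image and text tokens still interact. From `boundary` onward, the
    counterfactual branch additionally masks attention *to* image-token
    positions, so text tokens can no longer read fine-grained visual evidence.
    Entry [q][k] == 0.0 means query q may attend to key k; -inf blocks it.
    """
    masks = []
    for layer in range(n_layers):
        # standard causal mask: each position sees itself and earlier keys
        m = [[0.0 if k <= q else -inf for k in range(seq_len)]
             for q in range(seq_len)]
        if layer >= boundary:                  # counterfactual branch only
            for row in m:
                for p in image_positions:
                    row[p] = -inf              # block reads of image keys
        masks.append(m)
    return masks
```

The full branch simply never applies the extra image-key masking, so both branches share the prefix computation up to the boundary.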

What carries the argument

Shared-prefix internal reconstruction: an early multimodal interaction stage followed by a masked-image-attention branch in later layers that isolates language priors while preserving prompt interpretation and positional structure.

If this is right

  • Hallucination rates drop on POPE, CHAIR, and AMBER for both Qwen2.5-VL and LLaVA-v1.5 while descriptive coverage stays intact.
  • Overhead stays below that of two-pass contrastive decoding because only one forward pass plus an internal branch is required.
  • No training, external verifier, or perturbed input is needed, so the method applies directly to open-weight LVLMs with white-box access.
  • Suppression targets tokens whose strength persists even without late visual access, favoring outputs that depend on the full visual pathway.
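The review does not reproduce the paper's decoding equation, but the suppression behavior described above matches the generic contrastive-decoding recipe: amplify the full-pathway logits, subtract the reference logits, and gate by a plausibility cutoff. A minimal sketch, with `alpha` and `beta` as assumed hyperparameter names (SIRA's exact rule may differ):

```python
from math import exp

def contrastive_scores(logits_full, logits_ref, alpha=1.0, beta=0.1):
    """Token-level contrast in the generic contrastive-decoding style.

    `logits_full` comes from the full visual pathway, `logits_ref` from the
    masked internal branch. Tokens that stay strong without late visual access
    are down-weighted; `alpha` sets contrast strength. A plausibility cutoff
    (`beta` times the max full-branch probability) keeps the contrast from
    promoting tokens the full model itself finds implausible.
    """
    # softmax over the full-branch logits for the plausibility cutoff
    m = max(logits_full)
    exps = [exp(z - m) for z in logits_full]
    total = sum(exps)
    probs_full = [e / total for e in exps]
    cutoff = beta * max(probs_full)

    scores = []
    for p, zf, zr in zip(probs_full, logits_full, logits_ref):
        if p < cutoff:
            scores.append(float("-inf"))      # implausible under full pathway
        else:
            scores.append((1 + alpha) * zf - alpha * zr)
    return scores
```

Under this rule a token whose full-branch logit exceeds its reference-branch logit gains score, which is exactly the "advantage depends on the full visual pathway" behavior the summary describes.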

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The staged flow assumption suggests similar internal forking could be tested in other transformer-based multimodal systems where early and late layers handle distinct information types.
  • If the shared-prefix stage successfully locks in prompt semantics, the same pattern might help control factual drift in text-only models facing weak evidence.
  • Deployment on resource-limited devices could benefit because the internal branch replaces separate external passes.

Load-bearing premise

That masking attention to image tokens only in later layers produces a clean language-prior reference that preserves the original prompt interpretation and decoding history without introducing new artifacts.

What would settle it

Run SIRA on a benchmark where hallucinations arise mainly from early-layer visual mis-grounding rather than late-stage language dominance; if the proposed mechanism is what drives the gains, the method should show little or no reduction there, or even an increase in error rates.

Figures

Figures reproduced from arXiv: 2605.14621 by Junzhe Chen, Lijie Wen, Qiang Ju, Tian Qin, Tianshu Zhang, Yuqing Shi.

Figure 1. Comparison between representative prior inference-time mitigation methods and S…
Figure 2. SIRA overview. B in the figure denotes the boundary b = L−K in the text; layer numbers are illustrative and L is backbone-dependent.
Figure 3. Effect of split boundary.
Figure 4. Contrastive reference analysis: (a) layer-wise drift; (b) next-token KL; (c) stage-wise drift.
Figure 5. Case study and edge-case analysis on the AMBER benchmark using Qwen2.5-VL.
Figure 6. Additional case studies on the AMBER benchmark; eight AMBER cases are shown.
Original abstract

Large vision-language models (LVLMs) often hallucinate when language priors dominate weak or ambiguous visual evidence. Existing contrastive decoding methods mitigate this problem by comparing predictions from the original image with those from externally perturbed visual inputs, but such references can introduce off-manifold artifacts and require costly extra forward passes. We propose SIRA, a training-free internal contrastive decoding framework that constructs a counterfactual reference inside the same LVLM by exploiting the staged information flow of multimodal transformers. Instead of removing visual information from the input, SIRA first lets image and text tokens interact through a shared prefix, forming an aligned multimodal state that preserves prompt interpretation, decoding history, positional structure, and early visual grounding. It then forks a counterfactual branch in later transformer layers, where attention to image-token positions is masked. This branch retains the shared multimodal context but lacks continued access to fine-grained visual evidence, yielding a language-prior-dominated internal reference for token-level contrast. During decoding, SIRA suppresses tokens that remain strong without late visual access and favors predictions whose advantage depends on the full visual pathway. Experiments on POPE, CHAIR, and AMBER with Qwen2.5-VL and LLaVA-v1.5 show that SIRA consistently reduces hallucinations while preserving descriptive coverage and incurring lower overhead than two-pass contrastive decoding. SIRA requires no training, external verifier, or perturbed input, and applies to open-weight LVLMs with white-box inference access.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 2 minor

Summary. The manuscript proposes SIRA, a training-free internal contrastive decoding framework for mitigating hallucinations in large vision-language models. It constructs a counterfactual reference by first allowing full image-text interaction through a shared prefix (preserving prompt interpretation, decoding history, and early visual grounding) and then forking a branch in later transformer layers where attention to image-token positions is masked, yielding a language-prior-dominated internal reference for token-level contrast during decoding. Experiments on POPE, CHAIR, and AMBER with Qwen2.5-VL and LLaVA-v1.5 are claimed to show consistent hallucination reduction while preserving descriptive coverage and lower overhead than two-pass contrastive decoding.

Significance. If the internal masking mechanism produces a valid language-prior reference without residual visual artifacts or off-distribution effects, SIRA would represent a meaningful efficiency gain over external-perturbation methods by eliminating extra forward passes, perturbed inputs, and external verifiers while remaining applicable to open-weight LVLMs with white-box access.

major comments (1)
  1. [Method (forked-branch construction)] The central mechanism (shared-prefix fusion followed by attention masking on image tokens in later layers) assumes that the forked branch yields logits dominated purely by language priors. However, because the shared prefix permits full cross-modal interaction, later-layer hidden states for text tokens carry fused visual features via residual connections and value projections rather than solely via attention; masking attention scores alone therefore cannot excise this embedded information. This raises the risk that any contrastive benefit arises from the masking artifact itself rather than from removal of visual evidence. The manuscript should include hidden-state analysis, comparison to a no-image baseline, or ablation on residual pathways to substantiate the assumption.
minor comments (2)
  1. [Abstract] The abstract states that experiments show consistent gains but provides no quantitative metrics, ablation details, or error analysis; including at least headline numbers (e.g., POPE accuracy deltas) would strengthen the summary.
  2. [Decoding procedure] Notation for the contrastive decoding step (e.g., how the internal reference logits are combined with the original branch) should be formalized with an equation for reproducibility.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive comments on our work. We address the major comment regarding the forked-branch construction in SIRA below, and we will revise the manuscript accordingly to include additional analyses.

Point-by-point responses
  1. Referee: [Method (forked-branch construction)] The central mechanism (shared-prefix fusion followed by attention masking on image tokens in later layers) assumes that the forked branch yields logits dominated purely by language priors. However, because the shared prefix permits full cross-modal interaction, later-layer hidden states for text tokens carry fused visual features via residual connections and value projections rather than solely via attention; masking attention scores alone therefore cannot excise this embedded information. This raises the risk that any contrastive benefit arises from the masking artifact itself rather than from removal of visual evidence. The manuscript should include hidden-state analysis, comparison to a no-image baseline, or ablation on residual pathways to substantiate the assumption.

    Authors: We acknowledge that residual connections from the shared prefix do carry some fused multimodal information into the later layers. However, the attention masking in the forked branch is designed to halt further visual token integration, allowing the branch to rely more on language priors for subsequent predictions. This creates a meaningful contrast with the full visual pathway. To substantiate this, we will add hidden-state similarity analysis between the branches, a comparison to a no-image input baseline, and an ablation removing residual pathways where possible. These additions will clarify that the contrastive benefit stems from differential visual access rather than artifacts. revision: yes
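One concrete form the promised hidden-state analysis could take is a per-layer cosine similarity between the two branches for the same token position: if residual connections carry substantial fused visual information past the fork, similarity should stay high just after the boundary and decay gradually, whereas an abrupt drop would point at a masking artifact. The sketch below is purely illustrative (invented function names, not the authors' code):

```python
from math import sqrt

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = sqrt(sum(a * a for a in u))
    nv = sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def layerwise_branch_similarity(hidden_full, hidden_masked):
    """Per-layer cosine similarity between full and masked branches.

    `hidden_full[l]` / `hidden_masked[l]` hold the hidden state of the same
    token (e.g. the last decoded position) at layer l in each branch. The
    resulting curve shows how quickly the counterfactual branch diverges
    from the full visual pathway after the fork.
    """
    return [cosine(hf, hm) for hf, hm in zip(hidden_full, hidden_masked)]
```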

Circularity Check

0 steps flagged

SIRA's shared-prefix masking construction is an independent algorithmic proposal with no self-referential reductions.

full rationale

The paper defines SIRA directly as a training-free procedure: run a shared prefix through early layers to form aligned multimodal states, then fork a later-layer branch that masks attention to image-token positions while retaining the shared context. This yields an internal language-prior reference for contrastive decoding. No equations are presented whose outputs are algebraically identical to their inputs by construction; no parameters are fitted to data and then relabeled as predictions; no load-bearing uniqueness theorem or ansatz is imported via self-citation. The central claim rests on the architectural assumption that attention masking after early fusion produces a usable counterfactual, which is tested empirically on POPE, CHAIR, and AMBER rather than derived tautologically. The derivation chain is therefore self-contained and non-circular.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The method rests on the domain assumption that multimodal transformers exhibit staged information flow with early alignment and later refinement; no free parameters or invented entities are introduced.

axioms (1)
  • domain assumption: Multimodal transformers process information in stages, where early layers align image and text tokens and later layers refine predictions.
    Invoked to justify that a shared prefix preserves context and that late masking isolates language priors without breaking decoding history.

pith-pipeline@v0.9.0 · 5589 in / 1134 out tokens · 37064 ms · 2026-05-15T05:52:51.707966+00:00 · methodology

discussion (0)

