pith. machine review for the scientific record.

arxiv: 2605.14621 · v1 · submitted 2026-05-14 · 💻 cs.CV · cs.AI · cs.CL

Recognition: no theorem link

Do We Really Need External Tools to Mitigate Hallucinations? SIRA: Shared-Prefix Internal Reconstruction of Attribution

Authors on Pith: no claims yet

Pith reviewed 2026-05-15 05:52 UTC · model grok-4.3

classification 💻 cs.CV · cs.AI · cs.CL
keywords hallucinations · vision-language models · contrastive decoding · internal reconstruction · attention masking · multimodal transformers · large vision-language models

The pith

Masking attention to image tokens after a shared prefix in vision-language transformers reduces hallucinations by contrasting against an internal language-prior reference.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Large vision-language models often generate text that follows language priors even when visual evidence is weak or ambiguous. Existing contrastive methods create external references by perturbing inputs or running extra passes, but these can introduce off-manifold artifacts and raise compute cost. SIRA instead keeps everything inside one model: image and text tokens first interact through a shared prefix that preserves alignment, history, and early grounding, then a forked branch in later layers masks attention to image positions. The resulting reference stays language-prior dominated yet retains the original decoding context, so token-level contrast can down-weight predictions that do not rely on continued visual access. Experiments on POPE, CHAIR, and AMBER with Qwen2.5-VL and LLaVA-v1.5 show lower hallucination rates, preserved coverage, and reduced overhead compared with two-pass baselines.

Core claim

SIRA constructs a counterfactual reference inside the same LVLM by first allowing image and text tokens to interact through a shared prefix that forms an aligned multimodal state, then forking a branch in later transformer layers where attention to image-token positions is masked. This branch retains the shared context and decoding history but lacks continued fine-grained visual evidence, producing a language-prior-dominated reference that enables token-level contrast without external perturbations or additional forward passes.
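The fork can be pictured as a per-layer attention-mask schedule. The sketch below is illustrative only, assuming a standard causal decoder; the function name, arguments, and boundary convention are inventions for exposition, not the authors' implementation.

```python
from math import inf

def make_branch_masks(seq_len, image_positions, n_layers, boundary):
    """Per-layer additive attention masks for a SIRA-style fork (illustrative).

    Layers below `boundary` keep ordinary causal attention: the shared prefix,
    where image and text tokens still interact. From `boundary` onward, the
    counterfactual branch additionally masks attention *to* image-token
    positions, so text tokens can no longer read fine-grained visual evidence.
    Entry [q][k] == 0.0 means query q may attend to key k; -inf blocks it.
    """
    masks = []
    for layer in range(n_layers):
        # standard causal mask: each position sees itself and earlier keys
        m = [[0.0 if k <= q else -inf for k in range(seq_len)]
             for q in range(seq_len)]
        if layer >= boundary:                  # counterfactual branch only
            for row in m:
                for p in image_positions:
                    row[p] = -inf              # block reads of image keys
        masks.append(m)
    return masks
```

The full branch simply never applies the extra image-key masking, so both branches share the prefix computation up to the boundary.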

What carries the argument

Shared-prefix internal reconstruction: an early multimodal interaction stage followed by a masked-image-attention branch in later layers that isolates language priors while preserving prompt interpretation and positional structure.

If this is right

  • Hallucination rates drop on POPE, CHAIR, and AMBER for both Qwen2.5-VL and LLaVA-v1.5 while descriptive coverage stays intact.
  • Overhead stays below that of two-pass contrastive decoding because only one forward pass plus an internal branch is required.
  • No training, external verifier, or perturbed input is needed, so the method applies directly to open-weight LVLMs with white-box access.
  • Suppression targets tokens whose strength persists even without late visual access, favoring outputs that depend on the full visual pathway.
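The review does not reproduce the paper's decoding equation, but the suppression behavior described above matches the generic contrastive-decoding recipe: amplify the full-pathway logits, subtract the reference logits, and gate by a plausibility cutoff. A minimal sketch, with `alpha` and `beta` as assumed hyperparameter names (SIRA's exact rule may differ):

```python
from math import exp

def contrastive_scores(logits_full, logits_ref, alpha=1.0, beta=0.1):
    """Token-level contrast in the generic contrastive-decoding style.

    `logits_full` comes from the full visual pathway, `logits_ref` from the
    masked internal branch. Tokens that stay strong without late visual access
    are down-weighted; `alpha` sets contrast strength. A plausibility cutoff
    (`beta` times the max full-branch probability) keeps the contrast from
    promoting tokens the full model itself finds implausible.
    """
    # softmax over the full-branch logits for the plausibility cutoff
    m = max(logits_full)
    exps = [exp(z - m) for z in logits_full]
    total = sum(exps)
    probs_full = [e / total for e in exps]
    cutoff = beta * max(probs_full)

    scores = []
    for p, zf, zr in zip(probs_full, logits_full, logits_ref):
        if p < cutoff:
            scores.append(float("-inf"))      # implausible under full pathway
        else:
            scores.append((1 + alpha) * zf - alpha * zr)
    return scores
```

Under this rule a token whose full-branch logit exceeds its reference-branch logit gains score, which is exactly the "advantage depends on the full visual pathway" behavior the summary describes.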

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The staged flow assumption suggests similar internal forking could be tested in other transformer-based multimodal systems where early and late layers handle distinct information types.
  • If the shared-prefix stage successfully locks in prompt semantics, the same pattern might help control factual drift in text-only models facing weak evidence.
  • Deployment on resource-limited devices could benefit because the internal branch replaces separate external passes.

Load-bearing premise

That masking attention to image tokens only in later layers produces a clean language-prior reference that preserves the original prompt interpretation and decoding history without introducing new artifacts.

What would settle it

Run SIRA on a benchmark where hallucinations arise mainly from early-layer visual mis-grounding rather than late-stage language dominance; if the proposed mechanism is what drives the gains, the method should show little or no reduction there, or even an increase in error rates.

Figures

Figures reproduced from arXiv: 2605.14621 by Junzhe Chen, Lijie Wen, Qiang Ju, Tian Qin, Tianshu Zhang, Yuqing Shi.

Figure 1. Comparison between representative prior inference-time mitigation methods and S…
Figure 2. SIRA overview. B in the figure denotes the boundary b = L−K in the text; layer numbers are illustrative and L is backbone-dependent.
Figure 3. Effect of split boundary.
Figure 4. Contrastive reference analysis: (a) layer-wise drift; (b) next-token KL; (c) stage-wise drift.
Figure 5. Case study and edge-case analysis on the AMBER benchmark using Qwen2.5-VL.
Figure 6. Additional case studies on the AMBER benchmark; eight AMBER cases are shown.
Original abstract

Large vision-language models (LVLMs) often hallucinate when language priors dominate weak or ambiguous visual evidence. Existing contrastive decoding methods mitigate this problem by comparing predictions from the original image with those from externally perturbed visual inputs, but such references can introduce off-manifold artifacts and require costly extra forward passes. We propose SIRA, a training-free internal contrastive decoding framework that constructs a counterfactual reference inside the same LVLM by exploiting the staged information flow of multimodal transformers. Instead of removing visual information from the input, SIRA first lets image and text tokens interact through a shared prefix, forming an aligned multimodal state that preserves prompt interpretation, decoding history, positional structure, and early visual grounding. It then forks a counterfactual branch in later transformer layers, where attention to image-token positions is masked. This branch retains the shared multimodal context but lacks continued access to fine-grained visual evidence, yielding a language-prior-dominated internal reference for token-level contrast. During decoding, SIRA suppresses tokens that remain strong without late visual access and favors predictions whose advantage depends on the full visual pathway. Experiments on POPE, CHAIR, and AMBER with Qwen2.5-VL and LLaVA-v1.5 show that SIRA consistently reduces hallucinations while preserving descriptive coverage and incurring lower overhead than two-pass contrastive decoding. SIRA requires no training, external verifier, or perturbed input, and applies to open-weight LVLMs with white-box inference access.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 2 minor

Summary. The manuscript proposes SIRA, a training-free internal contrastive decoding framework for mitigating hallucinations in large vision-language models. It constructs a counterfactual reference by first allowing full image-text interaction through a shared prefix (preserving prompt interpretation, decoding history, and early visual grounding) and then forking a branch in later transformer layers where attention to image-token positions is masked, yielding a language-prior-dominated internal reference for token-level contrast during decoding. Experiments on POPE, CHAIR, and AMBER with Qwen2.5-VL and LLaVA-v1.5 are claimed to show consistent hallucination reduction while preserving descriptive coverage and lower overhead than two-pass contrastive decoding.

Significance. If the internal masking mechanism produces a valid language-prior reference without residual visual artifacts or off-distribution effects, SIRA would represent a meaningful efficiency gain over external-perturbation methods by eliminating extra forward passes, perturbed inputs, and external verifiers while remaining applicable to open-weight LVLMs with white-box access.

major comments (1)
  1. [Method (forked-branch construction)] The central mechanism (shared-prefix fusion followed by attention masking on image tokens in later layers) assumes that the forked branch yields logits dominated purely by language priors. However, because the shared prefix permits full cross-modal interaction, later-layer hidden states for text tokens carry fused visual features via residual connections and value projections rather than solely via attention; masking attention scores alone therefore cannot excise this embedded information. This raises the risk that any contrastive benefit arises from the masking artifact itself rather than from removal of visual evidence. The manuscript should include hidden-state analysis, comparison to a no-image baseline, or ablation on residual pathways to substantiate the assumption.
minor comments (2)
  1. [Abstract] The abstract states that experiments show consistent gains but provides no quantitative metrics, ablation details, or error analysis; including at least headline numbers (e.g., POPE accuracy deltas) would strengthen the summary.
  2. [Decoding procedure] Notation for the contrastive decoding step (e.g., how the internal reference logits are combined with the original branch) should be formalized with an equation for reproducibility.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive comments on our work. We address the major comment regarding the forked-branch construction in SIRA below, and we will revise the manuscript accordingly to include additional analyses.

Point-by-point responses
  1. Referee: [Method (forked-branch construction)] The central mechanism (shared-prefix fusion followed by attention masking on image tokens in later layers) assumes that the forked branch yields logits dominated purely by language priors. However, because the shared prefix permits full cross-modal interaction, later-layer hidden states for text tokens carry fused visual features via residual connections and value projections rather than solely via attention; masking attention scores alone therefore cannot excise this embedded information. This raises the risk that any contrastive benefit arises from the masking artifact itself rather than from removal of visual evidence. The manuscript should include hidden-state analysis, comparison to a no-image baseline, or ablation on residual pathways to substantiate the assumption.

    Authors: We acknowledge that residual connections from the shared prefix do carry some fused multimodal information into the later layers. However, the attention masking in the forked branch is designed to halt further visual token integration, allowing the branch to rely more on language priors for subsequent predictions. This creates a meaningful contrast with the full visual pathway. To substantiate this, we will add hidden-state similarity analysis between the branches, a comparison to a no-image input baseline, and an ablation removing residual pathways where possible. These additions will clarify that the contrastive benefit stems from differential visual access rather than artifacts. revision: yes
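One concrete form the promised hidden-state analysis could take is a per-layer cosine similarity between the two branches for the same token position: if residual connections carry substantial fused visual information past the fork, similarity should stay high just after the boundary and decay gradually, whereas an abrupt drop would point at a masking artifact. The sketch below is purely illustrative (invented function names, not the authors' code):

```python
from math import sqrt

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = sqrt(sum(a * a for a in u))
    nv = sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def layerwise_branch_similarity(hidden_full, hidden_masked):
    """Per-layer cosine similarity between full and masked branches.

    `hidden_full[l]` / `hidden_masked[l]` hold the hidden state of the same
    token (e.g. the last decoded position) at layer l in each branch. The
    resulting curve shows how quickly the counterfactual branch diverges
    from the full visual pathway after the fork.
    """
    return [cosine(hf, hm) for hf, hm in zip(hidden_full, hidden_masked)]
```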

Circularity Check

0 steps flagged

SIRA's shared-prefix masking construction is an independent algorithmic proposal with no self-referential reductions.

full rationale

The paper defines SIRA directly as a training-free procedure: run a shared prefix through early layers to form aligned multimodal states, then fork a later-layer branch that masks attention to image-token positions while retaining the shared context. This yields an internal language-prior reference for contrastive decoding. No equations are presented whose outputs are algebraically identical to their inputs by construction; no parameters are fitted to data and then relabeled as predictions; no load-bearing uniqueness theorem or ansatz is imported via self-citation. The central claim rests on the architectural assumption that attention masking after early fusion produces a usable counterfactual, which is tested empirically on POPE, CHAIR, and AMBER rather than derived tautologically. The derivation chain is therefore self-contained and non-circular.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The method rests on the domain assumption that multimodal transformers exhibit staged information flow with early alignment and later refinement; no free parameters or invented entities are introduced.

axioms (1)
  • domain assumption: Multimodal transformers process information in stages, where early layers align image and text tokens and later layers refine predictions.
    Invoked to justify that a shared prefix preserves context and that late masking isolates language priors without breaking decoding history.

pith-pipeline@v0.9.0 · 5589 in / 1134 out tokens · 37064 ms · 2026-05-15T05:52:51.707966+00:00 · methodology

discussion (0)

