pith. machine review for the scientific record.

arxiv: 2604.25642 · v1 · submitted 2026-04-28 · 💻 cs.CV · cs.AI

Recognition: unknown

Prefill-Time Intervention for Mitigating Hallucination in Large Vision-Language Models

Chenghao Sun, Chengsheng Zhang, Wei Li, Xinmei Tian, Xinyan Jiang

Pith reviewed 2026-05-07 16:45 UTC · model grok-4.3

classification 💻 cs.CV cs.AI
keywords hallucinations · vision-language models · prefill intervention · KV cache · steering vectors · multimodal understanding · error mitigation

The pith

Prefill-Time Intervention corrects hallucination-prone KV cache entries in vision-language models before decoding errors accumulate.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Large vision-language models often generate responses that contradict the input image, a failure known as hallucination. Previous attempts to steer the model with vectors during text generation can make the hallucinations that remain worse, because mistakes accumulate step by step. This paper introduces Prefill-Time Intervention, which applies a one-time adjustment to the model's initial memory cache right after it processes the image and prompt. The adjustment uses separate directions for the image and text parts, focusing keys on relevant objects and cleaning values of irrelevant noise. If this works, it provides a foundation for more reliable multimodal outputs that does not depend on changing how the model generates text afterward.

Core claim

The paper claims that intervening once during the prefill stage, by deriving modality-specific directions and decoupling the steering so that keys are pushed toward visually grounded objects while values filter out background noise, enhances the initial Key-Value cache and thereby mitigates hallucinations at their source, before any autoregressive generation begins.

What carries the argument

Prefill-Time Intervention (PTI), which performs a single modality-aware correction on the initial KV cache to steer keys toward grounded objects and values toward noise reduction.
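
As a rough illustration of this mechanism, and not the authors' implementation, the sketch below edits the prefill KV cache of a single layer/head once before decoding. The function name, the scalar strengths alpha and beta, the subtractive sign on the value edit, and the shape conventions are all assumptions for exposition.

    import torch

    def prefill_time_intervention(keys, values, visual_idx, text_idx,
                                  dirs, alpha=1.0, beta=1.0):
        """Hypothetical one-time edit of a prefill KV cache.

        keys, values: [num_prefill_tokens, head_dim] tensors for one
        layer/head. visual_idx, text_idx: index tensors marking image and
        text token positions. dirs: unit-norm steering directions per
        modality (their derivation is method-specific and not shown)."""
        keys, values = keys.clone(), values.clone()
        for idx, m in ((visual_idx, "visual"), (text_idx, "text")):
            # Keys are nudged toward visually grounded content...
            keys[idx] = keys[idx] + alpha * dirs[f"key_{m}"]
            # ...while values are nudged away from background noise.
            values[idx] = values[idx] - beta * dirs[f"value_{m}"]
        return keys, values  # decoding then proceeds unmodified

Decoding then reads from the corrected cache with no further intervention, which is what distinguishes this from decoding-time steering.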

If this is right

  • The intervention leads to better hallucination mitigation than methods applied only during decoding.
  • PTI maintains effectiveness across various LVLMs and different decoding approaches.
  • Combining PTI with decoding-stage techniques yields further improvements in performance.
  • The method avoids amplifying residual hallucinations by addressing issues early.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Early intervention on internal states could apply to reducing inconsistencies in other generative AI systems beyond vision-language models.
  • The decoupled key-value approach suggests a general way to separate content focus from noise suppression in attention mechanisms.
  • Testing PTI in combination with other grounding techniques might reveal ways to strengthen visual-text alignment further.

Load-bearing premise

That adjusting the KV cache only once at the beginning reliably prevents hallucinations from developing later without causing other inconsistencies in the model's responses.

What would settle it

Measuring hallucination rates on image-captioning benchmarks before and after applying PTI; if rates do not decrease or if new errors appear, the claim would be falsified.
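
A concrete version of that test could report the standard CHAIR metrics from Rohrbach et al. [33]; the sketch below simplifies object extraction to exact set membership, which is an assumption rather than the benchmark's synonym-aware matching.

    def chair_scores(mentioned, groundtruth):
        """CHAIR hallucination metrics for image captioning.

        mentioned[i]: list of objects named in caption i.
        groundtruth[i]: set of objects actually present in image i.
        Returns (CHAIR_i, CHAIR_s): hallucinated mentions over all
        mentions, and captions with any hallucination over all captions."""
        mentions = halluc = bad_caps = 0
        for objs, gt in zip(mentioned, groundtruth):
            bad = [o for o in objs if o not in gt]
            mentions += len(objs)
            halluc += len(bad)
            bad_caps += 1 if bad else 0
        return halluc / max(mentions, 1), bad_caps / max(len(mentioned), 1)

Lower scores after applying PTI, with recall of real objects held steady, would support the claim; flat or higher scores would falsify it.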

Figures

Figures reproduced from arXiv: 2604.25642 by Chenghao Sun, Chengsheng Zhang, Wei Li, Xinmei Tian, Xinyan Jiang.

Figure 1
Figure 1: Comparative analysis. (a) Decoding-Time Intervention methods continuously intervene in the hidden states of the prefill and generated tokens. (b) Our method applies modal-specific interventions to the KV cache only once in the prefill phase. view at source ↗
Figure 2
Figure 2: Quantitative analysis of LLaVA-1.5 on CHAIR Bench. view at source ↗
Figure 3
Figure 3: Pipeline overview of our PTI. PTI consists of two stages. view at source ↗
Figure 4
Figure 4: Performance comparison on MMHal-Bench, with results disaggregated by its eight question categories: attributes (ATTR), … view at source ↗
Figure 5
Figure 5: Internal interpretability analysis of visual cache intervention on LLaVA-1.5 across 300 randomly selected images from … view at source ↗
Figure 6
Figure 6: Ablation matrices for multi-modal KV cache intervention strength on LLaVA-1.5 with greedy decoding strategy. Brighter colors … view at source ↗
Figure 7
Figure 7: Ablation matrices for multi-modal KV cache intervention strength on Qwen-VL-Chat with greedy decoding strategy. view at source ↗
Figure 8
Figure 8: Ablation matrices for multi-modal KV cache intervention strength on DeepSeek-VL-Chat with greedy decoding strategy. view at source ↗
Figure 9
Figure 9: Visual analysis of cross-modal attention maps on LLaVA-1.5. For each sample, the hallucinated content is highlighted in … view at source ↗
Figure 10
Figure 10: Qualitative examples of LLaVA-1.5. Hallucinated contents are marked in … view at source ↗
Figure 11
Figure 11: Qualitative examples of Qwen-VL-Chat. Hallucinated contents are marked in … view at source ↗
Figure 12
Figure 12: Qualitative examples of DeepSeek-VL-Chat. Hallucinated contents are marked in … view at source ↗
read the original abstract

Large Vision-Language Models (LVLMs) have achieved remarkable progress in visual-textual understanding, yet their reliability is critically undermined by hallucinations, i.e., the generation of factually incorrect or inconsistent responses. While recent studies using steering vectors demonstrated promise in reducing hallucinations, a notable challenge remains: they inadvertently amplify the severity of residual hallucinations. We attribute this to their exclusive focus on the decoding stage, where errors accumulate autoregressively and progressively worsen subsequent hallucinatory outputs. To address this, we propose Prefill-Time Intervention (PTI), a novel steering paradigm that intervenes only once during the prefill stage, enhancing the initial Key-Value (KV) cache before error accumulation occurs. Specifically, PTI is modality-aware, deriving distinct directions for visual and textual representations. This intervention is decoupled to steer keys toward visually-grounded objects and values to filter background noise, correcting hallucination-prone representations at their source. Extensive experiments demonstrate PTI's significant performance in mitigating hallucinations and its generalizability across diverse decoding strategies, LVLMs, and benchmarks. Moreover, PTI is orthogonal to existing decoding-stage methods, enabling plug-and-play integration and further boosting performance. Code is available at: https://github.com/huaiyi66/PTI.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper proposes Prefill-Time Intervention (PTI) as a novel steering method for large vision-language models (LVLMs) to reduce hallucinations. Unlike prior decoding-stage steering vectors that can amplify residual errors, PTI applies a single modality-aware intervention to the initial KV cache during the prefill stage. It derives separate directions for visual and textual tokens and decouples the edit so that keys are steered toward visually-grounded objects while values filter background noise, thereby correcting hallucination-prone representations before autoregressive generation begins. The method is claimed to be orthogonal to existing decoding techniques, enabling plug-and-play combination, and is supported by extensive experiments showing gains across models, decoding strategies, and benchmarks.

Significance. If the central empirical claims hold, PTI offers a practically useful shift in the timing of hallucination mitigation for LVLMs, addressing a documented weakness of post-prefill interventions. The public code release is a clear strength that supports reproducibility. The orthogonality result, if robust, would allow incremental gains on top of existing methods without retraining.

major comments (2)
  1. [§3] §3 (Method): The core mechanistic claim—that decoupled key/value steering at prefill corrects representations 'at their source' by directing keys to visually-grounded objects and values to background filtering—rests on an unverified functional-role assumption. No attention-map analysis, object-grounding probe, or ablation that isolates key-only versus value-only interventions is reported to confirm these specific effects. If the assumed roles do not hold, the single prefill edit may merely shift the initial cache without preventing later reintroduction of inconsistencies.
  2. [§4] §4 (Experiments): The abstract states 'significant performance' and 'generalizability across diverse decoding strategies, LVLMs, and benchmarks,' yet the manuscript provides no details on exact baseline implementations, statistical controls (e.g., multiple-comparison correction), or variance across random seeds. Without these, the strength of evidence for the central claim that PTI reliably outperforms and combines with decoding-stage methods remains moderate.
minor comments (2)
  1. Notation for the modality-aware direction vectors and the decoupled steering operators should be introduced with explicit equations rather than prose descriptions to improve clarity; one possible form is sketched after this list.
  2. Figure captions and axis labels in the experimental results should explicitly state the metrics used (e.g., CHAIR, POPE) and whether error bars represent standard deviation or standard error.
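
To make minor comment 1 concrete, one plausible formalization of the decoupled, modality-aware edit, offered as a referee's guess rather than the paper's actual notation, is:

    % Hypothetical notation: m ranges over modalities {vis, txt},
    % i over prefill tokens of modality m; d^K_m, d^V_m are unit-norm
    % steering directions and \alpha_m, \beta_m intervention strengths
    % (cf. the ablation matrices in Figures 6-8).
    \tilde{k}_i = k_i + \alpha_m \, d^{K}_m, \qquad
    \tilde{v}_i = v_i + \beta_m \, d^{V}_m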

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and positive review, which highlights both the potential of PTI and areas where the manuscript can be strengthened. We address each major comment below and will revise the paper accordingly to improve clarity and rigor.

read point-by-point responses
  1. Referee: [§3] §3 (Method): The core mechanistic claim—that decoupled key/value steering at prefill corrects representations 'at their source' by directing keys to visually-grounded objects and values to background filtering—rests on an unverified functional-role assumption. No attention-map analysis, object-grounding probe, or ablation that isolates key-only versus value-only interventions is reported to confirm these specific effects. If the assumed roles do not hold, the single prefill edit may merely shift the initial cache without preventing later reintroduction of inconsistencies.

    Authors: We appreciate the referee's point on the need for direct verification of the mechanistic assumptions. The decoupled key/value design is motivated by the standard roles in attention mechanisms (keys for content matching and grounding, values for information aggregation), and the empirical gains across benchmarks provide supporting evidence for the overall approach. However, we acknowledge that the original submission lacks explicit ablations or attention analyses isolating these effects. In the revision, we will add (i) key-only vs. value-only vs. combined intervention ablations and (ii) qualitative attention-map comparisons before and after PTI to better substantiate or refine the functional-role interpretation (a sketch of such an ablation loop follows the responses). revision: yes

  2. Referee: [§4] §4 (Experiments): The abstract states 'significant performance' and 'generalizability across diverse decoding strategies, LVLMs, and benchmarks,' yet the manuscript provides no details on exact baseline implementations, statistical controls (e.g., multiple-comparison correction), or variance across random seeds. Without these, the strength of evidence for the central claim that PTI reliably outperforms and combines with decoding-stage methods remains moderate.

    Authors: We agree that greater experimental transparency is warranted to support the claims of performance and generalizability. The revised manuscript will include: detailed descriptions of baseline reproductions (including any hyperparameter choices for prior steering methods), results with means and standard deviations computed over multiple random seeds, and an explicit discussion of statistical practices (including the rationale for not applying multiple-comparison corrections in the primary baseline comparisons). These additions will provide a more robust foundation for the reported improvements and orthogonality findings. revision: yes
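
The ablation promised in response 1 could take the following shape, with hypothetical callables standing in for the model wrapper and the hallucination metric:

    def ablate_kv_roles(generate_fn, evaluate_fn, eval_set):
        """Hypothetical key/value-role ablation. generate_fn(image, prompt,
        mode) decodes from a PTI-edited (or unedited) prefill cache;
        evaluate_fn scores a list of outputs with a CHAIR-style metric."""
        modes = ("none", "key_only", "value_only", "combined")
        return {mode: evaluate_fn([generate_fn(img, p, mode=mode)
                                   for img, p in eval_set])
                for mode in modes}

If the assumed functional roles hold, the key-only and value-only edits should each improve the metric on their own, with the combined edit strongest.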

Circularity Check

0 steps flagged

No circularity: empirical intervention with independent experimental validation

full rationale

The paper presents PTI as an empirical steering method applied once at prefill to the KV cache, with modality-aware and decoupled key/value directions. No derivation chain, equations, or first-principles results are shown that reduce the claimed performance gains to a fitted parameter, self-defined quantity, or self-citation whose content is itself unverified. Experiments across models, benchmarks, and decoding strategies are reported as external validation. The decoupling of key/value roles is an ansatz justified by the observed outcomes rather than by tautological construction. This matches the default case of a self-contained empirical contribution with no load-bearing circular steps.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The approach rests on the standard transformer KV-cache mechanics and the empirical observation that decoding-stage steering can amplify residuals; no new mathematical axioms or invented physical entities are introduced.

free parameters (1)
  • steering directions for visual and textual representations
    Directions are derived from data during the method; the exact fitting procedure and any hyperparameters are not specified in the abstract. One conventional recipe is sketched after this ledger.
axioms (1)
  • domain assumption: Errors accumulate autoregressively during decoding and progressively worsen hallucinatory outputs
    This attribution for why prior steering methods fail is stated directly in the abstract as the motivation for moving the intervention to prefill time.
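
One conventional recipe for fitting such directions, assumed here for illustration and consistent with the paper's citation of principal component analysis [1], is the leading principal component of activation differences between grounded and hallucination-prone examples:

    import torch

    def derive_direction(pos_acts, neg_acts):
        """Hypothetical steering-direction fit. pos_acts, neg_acts:
        [num_examples, hidden_dim] activations from grounded vs.
        hallucination-prone inputs. Returns a unit-norm direction."""
        diffs = pos_acts - neg_acts
        diffs = diffs - diffs.mean(dim=0, keepdim=True)  # center for PCA
        # Leading right singular vector = first principal component.
        _, _, vh = torch.linalg.svd(diffs, full_matrices=False)
        return vh[0] / vh[0].norm()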

pith-pipeline@v0.9.0 · 5528 in / 1398 out tokens · 48383 ms · 2026-05-07T16:45:36.428530+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

61 extracted references · 25 canonical work pages · 9 internal anchors

  1. [1]

    Principal component analysis

    Hervé Abdi and Lynne J Williams. Principal component analysis. Wiley Interdisciplinary Reviews: Computational Statistics, 2(4):433–459, 2010. 4

  2. [2]

    GPT-4 Technical Report

    Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. GPT-4 technical report. arXiv preprint arXiv:2303.08774, 2023.

  3. [3]

    Mitigating object hallucinations in large vision-language models with assembly of global and local attention

    Wenbin An, Feng Tian, Sicong Leng, Jiahao Nie, Haonan Lin, QianYing Wang, Ping Chen, Xiaoqin Zhang, and Shijian Lu. Mitigating object hallucinations in large vision-language models with assembly of global and local attention. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 29915–29926, 2025. 1

  4. [4]

    Qwen-VL: A versatile vision-language model for understanding, localization, text reading, and beyond

    Jinze Bai, Shuai Bai, Shusheng Yang, Shijie Wang, Sinan Tan, Peng Wang, Junyang Lin, Chang Zhou, and Jingren Zhou. Qwen-VL: A versatile vision-language model for understanding, localization, text reading, and beyond, 2023. 1, 2, 3, 5

  5. [5]

    KV cache steering for inducing reasoning in small language models

    Max Belitsky, Dawid J Kopiczko, Michael Dorkenwald, M Jehanzeb Mirza, Cees GM Snoek, and Yuki M Asano. KV cache steering for inducing reasoning in small language models. arXiv e-prints, arXiv–2507, 2025. 2, 3

  6. [6]

    LLaVA steering: Visual instruction tuning with 500x fewer parameters through modality linear representation-steering

    Jinhe Bi, Yujun Wang, Haokun Chen, Xun Xiao, Artur Hecker, Volker Tresp, and Yunpu Ma. LLaVA steering: Visual instruction tuning with 500x fewer parameters through modality linear representation-steering. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 15230–15250, 2025. 2

  7. [7]

    ICT: Image-object cross-level trusted intervention for mitigating object hallucination in large vision-language models

    Junzhe Chen, Tianshu Zhang, Shiyu Huang, Yuwei Niu, Linfeng Zhang, Lijie Wen, and Xuming Hu. ICT: Image-object cross-level trusted intervention for mitigating object hallucination in large vision-language models. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 4209–4221, 2025. 1, 2, 3

  8. [8]

    Driving with LLMs: Fusing object-level vector modality for explainable autonomous driving

    Long Chen, Oleg Sinavski, Jan Hünermann, Alice Karnsund, Andrew James Willmott, Danny Birch, Daniel Maund, and Jamie Shotton. Driving with LLMs: Fusing object-level vector modality for explainable autonomous driving. In 2024 IEEE International Conference on Robotics and Automation (ICRA), pages 14093–14100. IEEE, 2024. 1

  9. [9]

    Vicuna: An open-source chatbot impressing GPT-4 with 90%* ChatGPT quality

    Wei-Lin Chiang, Zhuohan Li, Ziqing Lin, Ying Sheng, Zhanghao Wu, Hao Zhang, Lianmin Zheng, Siyuan Zhuang, Yonghao Zhuang, Joseph E Gonzalez, et al. Vicuna: An open-source chatbot impressing GPT-4 with 90%* ChatGPT quality. See https://vicuna.lmsys.org (accessed 14 April 2023), 2(3):6, 2023. 1

  10. [10]

    An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale

    Alexey Dosovitskiy. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929, 2020. 3, 4

  11. [11]

    Not all heads matter: A head-level KV cache compression method with integrated retrieval and reasoning

    Yu Fu, Zefan Cai, Abedelkadir Asi, Wayne Xiong, Yue Dong, and Wen Xiao. Not all heads matter: A head-level KV cache compression method with integrated retrieval and reasoning. In The Thirteenth International Conference on Learning Representations, 2025. 2

  12. [12]

    Textual steering vectors can improve visual understanding in multimodal large language models

    Woody Haosheng Gan, Deqing Fu, Julian Asilis, Ollie Liu, Dani Yogatama, Vatsal Sharan, Robin Jia, and Willie Neiswanger. Textual steering vectors can improve visual understanding in multimodal large language models. arXiv preprint arXiv:2505.14071, 2025. 4

  13. [13]

    OPERA: Alleviating hallucination in multi-modal large language models via over-trust penalty and retrospection-allocation

    Qidong Huang, Xiaoyi Dong, Pan Zhang, Bin Wang, Conghui He, Jiaqi Wang, Dahua Lin, Weiming Zhang, and Nenghai Yu. OPERA: Alleviating hallucination in multi-modal large language models via over-trust penalty and retrospection-allocation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 13418–13427, 2024. 5, 6

  14. [14]

    How good are low-bit quantized LLaMA3 models? An empirical study

    Wei Huang, Xudong Ma, Haotong Qin, Xingyu Zheng, Chengtao Lv, Hong Chen, Jie Luo, Xiaojuan Qi, Xianglong Liu, and Michele Magno. How good are low-bit quantized LLaMA3 models? An empirical study. CoRR, 2024. 1

  15. [15]

    GPT-4o System Card

    Aaron Hurst, Adam Lerer, Adam P Goucher, Adam Perelman, Aditya Ramesh, Aidan Clark, AJ Ostrow, Akila Welihinda, Alan Hayes, Alec Radford, et al. GPT-4o system card. arXiv preprint arXiv:2410.21276, 2024. 3

  16. [16]

    Segment anything

    Alexander Kirillov, Eric Mintun, Nikhila Ravi, Hanzi Mao, Chloe Rolland, Laura Gustafson, Tete Xiao, Spencer Whitehead, Alexander C Berg, Wan-Yen Lo, et al. Segment anything. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 4015–4026, 2023. 3

  17. [17]

    Mitigating object hallucinations in large vision-language models through visual contrastive decoding

    Sicong Leng, Hang Zhang, Guanzheng Chen, Xin Li, Shijian Lu, Chunyan Miao, and Lidong Bing. Mitigating object hallucinations in large vision-language models through visual contrastive decoding. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 13872–13882, 2024. 3, 5, 6, 13, 14

  18. [18]

    Otter: A multi-modal model with in-context instruction tuning

    Bo Li, Yuanhan Zhang, Liangyu Chen, Jinghao Wang, Fanyi Pu, Joshua Adrian Cahyono, Jingkang Yang, Chunyuan Li, and Ziwei Liu. Otter: A multi-modal model with in-context instruction tuning. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2025. 1

  19. [19]

    Inference-time intervention: Eliciting truthful answers from a language model

    Kenneth Li, Oam Patel, Fernanda Viégas, Hanspeter Pfister, and Martin Wattenberg. Inference-time intervention: Eliciting truthful answers from a language model. Advances in Neural Information Processing Systems, 36:41451–41530, 2023.

  20. [20]

    CAI: Caption-sensitive attention intervention for mitigating object hallucination in large vision-language models

    Qiming Li, Zekai Ye, Xiaocheng Feng, Weihong Zhong, Libo Qin, Ruihan Chen, Baohang Li, Kui Jiang, Yaowei Wang, Ting Liu, et al. CAI: Caption-sensitive attention intervention for mitigating object hallucination in large vision-language models. arXiv preprint arXiv:2506.23590, 2025. 2, 12

  21. [21]

    Evaluating Object Hallucination in Large Vision-Language Models

    Yifan Li, Yifan Du, Kun Zhou, Jinpeng Wang, Wayne Xin Zhao, and Ji-Rong Wen. Evaluating object hallucination in large vision-language models. arXiv preprint arXiv:2305.10355, 2023. 1, 3, 5

  22. [22]

    FairSteer: Inference time debiasing for LLMs with dynamic activation steering

    Yichen Li, Zhiting Fan, Ruizhe Chen, Xiaotang Gai, Luqi Gong, Yan Zhang, and Zuozhu Liu. FairSteer: Inference time debiasing for LLMs with dynamic activation steering. In Findings of the Association for Computational Linguistics: ACL 2025, pages 11293–11312, Vienna, Austria, 2025. Association for Computational Linguistics. 2

  23. [23]

    The hidden life of tokens: Reducing hallucination of large vision-language models via visual information steering

    Zhuowei Li, Haizhou Shi, Yunhe Gao, Di Liu, Zhenting Wang, Yuxiao Chen, Ting Liu, Long Zhao, Hao Wang, and Dimitris N Metaxas. The hidden life of tokens: Reducing hallucination of large vision-language models via visual information steering. arXiv preprint arXiv:2502.03628, 2025. 1, 2, 3, 4, 5, 6, 7, 8, 13, 14, 15

  24. [24]

    Microsoft COCO: Common objects in context

    Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C Lawrence Zitnick. Microsoft COCO: Common objects in context. In European Conference on Computer Vision, pages 740–755. Springer, 2014. 3, 4, 5, 12

  25. [25]

    Mitigating Hallucination in Large Multi-Modal Models via Robust Instruction Tuning

    Fuxiao Liu, Kevin Lin, Linjie Li, Jianfeng Wang, Yaser Yacoob, and Lijuan Wang. Mitigating hallucination in large multi-modal models via robust instruction tuning. arXiv preprint arXiv:2306.14565, 2023. 1

  26. [26]

    Visual instruction tuning

    Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning. Advances in Neural Information Processing Systems, 36:34892–34916, 2023. 1, 2

  27. [27]

    Improved baselines with visual instruction tuning

    Haotian Liu, Chunyuan Li, Yuheng Li, and Yong Jae Lee. Improved baselines with visual instruction tuning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 26296–26306, 2024. 1, 2, 5

  28. [28]

    Deliberation in latent space via differentiable cache augmentation

    Luyang Liu, Jonas Pfeiffer, Jiaxing Wu, Jun Xie, and Arthur Szlam. Deliberation in latent space via differentiable cache augmentation. arXiv preprint arXiv:2412.17747, 2024. 3

  29. [29]

    Reducing hallucinations in vision-language models via latent space steering

    Sheng Liu, Haotian Ye, Lei Xing, and James Zou. Reducing hallucinations in vision-language models via latent space steering. arXiv preprint arXiv:2410.15778, 2024. 1, 2, 3, 4, 5, 6, 7, 13, 14, 15

  30. [30]

    Paying more attention to image: A training-free method for alleviating hallucination in LVLMs

    Shi Liu, Kecheng Zheng, and Wei Chen. Paying more attention to image: A training-free method for alleviating hallucination in LVLMs. In European Conference on Computer Vision, pages 125–140. Springer, 2024. 5, 6, 7, 8, 12, 13, 14

  31. [31]

    DeepSeek-VL: Towards Real-World Vision-Language Understanding

    Haoyu Lu, Wen Liu, Bo Zhang, Bingxuan Wang, Kai Dong, Bo Liu, Jingxiang Sun, Tongzheng Ren, Zhuoshu Li, Hao Yang, et al. DeepSeek-VL: Towards real-world vision-language understanding. arXiv preprint arXiv:2403.05525, 2024.

  32. [32]

    Learning transferable visual models from natural language supervision

    Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. In International Conference on Machine Learning, pages 8748–8763. PMLR, 2021. 3

  33. [33]

    Object Hallucination in Image Captioning

    Anna Rohrbach, Lisa Anne Hendricks, Kaylee Burns, Trevor Darrell, and Kate Saenko. Object hallucination in image captioning. arXiv preprint arXiv:1809.02156, 2018. 1, 2, 3, 5, 14

  34. [34]

    Video reasoning without training

    Deepak Sridhar, Kartikeya Bhardwaj, Jeya Pradha Jeyaraj, Nuno Vasconcelos, Ankita Nayak, and Harris Teague. Video reasoning without training. arXiv preprint arXiv:2510.17045, 2025. 2

  35. [35]

    Aligning large multimodal models with factually augmented RLHF

    Zhiqing Sun, Sheng Shen, Shengcao Cao, Haotian Liu, Chunyuan Li, Yikang Shen, Chuang Gan, Liang-Yan Gui, Yu-Xiong Wang, Yiming Yang, et al. Aligning large multimodal models with factually augmented RLHF. arXiv preprint arXiv:2309.14525, 2023. 1, 5, 6

  36. [36]

    Octopus: Alleviating hallucination via dynamic contrastive decoding

    Wei Suo, Lijun Zhang, Mengyang Sun, Lin Yuanbo Wu, Peng Wang, and Yanning Zhang. Octopus: Alleviating hallucination via dynamic contrastive decoding. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 29904–29914, 2025. 1, 3

  37. [37]

    Seeing far and clearly: Mitigating hallucinations in MLLMs with attention causal decoding

    Feilong Tang, Chengzhi Liu, Zhongxing Xu, Ming Hu, Zile Huang, Haochen Xue, Ziyang Chen, Zelin Peng, Zhiwei Yang, Sijin Zhou, et al. Seeing far and clearly: Mitigating hallucinations in MLLMs with attention causal decoding. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 26147–26159, 2025. 2, 3, 12

  38. [38]

    Gemma 3 Technical Report

    Gemma Team, Aishwarya Kamath, Johan Ferret, Shreya Pathak, Nino Vieillard, Ramona Merhej, Sarah Perrin, Tatiana Matejovicova, Alexandre Ramé, Morgane Rivière, et al. Gemma 3 technical report. arXiv preprint arXiv:2503.19786, 2025. 1

  39. [39]

    Llama 2: Open Foundation and Fine-Tuned Chat Models

    Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, et al. Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288, 2023. 1, 2

  40. [40]

    VL-Cache: Sparsity and modality-aware KV cache compression for vision-language model inference acceleration

    Dezhan Tu, Danylo Vashchilenko, Yuzhe Lu, and Panpan Xu. VL-Cache: Sparsity and modality-aware KV cache compression for vision-language model inference acceleration. arXiv preprint arXiv:2410.23317, 2024. 2

  41. [41]

    Natural language processing with Python and spaCy: A practical introduction

    Yuli Vasiliev. Natural language processing with Python and spaCy: A practical introduction. No Starch Press, 2020. 4

  42. [42]

    Attention is all you need

    Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. Advances in Neural Information Processing Systems, 30, 2017. 2

  43. [43]

    In-distribution steering: Balancing control and coherence in language model generation

    Arthur Vogels, Benjamin Wong, Yann Choho, Annabelle Blangero, and Milan Bhan. In-distribution steering: Balancing control and coherence in language model generation. arXiv preprint arXiv:2510.13285, 2025. 2

  44. [44]

    Look-M: Look-once optimization in KV cache for efficient multimodal long-context inference

    Zhongwei Wan, Ziang Wu, Che Liu, Jinfa Huang, Zhihong Zhu, Peng Jin, Longyue Wang, and Li Yuan. Look-M: Look-once optimization in KV cache for efficient multimodal long-context inference. arXiv preprint arXiv:2406.18139, 2024. 2

  45. [45]

    MEDA: Dynamic KV cache allocation for efficient multimodal long-context inference

    Zhongwei Wan, Hui Shen, Xin Wang, Che Liu, Zheda Mai, and Mi Zhang. MEDA: Dynamic KV cache allocation for efficient multimodal long-context inference. arXiv preprint arXiv:2502.17599, 2025. 2

  46. [46]

    AMBER: An LLM-free multi-dimensional benchmark for MLLMs hallucination evaluation

    Junyang Wang, Yuhang Wang, Guohai Xu, Jing Zhang, Yukai Gu, Haitao Jia, Jiaqi Wang, Haiyang Xu, Ming Yan, Ji Zhang, and Jitao Sang. AMBER: An LLM-free multi-dimensional benchmark for MLLMs hallucination evaluation, 2023.

  47. [47]

    METok: Multi-stage event-based token compression for efficient long video understanding

    Mengyue Wang, Shuo Chen, Kristian Kersting, Volker Tresp, and Yunpu Ma. METok: Multi-stage event-based token compression for efficient long video understanding. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pages 18881–18895, Suzhou, China, 2025. Association for Computational Linguistics. 2

  48. [48]

    Mitigating hallucinations in large vision-language models with instruction contrastive decoding

    Xintong Wang, Jingheng Pan, Liang Ding, and Chris Biemann. Mitigating hallucinations in large vision-language models with instruction contrastive decoding. arXiv preprint arXiv:2403.18715, 2024. 3

  49. [49]

    Model tells you where to merge: Adaptive KV cache merging for LLMs on long-context tasks

    Zheng Wang, Boxiao Jin, Zhongzhi Yu, and Minjia Zhang. Model tells you where to merge: Adaptive KV cache merging for LLMs on long-context tasks. arXiv preprint arXiv:2407.08454, 2024. 2

  50. [50]

    Fast-dLLM: Training-free acceleration of diffusion LLM by enabling KV cache and parallel decoding

    Chengyue Wu, Hao Zhang, Shuchen Xue, Zhijian Liu, Shizhe Diao, Ligeng Zhu, Ping Luo, Song Han, and Enze Xie. Fast-dLLM: Training-free acceleration of diffusion LLM by enabling KV cache and parallel decoding. arXiv preprint arXiv:2505.22618, 2025. 2

  51. [51]

    Antidote: A unified framework for mitigating LVLM hallucinations in counterfactual presupposition and object perception

    Yuanchen Wu, Lu Zhang, Hang Yao, Junlong Du, Ke Yan, Shouhong Ding, Yunsheng Wu, and Xiaoqiang Li. Antidote: A unified framework for mitigating LVLM hallucinations in counterfactual presupposition and object perception. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 14646–14656, 2025. 1

  52. [52]

    OTE: Exploring accurate scene text recognition using one token

    Jianjun Xu, Yuxin Wang, Hongtao Xie, and Yongdong Zhang. OTE: Exploring accurate scene text recognition using one token. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 28327–28336, 2024. 4

  53. [53]

    Qwen3 Technical Report

    An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, et al. Qwen3 technical report. arXiv preprint arXiv:2505.09388, 2025. 1

  54. [54]

    Improving factuality in large language models via decoding-time hallucinatory and truthful comparators

    Dingkang Yang, Dongling Xiao, Jinjie Wei, Mingcheng Li, Zhaoyu Chen, Ke Li, and Lihua Zhang. Improving factuality in large language models via decoding-time hallucinatory and truthful comparators. In Proceedings of the AAAI Conference on Artificial Intelligence, pages 25606–25614, 2025. 1

  55. [55]

    Nullu: Mitigating object hallucinations in large vision-language models via HalluSpace projection

    Le Yang, Ziwei Zheng, Boxu Chen, Zhengyu Zhao, Chenhao Lin, and Chao Shen. Nullu: Mitigating object hallucinations in large vision-language models via HalluSpace projection. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 14635–14645, 2025. 1, 2, 4

  56. [56]

    ClearSight: Visual signal enhancement for object hallucination mitigation in multimodal large language models

    Hao Yin, Guangzong Si, and Zilei Wang. ClearSight: Visual signal enhancement for object hallucination mitigation in multimodal large language models. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 14625–14634, 2025. 1, 3

  57. [57]

    A survey on multimodal large language models

    Shukang Yin, Chaoyou Fu, Sirui Zhao, Ke Li, Xing Sun, Tong Xu, and Enhong Chen. A survey on multimodal large language models. National Science Review, 11(12):nwae403, 2024. 1, 5, 7

  58. [58]

    HalluciDoctor: Mitigating hallucinatory toxicity in visual instruction data

    Qifan Yu, Juncheng Li, Longhui Wei, Liang Pang, Wentao Ye, Bosheng Qin, Siliang Tang, Qi Tian, and Yueting Zhuang. HalluciDoctor: Mitigating hallucinatory toxicity in visual instruction data. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 12944–12953, 2024. 1

  59. [59]

    Sigmoid loss for language image pre-training

    Xiaohua Zhai, Basil Mustafa, Alexander Kolesnikov, and Lucas Beyer. Sigmoid loss for language image pre-training. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 11975–11986, 2023. 3

  60. [60]

    How language model hallucinations can snowball

    Muru Zhang, Ofir Press, William Merrill, Alisa Liu, and Noah A. Smith. How language model hallucinations can snowball. In Forty-first International Conference on Machine Learning, 2024. 2
