Do We Really Need External Tools to Mitigate Hallucinations? SIRA: Shared-Prefix Internal Reconstruction of Attribution
Pith reviewed 2026-05-15 05:52 UTC · model grok-4.3
The pith
Masking attention to image tokens after a shared prefix in vision-language transformers reduces hallucinations by contrasting against an internal language-prior reference.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
SIRA constructs a counterfactual reference inside the same LVLM by first allowing image and text tokens to interact through a shared prefix that forms an aligned multimodal state, then forking a branch in later transformer layers where attention to image-token positions is masked. This branch retains the shared context and decoding history but lacks continued fine-grained visual evidence, producing a language-prior-dominated reference that enables token-level contrast without external perturbations or additional forward passes.
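To make the forked computation concrete, here is a minimal PyTorch sketch of the single-pass fork, assuming a toy decoder stack; `ToyLayer`, `fork_layer`, and `image_positions` are illustrative stand-ins, not the paper's implementation.

```python
# Minimal sketch of SIRA-style forking, assuming a toy decoder stack.
# All names (ToyLayer, fork_layer, image_positions) are illustrative.
import torch
import torch.nn as nn

class ToyLayer(nn.Module):
    """Single-head self-attention + MLP; enough to show the masking fork."""
    def __init__(self, d):
        super().__init__()
        self.attn = nn.MultiheadAttention(d, num_heads=1, batch_first=True)
        self.mlp = nn.Sequential(nn.Linear(d, d), nn.GELU(), nn.Linear(d, d))

    def forward(self, x, key_padding_mask=None):
        # Positions where key_padding_mask is True are ignored as attention keys.
        a, _ = self.attn(x, x, x, key_padding_mask=key_padding_mask)
        x = x + a                # residual: fused visual info survives masking
        return x + self.mlp(x)

d, seq, n_layers, fork_layer = 32, 10, 6, 3
image_positions = torch.zeros(1, seq, dtype=torch.bool)
image_positions[:, :4] = True    # pretend the first 4 tokens are image tokens

layers = nn.ModuleList([ToyLayer(d) for _ in range(n_layers)])
x = torch.randn(1, seq, d)       # toy multimodal embeddings

# Shared prefix: full image-text interaction through the early layers.
for layer in layers[:fork_layer]:
    x = layer(x)

# Fork: the reference branch masks attention to image-token positions.
full, ref = x, x.clone()
for layer in layers[fork_layer:]:
    full = layer(full)                                  # full visual pathway
    ref = layer(ref, key_padding_mask=image_positions)  # language-prior branch

# `full` and `ref` would feed the same LM head; their logits are contrasted
# per decoding step.
```

Note that the residual stream entering the fork already carries fused visual features, which is exactly the concern raised in the referee report below.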
What carries the argument
Shared-prefix internal reconstruction: an early multimodal interaction stage followed by a masked-image-attention branch in later layers that isolates language priors while preserving prompt interpretation and positional structure.
If this is right
- Hallucination rates drop on POPE, CHAIR, and AMBER for both Qwen2.5-VL and LLaVA-v1.5 while descriptive coverage stays intact.
- Overhead stays below that of two-pass contrastive decoding because only one forward pass plus an internal branch is required.
- No training, external verifier, or perturbed input is needed, so the method applies directly to open-weight LVLMs with white-box access.
- Suppression targets tokens whose strength persists even without late visual access, favoring outputs that depend on the full visual pathway (a candidate implementation is sketched after this list).
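A hedged sketch of that suppression step: the weighting below follows the standard contrastive-decoding convention with a VCD-style plausibility cutoff, so `alpha`, `beta`, and `contrastive_logits` are assumptions rather than SIRA's published rule.

```python
# Candidate token-level contrast step (assumed form, not the paper's equation).
import torch

def contrastive_logits(logits_full, logits_ref, alpha=1.0, beta=0.1):
    """Suppress tokens that stay strong without late visual access."""
    probs_full = logits_full.softmax(dim=-1)
    # Plausibility cutoff: only contrast over tokens the full pathway already
    # rates as plausible, so the subtraction cannot promote junk tokens.
    keep = probs_full >= beta * probs_full.max(dim=-1, keepdim=True).values
    scores = (1 + alpha) * logits_full - alpha * logits_ref
    return scores.masked_fill(~keep, float("-inf"))

# Toy usage with random logits over a 32k vocabulary.
next_token = contrastive_logits(torch.randn(1, 32000),
                                torch.randn(1, 32000)).argmax(dim=-1)
```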
Where Pith is reading between the lines
- The staged flow assumption suggests similar internal forking could be tested in other transformer-based multimodal systems where early and late layers handle distinct information types.
- If the shared-prefix stage successfully locks in prompt semantics, the same pattern might help control factual drift in text-only models facing weak evidence.
- Deployment on resource-limited devices could benefit because the internal branch replaces separate external passes.
Load-bearing premise
That masking attention to image tokens only in later layers produces a clean language-prior reference that preserves the original prompt interpretation and decoding history without introducing new artifacts.
What would settle it
Run SIRA on a benchmark where hallucinations arise mainly from early-layer visual mis-grounding rather than late-stage language dominance; under the load-bearing premise, the method would then be expected to show little or no reduction, or even an increase, in error rates.
Original abstract
Large vision-language models (LVLMs) often hallucinate when language priors dominate weak or ambiguous visual evidence. Existing contrastive decoding methods mitigate this problem by comparing predictions from the original image with those from externally perturbed visual inputs, but such references can introduce off-manifold artifacts and require costly extra forward passes. We propose SIRA, a training-free internal contrastive decoding framework that constructs a counterfactual reference inside the same LVLM by exploiting the staged information flow of multimodal transformers. Instead of removing visual information from the input, SIRA first lets image and text tokens interact through a shared prefix, forming an aligned multimodal state that preserves prompt interpretation, decoding history, positional structure, and early visual grounding. It then forks a counterfactual branch in later transformer layers, where attention to image-token positions is masked. This branch retains the shared multimodal context but lacks continued access to fine-grained visual evidence, yielding a language-prior-dominated internal reference for token-level contrast. During decoding, SIRA suppresses tokens that remain strong without late visual access and favors predictions whose advantage depends on the full visual pathway. Experiments on POPE, CHAIR, and AMBER with Qwen2.5-VL and LLaVA-v1.5 show that SIRA consistently reduces hallucinations while preserving descriptive coverage and incurring lower overhead than two-pass contrastive decoding. SIRA requires no training, external verifier, or perturbed input, and applies to open-weight LVLMs with white-box inference access.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript proposes SIRA, a training-free internal contrastive decoding framework for mitigating hallucinations in large vision-language models. It constructs a counterfactual reference by first allowing full image-text interaction through a shared prefix (preserving prompt interpretation, decoding history, and early visual grounding) and then forking a branch in later transformer layers where attention to image-token positions is masked, yielding a language-prior-dominated internal reference for token-level contrast during decoding. Experiments on POPE, CHAIR, and AMBER with Qwen2.5-VL and LLaVA-v1.5 are claimed to show consistent hallucination reduction while preserving descriptive coverage and lower overhead than two-pass contrastive decoding.
Significance. If the internal masking mechanism produces a valid language-prior reference without residual visual artifacts or off-distribution effects, SIRA would represent a meaningful efficiency gain over external-perturbation methods by eliminating extra forward passes, perturbed inputs, and external verifiers while remaining applicable to open-weight LVLMs with white-box access.
major comments (1)
- [Method (forked-branch construction)] The central mechanism (shared-prefix fusion followed by attention masking on image tokens in later layers) assumes that the forked branch yields logits dominated purely by language priors. However, because the shared prefix permits full cross-modal interaction, later-layer hidden states for text tokens carry fused visual features via residual connections and value projections rather than solely via attention; masking attention scores alone therefore cannot excise this embedded information. This raises the risk that any contrastive benefit arises from the masking artifact itself rather than from removal of visual evidence. The manuscript should include hidden-state analysis, comparison to a no-image baseline, or ablation on residual pathways to substantiate the assumption.
minor comments (2)
- [Abstract] The abstract states that experiments show consistent gains but provides no quantitative metrics, ablation details, or error analysis; including at least headline numbers (e.g., POPE accuracy deltas) would strengthen the summary.
- [Decoding procedure] Notation for the contrastive decoding step (e.g., how the internal reference logits are combined with the original branch) should be formalized with an equation for reproducibility; one plausible form is sketched below.
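For concreteness, one plausible formalization, borrowing the standard contrastive-decoding form; the symbols and the cutoff set $\mathcal{V}_t$ are assumptions, not the manuscript's notation.

```latex
% Assumed form, not the paper's equation: contrast full-pathway logits
% \ell_t^{full} against masked-branch logits \ell_t^{ref} at step t.
\[
  \tilde{\ell}_t = (1+\alpha)\,\ell_t^{\mathrm{full}} - \alpha\,\ell_t^{\mathrm{ref}},
  \qquad
  y_t \sim \operatorname{softmax}\bigl(\tilde{\ell}_t\bigr)
  \ \text{restricted to}\
  \mathcal{V}_t = \bigl\{ v : p_t^{\mathrm{full}}(v) \ge \beta \max_{w} p_t^{\mathrm{full}}(w) \bigr\}.
\]
```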
Simulated Author's Rebuttal
We thank the referee for the constructive comments on our work. We address the major comment regarding the forked-branch construction in SIRA below, and we will revise the manuscript accordingly to include additional analyses.
Point-by-point responses
Referee: [Method (forked-branch construction)] The central mechanism (shared-prefix fusion followed by attention masking on image tokens in later layers) assumes that the forked branch yields logits dominated purely by language priors. However, because the shared prefix permits full cross-modal interaction, later-layer hidden states for text tokens carry fused visual features via residual connections and value projections rather than solely via attention; masking attention scores alone therefore cannot excise this embedded information. This raises the risk that any contrastive benefit arises from the masking artifact itself rather than from removal of visual evidence. The manuscript should include hidden-state analysis, comparison to a no-image baseline, or ablation on residual pathways to substantiate the assumption.
Authors: We acknowledge that residual connections from the shared prefix do carry some fused multimodal information into the later layers. However, the attention masking in the forked branch is designed to halt further visual-token integration, allowing the branch to rely more heavily on language priors for subsequent predictions. This creates a meaningful contrast with the full visual pathway. To substantiate this, we will add a hidden-state similarity analysis between the branches, a comparison to a no-image input baseline, and an ablation removing residual pathways where possible. These additions will clarify that the contrastive benefit stems from differential visual access rather than from masking artifacts.
Revision: yes
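A minimal sketch of the promised hidden-state similarity analysis, assuming per-layer hidden states restricted to text-token positions so the three runs share shapes; `branch_similarity` and all variable names are hypothetical.

```python
# Hypothetical probe: is the masked branch closer to a genuine no-image run
# than to the full visual pathway? (Assumed analysis, not the paper's code.)
import torch
import torch.nn.functional as F

def branch_similarity(h_full, h_masked, h_noimage):
    """Per-layer cosine similarity of the masked branch to each endpoint.

    Each argument is a list of [seq, d] hidden states over the later layers,
    restricted to text-token positions. If sim(masked, no-image) exceeds
    sim(masked, full), the reference plausibly reflects language priors
    rather than a masking artifact.
    """
    rows = []
    for hf, hm, hn in zip(h_full, h_masked, h_noimage):
        rows.append((F.cosine_similarity(hm.flatten(), hf.flatten(), dim=0).item(),
                     F.cosine_similarity(hm.flatten(), hn.flatten(), dim=0).item()))
    return rows

# Toy usage: 4 later layers, 6 text tokens, width 32.
L, S, D = 4, 6, 32
runs = [[torch.randn(S, D) for _ in range(L)] for _ in range(3)]
for i, (to_full, to_noimg) in enumerate(branch_similarity(*runs)):
    print(f"layer {i}: sim(masked, full)={to_full:.3f}, "
          f"sim(masked, no-image)={to_noimg:.3f}")
```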
Circularity Check
SIRA's shared-prefix masking construction is an independent algorithmic proposal with no self-referential reductions.
Full rationale
The paper defines SIRA directly as a training-free procedure: run a shared prefix through early layers to form aligned multimodal states, then fork a later-layer branch that masks attention to image-token positions while retaining the shared context. This yields an internal language-prior reference for contrastive decoding. No equations are presented whose outputs are algebraically identical to their inputs by construction; no parameters are fitted to data and then relabeled as predictions; no load-bearing uniqueness theorem or ansatz is imported via self-citation. The central claim rests on the architectural assumption that attention masking after early fusion produces a usable counterfactual, which is tested empirically on POPE, CHAIR, and AMBER rather than derived tautologically. The derivation chain is therefore self-contained and non-circular.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: Multimodal transformers process information in stages, with early layers aligning image and text tokens and later layers refining predictions.