pith. machine review for the scientific record.

arxiv: 2604.21911 · v1 · submitted 2026-04-23 · 💻 cs.CV · cs.AI · cs.CL · cs.LG

Recognition: unknown

When Prompts Override Vision: Prompt-Induced Hallucinations in LVLMs

Authors on Pith: no claims yet

Pith reviewed 2026-05-09 22:08 UTC · model grok-4.3

classification 💻 cs.CV cs.AI cs.CL cs.LG
keywords hallucinations · large vision-language models · prompts · preference optimization · DPO · benchmarks · textual priors

The pith

Textual instructions override visual input as the main driver of hallucinations in large vision-language models.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper investigates the sources of hallucinations in LVLMs, where models generate content not supported by the provided image. It creates the HalluScope benchmark to separate the effects of vision limitations from those of language priors. The analysis concludes that hallucinations arise primarily from over-reliance on background knowledge and details supplied in the text prompt. To address this, the authors introduce HalluVL-DPO, which applies preference optimization on a new dataset so the model learns to choose visually grounded outputs over hallucinated ones. Experiments show the fine-tuned models reduce the targeted errors while maintaining results on other benchmarks.

Core claim

Hallucinations in LVLMs largely stem from excessive reliance on textual priors and background knowledge, especially information introduced through textual instructions. HalluVL-DPO mitigates these by fine-tuning off-the-shelf models with preference optimization on a curated dataset that guides responses toward visual grounding.

What carries the argument

HalluScope benchmark for isolating hallucination causes plus HalluVL-DPO preference optimization on a dataset of grounded versus hallucinated response pairs.
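The preference-optimization step is standard DPO applied to those pairs. A minimal sketch is below, assuming summed token log-probabilities from the policy and a frozen reference model are already available; the beta value and the optional per-sample weights (echoing the semantic-gap scores of Figure 3) are illustrative, not the paper's settings.

    import torch
    import torch.nn.functional as F

    def dpo_loss(policy_chosen_logps, policy_rejected_logps,
                 ref_chosen_logps, ref_rejected_logps,
                 beta=0.1, weights=None):
        # Standard DPO objective (Rafailov et al. [32]) on pairs where the
        # "chosen" response is visually grounded and the "rejected" one hallucinates.
        chosen_logratio = policy_chosen_logps - ref_chosen_logps
        rejected_logratio = policy_rejected_logps - ref_rejected_logps
        # Push the policy to prefer grounded over hallucinated responses.
        logits = beta * (chosen_logratio - rejected_logratio)
        losses = -F.logsigmoid(logits)
        if weights is not None:
            # Sample-specific weighting by semantic gap (cf. Figure 3) -- assumed form.
            losses = losses * weights
        return losses.mean()

    # Toy usage with random log-probabilities for a batch of 4 pairs:
    policy_c = torch.randn(4, requires_grad=True)
    policy_r = torch.randn(4, requires_grad=True)
    ref_c, ref_r = torch.randn(4), torch.randn(4)
    dpo_loss(policy_c, policy_r, ref_c, ref_r).backward()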

If this is right

  • Textual instructions can be treated as a controllable variable that strongly influences whether an LVLM stays grounded in the image.
  • Preference optimization on paired grounded and hallucinated responses provides a practical way to steer existing models without full retraining.
  • Releasing the benchmark, training data, and code allows systematic testing of prompt effects across different model sizes and architectures.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • Prompt engineering that minimizes background knowledge injection could serve as a lightweight complement to fine-tuning.
  • The finding raises the possibility that similar text-over-vision imbalances appear in other multimodal systems such as video or audio models.
  • If the effect holds, future model designs might incorporate explicit mechanisms to down-weight language priors during visual reasoning steps.

Load-bearing premise

The HalluScope benchmark and curated preference dataset isolate prompt-induced hallucinations without significant confounding from model architecture choices or data collection biases.

What would settle it

If an LVLM still produces the same rate of hallucinations after textual instructions are removed or replaced with neutral prompts, or if HalluVL-DPO training yields no measurable drop in the targeted errors, the claim that textual priors are the dominant cause would be falsified.
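One concrete way to run that test is to query the same model about the same known-absent objects under a presupposing prompt and a neutral one, and compare how often the object is asserted. A rough sketch follows, assuming a generic model.generate(image, prompt) wrapper and HalluScope-style records that name an absent object; the prompt wordings and the string heuristic are placeholders, not the paper's protocol.

    def hallucination_rate(model, samples, presupposing=True):
        # Fraction of images on which the model asserts an object that is not there.
        hits = 0
        for s in samples:
            obj = s["absent_obj"]  # known to be absent from s["image"]
            if presupposing:
                prompt = f"What color is the {obj} in this image?"  # presupposes presence
            else:
                prompt = f"Is there a {obj} in this image? Answer yes or no."  # neutral
            answer = model.generate(s["image"], prompt).lower()
            # Crude string heuristic for illustration: the reply mentions the object
            # without denying its presence.
            if obj in answer and not answer.startswith("no"):
                hits += 1
        return hits / len(samples)

    # If textual priors dominate, the presupposing condition should show a much
    # higher rate than the neutral one; comparable rates would undercut the claim.
    # rate_presup  = hallucination_rate(model, samples, presupposing=True)
    # rate_neutral = hallucination_rate(model, samples, presupposing=False)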

Figures

Figures reproduced from arXiv: 2604.21911 by Alasdair Newson, Arnaud Dapogny, Jayneel Parekh, Matthieu Cord, Mustafa Shukor, Pegah Khayatan.

Figure 1: Vision–language hallucination failure modes and mitigation. The model reliably recognizes present objects or the absence of random objects, but tends to hallucinate plausible yet absent (i.e., adversarial) objects, especially when the instruction presupposes their presence. Our HalluVL-DPO framework substantially mitigates this failure mode. To address this limitation, we introduce HalluScope, a diagnostic… view at source ↗
Figure 2: HalluScope construction pipeline. (A) We start by constructing a semantically diverse subset of COCO images. (B) Objects present in each image are then detected and grounded. (C) Contextually plausible but visually absent adversarial objects are identified using object co-occurrence statistics, and each image is annotated with a present object, a random absent object, and an adversarial object. (D) Finally… view at source ↗ (a co-occurrence sketch follows the figure list)
Figure 3: Sample-specific weighting based on semantic gap. From left to right, samples receive scores of 1, 2, and 3. Score 1: near rephrasings with minimal contrast. Score 2: both responses follow an incorrect presupposition (cars), though the chosen answer adds slightly more relevant details. Score 3: clear contrast, where the chosen answer correctly identifies the absence of high heels and the rejected answer ass… view at source ↗
Figure 4: Sample instances from the HalluVL-DPO training dataset. Visually grounded content is highlighted in green, while hallucinated content is shown in red. The chosen responses are more grounded than the rejected ones. keeps the distribution of training dataset responses close to the model being optimized, which is crucial for an efficient DPO [49]. See Section C.2 for further details on the pipeline. Example s… view at source ↗
Figure 5: Ablation of preference pair types (left) and data scale (right). Each colored region shows performance when one preference type is removed while keeping the total number of samples fixed at 20k. Attribute pairs contribute most uniformly across benchmarks, whereas other types primarily benefit structurally similar tasks. On the right, we evaluate LLaVA-1.5-7B on adversarial recognition and presupposition acr… view at source ↗
Figure 6: Qualitative results before and after HalluVL-DPO fine-tuning. Top-left and top-right: adversarial examples from HalluScope, showing improved recognition (top-left) and correct rejection of false presuppositions (top-right) after fine-tuning. Bottom: captioning example with reduced hallucinated content. the community, we will publicly release the HalluScope benchmark, our construction pipeline, and the Hall… view at source ↗
Figure 7: Examples of the Two-Stage Object Presence Verification Pipeline. (Left) shows a clearly visible object (umbrella) with a high confidence score returned by Grounding DINO (0.6119). (Middle) the image contains a flag with a low confidence score returned by Grounding DINO (0.4076), whose presence is correctly confirmed by the VLM. (Right) the image contains a shopping bag, and not a handbag. The relatively … view at source ↗
Figure 8: Adversary Presupposition Samples from HalluScope. After fine-tuning with HalluVL-DPO, the model correctly confirms the absence of the adversary object even when asked about its attributes. Some limitations remain, as shown in examples where the post-fine-tuning answer is highlighted in orange. The Sentence-BERT (SBERT) embeddings of candidate chosen and rejected responses are compared, and if the cosine d… view at source ↗
Figure 9: Qualitative Comparison of Caption Generation under Different Preference Pair Strategies. We compare captions generated for the same image by the original model and by models fine-tuned on description samples using three different preference pair generation strategies. The caption produced by the model trained with unilateral hint augmentation is free of hallucinations but significantly shorter than the ori… view at source ↗
Figure 10: Recognition and AdP Scores on HalluScope benchmark. Each point represents a model evaluated along two axes: recognition (average performance on the positive and negative subsets) and AdP. We compare base models, several existing baselines, and models fine-tuned using our generated preference datasets, including both same-model and cross-model training. For models trained with HalluVL-DPO, the subscript in… view at source ↗
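Step (C) of Figure 2 selects adversarial objects from co-occurrence statistics. A minimal sketch of that idea over COCO-style per-image object lists is below; raw co-occurrence counts stand in for the paper's statistics (the Church and Hanks reference [6] suggests a PMI-style score), and the toy corpus and top-k cutoff are illustrative.

    from collections import Counter
    from itertools import combinations

    def mine_adversarial_objects(present, corpus, top_k=1):
        # present: set of object names visible in the target image.
        # corpus: list of per-image object sets (e.g., from COCO annotations).
        cooc = Counter()
        for objs in corpus:
            for a, b in combinations(sorted(objs), 2):
                cooc[(a, b)] += 1
                cooc[(b, a)] += 1
        vocab = set().union(*corpus)
        # Score each absent object by how strongly it co-occurs with what is present;
        # the highest scorers are the "plausible but absent" adversarial candidates.
        scores = {cand: sum(cooc[(cand, p)] for p in present)
                  for cand in vocab - present}
        return sorted(scores, key=scores.get, reverse=True)[:top_k]

    # Toy corpus: a kitchen scene makes "oven" a tempting hallucination target.
    corpus = [{"sink", "oven", "refrigerator"}, {"sink", "oven"}, {"dog", "frisbee"}]
    print(mine_adversarial_objects({"sink", "refrigerator"}, corpus))  # ['oven']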
read the original abstract

Despite impressive progress in capabilities of large vision-language models (LVLMs), these systems remain vulnerable to hallucinations, i.e., outputs that are not grounded in the visual input. Prior work has attributed hallucinations in LVLMs to factors such as limitations of the vision backbone or the dominance of the language component, yet the relative importance of these factors remains unclear. To resolve this ambiguity, we propose HalluScope, a benchmark to better understand the extent to which different factors induce hallucinations. Our analysis indicates that hallucinations largely stem from excessive reliance on textual priors and background knowledge, especially information introduced through textual instructions. To mitigate hallucinations induced by textual instruction priors, we propose HalluVL-DPO, a framework for fine-tuning off-the-shelf LVLMs towards more visually grounded responses. HalluVL-DPO leverages preference optimization using a curated training dataset that we construct, guiding the model to prefer grounded responses over hallucinated ones. We demonstrate that our optimized model effectively mitigates the targeted hallucination failure mode, while preserving or improving performance on other hallucination benchmarks and visual capability evaluations. To support reproducibility and further research, we will publicly release our evaluation benchmark, preference training dataset, and code at https://pegah-kh.github.io/projects/prompts-override-vision/.

Editorial analysis

A structured set of objections, weighed in public.

Referee report, simulated authors' rebuttal, circularity audit, and an axiom and free-parameter ledger. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces HalluScope, a benchmark to analyze factors inducing hallucinations in LVLMs, concluding that these largely stem from excessive reliance on textual priors and background knowledge introduced via instructions rather than vision backbone limitations. It proposes HalluVL-DPO, a preference optimization framework using a curated dataset of grounded vs. hallucinated response pairs to fine-tune off-the-shelf LVLMs, with experiments showing mitigation of the targeted failure mode while preserving or improving performance on other hallucination benchmarks and visual tasks. The benchmark, dataset, and code are to be released publicly.

Significance. If the isolation of textual priors holds, the work offers a clear empirical decomposition of hallucination sources in LVLMs and a practical, targeted mitigation via DPO that avoids broad capability degradation. The open release of HalluScope and the preference dataset provides reusable artifacts for the community, strengthening reproducibility and enabling follow-on studies on prompt engineering and multimodal alignment.

major comments (2)
  1. [§3 (HalluScope construction)] The benchmark description does not report ablations or statistics confirming that visual stimuli are uncorrelated with common vision-encoder failure modes (e.g., object occlusion, fine-grained detail, or low-contrast images). Without such controls or a comparison of vision-only performance on the same images, the central attribution of hallucinations to textual priors cannot be cleanly separated from vision backbone confounds.
  2. [§4.2 (preference dataset curation)] The procedure for generating and labeling the preference pairs is not detailed with respect to sampling strategy, human annotation guidelines, or checks against language-model prior leakage. If the curation inadvertently selects responses that align with particular textual biases, the claim that HalluVL-DPO specifically counters prompt-induced hallucinations becomes circular.
minor comments (2)
  1. [Results section] The abstract states that the model 'preserves or improves performance on other hallucination benchmarks,' but the main text should include a table with exact delta values and statistical significance for each baseline comparison.
  2. [§2] Notation for 'textual priors' vs. 'background knowledge' is used interchangeably in places; a single consistent definition in §2 would improve clarity.

Simulated Authors' Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. The comments highlight important aspects of benchmark construction and dataset curation that warrant clarification to strengthen the isolation of textual priors as a hallucination source. We address each point below and commit to revisions that incorporate the suggested controls and details without altering the core claims or experimental findings.

read point-by-point responses
  1. Referee: §3 (HalluScope construction): the benchmark description does not report ablations or statistics confirming that visual stimuli are uncorrelated with common vision-encoder failure modes (e.g., object occlusion, fine-grained detail, or low-contrast images). Without such controls or a comparison of vision-only performance on the same images, the central attribution of hallucinations to textual priors cannot be cleanly separated from vision backbone confounds.

    Authors: We acknowledge that explicit controls would further strengthen the separation of factors. HalluScope was built from standard datasets (COCO, Visual Genome) with images pre-filtered for prominent, unambiguous objects to reduce vision confounds, but these selection criteria were not quantified in the original submission. In the revision we will add: (i) summary statistics on image properties including average contrast, occlusion frequency, and fine-grained detail scores; (ii) a vision-only baseline comparison (e.g., CLIP or BLIP captioning accuracy) on the identical HalluScope images to show that the vision backbone succeeds on these stimuli when textual priors are absent. These additions will directly support the attribution to textual instructions. revision: yes

  2. Referee: §4.2 (preference dataset curation): the procedure for generating and labeling the preference pairs is not detailed with respect to sampling strategy, human annotation guidelines, or checks against language-model prior leakage. If the curation inadvertently selects responses that align with particular textual biases, the claim that HalluVL-DPO specifically counters prompt-induced hallucinations becomes circular.

    Authors: We agree that expanded methodological detail is required to rule out circularity. The original §4.2 described the high-level construction of grounded vs. hallucinated pairs but omitted granular procedures. In the revised manuscript we will specify: the exact sampling strategy (prompt templates used to elicit hallucinations while keeping the image fixed); the human annotation guidelines (explicit criteria for grounding, hallucination types, and resolution of disagreements); and leakage checks (comparison of preference labels against outputs from a text-only LLM on the same prompts, plus distribution analysis to confirm visual grounding is the differentiating factor). These clarifications will demonstrate that the dataset targets prompt-induced failures rather than generic textual biases. revision: yes
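A vision-only control of the kind proposed in response (1) above could be as simple as scoring each HalluScope image against text prompts for the present and adversarial objects with an off-the-shelf CLIP and checking that the present object wins. The sketch below works under those assumptions; the model choice, prompt template, and decision rule are not the paper's.

    import torch
    from PIL import Image
    from transformers import CLIPModel, CLIPProcessor

    clip = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
    processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

    def vision_prefers_present(image_path, present_obj, adversarial_obj):
        # True if CLIP ranks the truly present object above the adversarial one.
        # If the vision side already separates the two on these images, later
        # hallucinations are harder to attribute to the vision backbone alone.
        image = Image.open(image_path).convert("RGB")
        texts = [f"a photo of a {present_obj}", f"a photo of a {adversarial_obj}"]
        inputs = processor(text=texts, images=image, return_tensors="pt", padding=True)
        with torch.no_grad():
            sims = clip(**inputs).logits_per_image[0]  # similarity to each text
        return bool(sims[0] > sims[1])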
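For response (2), one filter of the kind already visible in Figure 8's caption is an SBERT comparison of candidate chosen and rejected responses, dropping pairs that are near-duplicates and therefore carry no grounding contrast. A minimal sketch, with the checkpoint and cosine threshold as assumptions rather than the paper's values:

    from sentence_transformers import SentenceTransformer, util

    sbert = SentenceTransformer("all-MiniLM-L6-v2")  # any SBERT checkpoint works

    def keep_pair(chosen, rejected, max_cosine=0.95):
        # Keep a preference pair only if the two responses actually contrast;
        # near-identical pairs carry no grounding signal for DPO.
        emb = sbert.encode([chosen, rejected], convert_to_tensor=True)
        return util.cos_sim(emb[0], emb[1]).item() < max_cosine

    print(keep_pair("There is no bicycle in the image.",
                    "The bicycle is red and parked by the door."))  # likely True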

Circularity Check

0 steps flagged

No significant circularity; empirical benchmark and dataset are independently constructed.

full rationale

The paper introduces HalluScope as a new benchmark and a curated preference dataset for HalluVL-DPO, then reports empirical findings on hallucination sources. No equations, fitted parameters renamed as predictions, or self-citation chains appear in the derivation of the central claim. The attribution to textual priors rests on controlled variation within the newly proposed artifacts rather than reducing to prior self-referential results or definitions. This is a standard empirical contribution with external reproducibility artifacts, so the derivation chain remains self-contained.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Based solely on the abstract, the paper does not introduce or rely on explicit free parameters, unstated axioms, or new invented entities; the contribution is empirical construction of benchmark and dataset.

pith-pipeline@v0.9.0 · 5557 in / 1149 out tokens · 32972 ms · 2026-05-09T22:08:10.190412+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

52 extracted references · 19 canonical work pages · 6 internal anchors

  1. [1] An, W., Tian, F., Leng, S., Nie, J., Lin, H., Wang, Q., Dai, G., Chen, P., Lu, S.: AGLA: Mitigating object hallucinations in large vision-language models with assembly of global and local attention. arXiv preprint arXiv:2406.12718 (2024)

  2. [2] Bai, Z., Wang, P., Xiao, T., He, T., Han, Z., Zhang, Z., Shou, M.Z.: Hallucination of multimodal large language models: A survey. arXiv preprint arXiv:2404.18930 (2024)

  3. [3] Baldassini, F.B., Shukor, M., Cord, M., Soulier, L., Piwowarski, B.: What makes multimodal in-context learning work? In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops. pp. 1539–1550 (June 2024)

  4. [4] Cai, J., Zhu, J., Sun, R., Wang, Y., Li, L., Zhou, W., Li, H.: Disentangling length bias in preference learning via response-conditioned modeling. arXiv preprint arXiv:2502.00814 (2025)

  5. [5] Cho, Y., Kim, K., Hwang, T., Cho, S.: Do you keep an eye on what I ask? Mitigating multimodal hallucination via attention-guided ensemble decoding. In: The Thirteenth International Conference on Learning Representations (2025), https://openreview.net/forum?id=ziw5bzg2NO

  6. [6] Church, K.W., Hanks, P.: Word association norms, mutual information, and lexicography. Computational Linguistics (1990)

  7. [7] Esser, P., Kulal, S., Blattmann, A., Entezari, R., Müller, J., Saini, H., Levi, Y., Lorenz, D., Sauer, A., Boesel, F., et al.: Scaling rectified flow transformers for high-resolution image synthesis. In: Forty-first International Conference on Machine Learning (2024)

  8. [8] Fu, C., Chen, P., Shen, Y., Qin, Y., Zhang, M., Lin, X., Yang, J., Zheng, X., Li, K., Sun, X., Wu, Y., Ji, R., Shan, C., He, R.: MME: A comprehensive evaluation benchmark for multimodal large language models (2025), https://arxiv.org/abs/2306.13394

  9. [9] Fu, Y., Xie, R., Sun, X., Kang, Z., Li, X.: Mitigating hallucination in multimodal large language model via hallucination-targeted direct preference optimization. In: Findings of the Association for Computational Linguistics: ACL 2025. pp. 16563–16577 (2025)

  10. [10] Guan, T., Liu, F., Wu, X., Xian, R., Li, Z., Liu, X., Wang, X., Chen, L., Huang, F., Yacoob, Y., et al.: HallusionBench: An advanced diagnostic suite for entangled language hallucination and visual illusion in large vision-language models. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 14375–14385 (2024)

  11. [11] Hu, E.J., Shen, Y., Wallis, P., Allen-Zhu, Z., Li, Y., Wang, S., Wang, L., Chen, W., et al.: LoRA: Low-rank adaptation of large language models. ICLR 1(2), 3 (2022)

  12. [12] Huang, Q., Dong, X., Zhang, P., Wang, B., He, C., Wang, J., Lin, D., Zhang, W., Yu, N.: OPERA: Alleviating hallucination in multi-modal large language models via over-trust penalty and retrospection-allocation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 13418–13427 (2024)

  13. [13] Kang, S., Kim, J., Kim, J., Hwang, S.J.: See what you are told: Visual attention sink in large multimodal models. In: The Thirteenth International Conference on Learning Representations (2025), https://openreview.net/forum?id=7uDI7w5RQA

  14. [14] Laurençon, H., Tronchon, L., Cord, M., Sanh, V.: What matters when building vision-language models? Advances in Neural Information Processing Systems 37, 87874–87907 (2024)

  15. [15] Leng, S., Zhang, H., Chen, G., Li, X., Lu, S., Miao, C., Bing, L.: Mitigating object hallucinations in large vision-language models through visual contrastive decoding. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 13872–13882 (2024)

  16. [16] Li, W., Huang, Z., Li, H., Lu, L., Lu, Y., Tian, X., Shen, X., Ye, J.: Visual evidence prompting mitigates hallucinations in large vision-language models. In: Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). Association for Computational Linguistics (2025), https://aclanthology.org/2025.acl-...

  17. [17] Li, Y., Du, Y., Zhou, K., Wang, J., Zhao, W.X., Wen, J.: Evaluating object hallucination in large vision-language models. In: Bouamor, H., Pino, J., Bali, K. (eds.) Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, EMNLP 2023, Singapore, December 6-10, 2023. pp. 292–305. Association for Computational Linguistics (2023)

  18. [18] Lin, T.Y., Maire, M., Belongie, S., Hays, J., Perona, P., Ramanan, D., Dollár, P., Zitnick, C.L.: Microsoft COCO: Common objects in context. In: Computer Vision – ECCV 2014: 13th European Conference, Zurich, Switzerland, September 6-12, 2014, Proceedings, Part V. pp. 740–755. Springer (2014)

  19. [19] Liu, F., Lin, K., Li, L., Wang, J., Yacoob, Y., Wang, L.: Mitigating hallucination in large multi-modal models via robust instruction tuning. In: The Twelfth International Conference on Learning Representations (2024), https://openreview.net/forum?id=J44HfH4JCg

  20. [20] Liu, H., Li, C., Wu, Q., Lee, Y.J.: Visual instruction tuning. Advances in Neural Information Processing Systems 36 (2023)

  21. [21] Liu, S., Ye, H., Xing, L., Zou, J.: Reducing hallucinations in vision-language models via latent space steering. arXiv preprint arXiv:2410.15778 (2024)

  22. [22] Liu, S., Zheng, K., Chen, W.: Paying more attention to image: A training-free method for alleviating hallucination in LVLMs (2024), https://arxiv.org/abs/2407.21771

  23. [23] Liu, S., Zeng, Z., Ren, T., Li, F., Zhang, H., Yang, J., Jiang, Q., Li, C., Yang, J., Su, H., et al.: Grounding DINO: Marrying DINO with grounded pre-training for open-set object detection. In: European Conference on Computer Vision. pp. 38–55. Springer (2024)

  24. [24] Liu, Y., Ji, T., Sun, C., Wu, Y., Zhou, A.: Investigating and mitigating object hallucinations in pretrained vision-language (CLIP) models. In: Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing. pp. 18288–18301. Association for Computational Linguistics (2024), https://aclanthology.org/2024.emnlp-main.1016/

  25. [25] Liu, Z., Zang, Y., Dong, X., Zhang, P., Cao, Y., Duan, H., He, C., Xiong, Y., Lin, D., Wang, J.: MIA-DPO: Multi-image augmented direct preference optimization for large vision-language models. arXiv preprint arXiv:2410.17637 (2024)

  26. [26] Lu, P., Mishra, S., Xia, T., Qiu, L., Chang, K.W., Zhu, S.C., Tafjord, O., Clark, P., Kalyan, A.: Learn to explain: Multimodal reasoning via thought chains for science question answering. In: The 36th Conference on Neural Information Processing Systems (NeurIPS) (2022)

  27. [27] Lu, Y., Zhang, Z., Yuan, C., Gao, J., Zhang, C., Qi, X., Li, B., Hu, W.: Mitigating hallucinations in large vision-language models by self-injecting hallucinations. arXiv preprint arXiv:2509.11287 (2025)

  28. [28] Parekh, J., Khayatan, P., Shukor, M., Dapogny, A., Newson, A., Cord, M.: Learning to steer: Input-dependent steering for multimodal LLMs. Advances in Neural Information Processing Systems (2025)

  29. [29] Petryk, S., Chan, D., Kachinthaya, A., Zou, H., Canny, J., Gonzalez, J., Darrell, T.: ALOHa: A new measure for hallucination in captioning models. In: Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 2: Short Papers). pp. 342–357 (2024)

  30. [30] Pi, R., Han, T., Xiong, W., Zhang, J., Liu, R., Pan, R., Zhang, T.: Strengthening multimodal large language model with bootstrapped preference optimization. In: European Conference on Computer Vision. pp. 382–398. Springer (2024)

  31. [31] Radford, A., Wu, J., Child, R., Luan, D., Amodei, D., Sutskever, I.: Language models are unsupervised multitask learners (2019)

  32. [32] Rafailov, R., Sharma, A., Mitchell, E., Manning, C.D., Ermon, S., Finn, C.: Direct preference optimization: Your language model is secretly a reward model. Advances in Neural Information Processing Systems 36, 53728–53741 (2023)

  33. [33] Reimers, N., Gurevych, I.: Sentence-BERT: Sentence embeddings using Siamese BERT-networks. arXiv preprint arXiv:1908.10084 (2019)

  34. [34] Rohrbach, A., Hendricks, L.A., Burns, K., Darrell, T., Saenko, K.: Object hallucination in image captioning. In: Conference on Empirical Methods in Natural Language Processing (2018), https://api.semanticscholar.org/CorpusID:52176506

  35. [35] Sharma, P., Ding, N., Goodman, S., Soricut, R.: Conceptual captions: A cleaned, hypernymed, image alt-text dataset for automatic image captioning. In: Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). pp. 2556–2565 (2018)

  36. [36] Shukor, M., Rame, A., Dancette, C., Cord, M.: Beyond task performance: Evaluating and reducing the flaws of large multimodal models with in-context-learning. In: The Twelfth International Conference on Learning Representations (2024), https://openreview.net/forum?id=mMaQvkMzDi

  37. [37] Singhal, P., Goyal, T., Xu, J., Durrett, G.: A long way to go: Investigating length correlations in RLHF. arXiv preprint arXiv:2310.03716 (2023)

  38. [38] Sun, Z., Shen, S., Cao, S., Liu, H., Li, C., Shen, Y., Gan, C., Gui, L., Wang, Y.X., Yang, Y., Keutzer, K., Darrell, T.: Aligning large multimodal models with factually augmented RLHF. arXiv preprint arXiv:2309.14525 (2023), https://api.semanticscholar.org/CorpusID:262824780

  39. [39] Wang, C., Chen, X., Zhang, N., Tian, B., Xu, H., Deng, S., Chen, H.: MLLM can see? Dynamic correction decoding for hallucination mitigation. arXiv preprint arXiv:2410.11779 (2024)

  40. [40] Wang, K., Gu, H., Gao, M., Zhou, K.: DAMO: Decoding by accumulating activations momentum for mitigating hallucinations in vision-language models. In: The Thirteenth International Conference on Learning Representations

  41. [41] Wang, P., Bai, S., Tan, S., Wang, S., Fan, Z., Bai, J., Chen, K., Liu, X., Wang, J., Ge, W., et al.: Qwen2-VL: Enhancing vision-language model's perception of the world at any resolution. arXiv preprint arXiv:2409.12191 (2024)

  42. [42] Wang, W., Gao, Z., Gu, L., Pu, H., Cui, L., Wei, X., Liu, Z., Jing, L., Ye, S., Shao, J., et al.: InternVL3.5: Advancing open-source multimodal models in versatility, reasoning, and efficiency. arXiv preprint arXiv:2508.18265 (2025)

  43. [43] Wu, Y., Zhang, L., Yao, H., Du, J., Yan, K., Ding, S., Wu, Y., Li, X.: Antidote: A unified framework for mitigating LVLM hallucinations in counterfactual presupposition and object perception. In: Proceedings of the Computer Vision and Pattern Recognition Conference. pp. 14646–14656 (2025)

  44. [44] Xiao, B., Wu, H., Xu, W., Dai, X., Hu, H., Lu, Y., Zeng, M., Liu, C., Yuan, L.: Florence-2: Advancing a unified representation for a variety of vision tasks. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 4818–4829 (2024)

  45. [45] Xie, Y., Li, G., Xu, X., Kan, M.Y.: V-DPO: Mitigating hallucination in large vision language models via vision-guided direct preference optimization. arXiv preprint arXiv:2411.02712 (2024)

  46. [46] Yang, A., Li, A., Yang, B., Zhang, B., Hui, B., Zheng, B., Yu, B., Gao, C., Huang, C., Lv, C., et al.: Qwen3 technical report. arXiv preprint arXiv:2505.09388 (2025)

  47. [47] Yang, T., Li, Z., Cao, J., Xu, C.: Mitigating hallucination in large vision-language models via modular attribution and intervention. In: The Thirteenth International Conference on Learning Representations (2025), https://openreview.net/forum?id=Bjq4W7P2Us

  48. [48] Yang, W., Qiu, X., Yu, L., Zhang, Y., Yang, O.A., Kokhlikyan, N., Cancedda, N., Garcia-Olano, D.: Hallucination reduction with CASAL: Contrastive activation steering for amortized learning. arXiv preprint arXiv:2510.02324 (2025)

  49. [49] Yang, Z., Luo, X., Han, D., Xu, Y., Li, D.: Mitigating hallucinations in large vision-language models via DPO: On-policy data hold the key. In: Proceedings of the Computer Vision and Pattern Recognition Conference. pp. 10610–10620 (2025)

  50. [50] Yu, W., Yang, Z., Li, L., Wang, J., Lin, K., Liu, Z., Wang, X., Wang, L.: MM-Vet: Evaluating large multimodal models for integrated capabilities. In: International Conference on Machine Learning. PMLR (2024)

  51. [51] Zhao, Z., Wang, B., Ouyang, L., Dong, X., Wang, J., He, C.: Beyond hallucinations: Enhancing LVLMs through hallucination-aware direct preference optimization (2023)

  52. [52] Zhou, Y., Cui, C., Rafailov, R., Finn, C., Yao, H.: Aligning modalities in vision large language models via preference fine-tuning. arXiv preprint arXiv:2402.11411 (2024)