pith. machine review for the scientific record.

arxiv: 2402.00253 · v2 · submitted 2024-02-01 · 💻 cs.CV · cs.CL · cs.LG

Recognition: 2 theorem links

A Survey on Hallucination in Large Vision-Language Models

Dapeng Chen, Hanchao Liu, Ke Wang, Liping Hou, Rongjun Li, Wei Peng, Wenyuan Xue, Xiutian Zhao, Yifei Chen

Pith reviewed 2026-05-13 22:05 UTC · model grok-4.3

classification 💻 cs.CV cs.CL cs.LG
keywords hallucination · large vision-language models · LVLMs · multimodal models · evaluation benchmarks · mitigation methods · root causes · survey

The pith

Large vision-language models generate text that conflicts with input images, and this survey defines the problem while reviewing its symptoms, benchmarks, causes, and fixes.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper defines hallucinations in LVLMs as misalignment between factual visual content and generated text. It presents a range of symptoms and notes challenges specific to these models that differ from text-only cases. The survey reviews tailored benchmarks and evaluation methods, then traces root causes to training data and model components. It examines current mitigation approaches and ends with open questions for future work. A reader would care because fixing these errors is necessary for trustworthy use of multimodal systems in tasks like visual question answering.

Core claim

The survey establishes an overview of LVLM hallucinations by clarifying the concept and its symptoms, outlining benchmarks and evaluation methodologies, investigating root causes in training data and model components, critically reviewing mitigation methods, and discussing open questions and future directions intended to guide subsequent mitigation work.

What carries the argument

A structured overview that begins by clarifying the concept and the variety of symptoms, then covers evaluation benchmarks, root-cause analysis across training data and model components, and a review of mitigation methods.

If this is right

  • The benchmarks enable standardized measurement of hallucination rates across different LVLMs (see the metric sketch after this list).
  • Root-cause insights support targeted changes to training data and model architecture.
  • Reviewed mitigation methods supply concrete starting points for reducing image-text conflicts.
  • The open questions direct subsequent research toward more reliable multimodal generation.
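
To make the first payoff concrete: below is a minimal sketch, not taken from the paper, of a CHAIR-style object-hallucination metric in the spirit of Rohrbach et al. (reference 38 in the list below). It assumes simplified inputs, namely the set of objects actually in each image and the objects mentioned in the generated text, and the helper name chair_scores is illustrative rather than an API from any surveyed benchmark.

    def chair_scores(samples):
        # samples: list of (ground_truth_objects, mentioned_objects) pairs,
        # where both elements are collections of object names (strings).
        total_mentions = 0
        hallucinated_mentions = 0
        outputs_with_hallucination = 0

        for gt_objects, mentioned in samples:
            gt = set(gt_objects)
            mentioned = list(mentioned)
            hallucinated = [obj for obj in mentioned if obj not in gt]
            total_mentions += len(mentioned)
            hallucinated_mentions += len(hallucinated)
            if hallucinated:
                outputs_with_hallucination += 1

        # CHAIR_i: share of mentioned objects that are not in the image.
        chair_i = hallucinated_mentions / max(total_mentions, 1)
        # CHAIR_s: share of outputs containing at least one such object.
        chair_s = outputs_with_hallucination / max(len(samples), 1)
        return chair_i, chair_s

    demo = [
        ({"dog", "ball", "grass"}, ["dog", "frisbee"]),   # "frisbee" is hallucinated
        ({"person", "bicycle"}, ["person", "bicycle"]),   # faithful output
    ]
    print(chair_scores(demo))  # -> (0.25, 0.5)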

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • The survey's categorization could be adapted to study similar errors in newer multimodal systems that add audio or video inputs.
  • Applying the evaluation methods to the most recent LVLMs would test whether the overview remains current.
  • Reducing these hallucinations may improve safety in applications where models guide decisions from visual input.

Load-bearing premise

Hallucinations can be consistently defined and isolated from other error types, such as reasoning failures, across diverse LVLM architectures and tasks.

What would settle it

A detailed error analysis across multiple LVLMs that shows visual misalignment cases overlap substantially with reasoning failures and resist consistent isolation would undermine the survey's separation of concepts.

read the original abstract

Recent development of Large Vision-Language Models (LVLMs) has attracted growing attention within the AI landscape for its practical implementation potential. However, "hallucination", or more specifically, the misalignment between factual visual content and corresponding textual generation, poses a significant challenge of utilizing LVLMs. In this comprehensive survey, we dissect LVLM-related hallucinations in an attempt to establish an overview and facilitate future mitigation. Our scrutiny starts with a clarification of the concept of hallucinations in LVLMs, presenting a variety of hallucination symptoms and highlighting the unique challenges inherent in LVLM hallucinations. Subsequently, we outline the benchmarks and methodologies tailored specifically for evaluating hallucinations unique to LVLMs. Additionally, we delve into an investigation of the root causes of these hallucinations, encompassing insights from the training data and model components. We also critically review existing methods for mitigating hallucinations. The open questions and future directions pertaining to hallucinations within LVLMs are discussed to conclude this survey.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

1 major / 3 minor

Summary. The paper is a literature survey on hallucinations in Large Vision-Language Models (LVLMs). It defines hallucinations as misalignment between factual visual content and generated text, reviews symptoms and LVLM-specific challenges, catalogs benchmarks and evaluation methods, examines root causes tied to training data and model architecture, surveys mitigation techniques, and outlines open questions for future work.

Significance. A well-structured survey that accurately synthesizes the existing literature would provide a useful reference point for researchers entering the area of LVLM reliability, helping to organize disparate findings on causes and mitigations.

major comments (1)
  1. The central definition of hallucination as visual-textual misalignment (stated in the abstract and introduction) risks conflating distinct error types; the survey should include an explicit discussion, with examples from cited benchmarks, of how this definition is operationalized to separate hallucinations from reasoning failures or captioning inaccuracies.
minor comments (3)
  1. The abstract would benefit from a brief statement of the number of papers reviewed and the time window covered to convey the survey's scope.
  2. In the mitigation section, ensure that each reviewed method is accompanied by a short note on the specific LVLM architectures or datasets on which it was evaluated.
  3. The future-directions paragraph could list two or three concrete, falsifiable research questions rather than high-level topics.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for their constructive feedback and positive recommendation. We address the major comment below and will revise the manuscript to incorporate the requested clarification.

read point-by-point responses
  1. Referee: The central definition of hallucination as visual-textual misalignment (stated in the abstract and introduction) risks conflating distinct error types; the survey should include an explicit discussion, with examples from cited benchmarks, of how this definition is operationalized to separate hallucinations from reasoning failures or captioning inaccuracies.

    Authors: We appreciate this observation and agree that an explicit discussion would strengthen the paper. While our definition centers on misalignment between factual visual content and generated text (as the core phenomenon surveyed), we acknowledge the value in delineating it from related errors. In the revised manuscript, we will add a dedicated paragraph in Section 2 (Definition and Symptoms) that operationalizes the definition. We will draw on examples from the benchmarks catalogued in Section 3, such as POPE (which isolates object-existence hallucinations via binary visual queries, separate from reasoning) and HallusionBench (which includes controlled tests for visual illusions versus logical reasoning failures). We will also note that pure captioning inaccuracies (e.g., stylistic omissions in standard image-caption datasets) fall outside our scope unless they involve factual visual misalignment. This addition will clarify boundaries without changing the survey's scope or central thesis. revision: yes
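
To make the operationalization above concrete, here is a minimal sketch, not drawn from the paper or from the POPE codebase, of a POPE-style binary existence probe: the model answers balanced yes/no questions about objects annotated as present and about sampled negatives, and a false "yes" on an absent object counts as a perception-level hallucination independent of any multi-step reasoning. The ask callable is an assumed stand-in for whatever LVLM interface is under test, not a real API.

    def pope_probe(ask, image, present_objects, absent_objects):
        # ask(image, question) -> free-form answer string (assumed interface).
        # present_objects: objects annotated as present in the image.
        # absent_objects:  sampled negatives (random, popular, or co-occurring,
        #                  mirroring the POPE negative-sampling settings).
        present = list(present_objects)
        absent = list(absent_objects)
        queries = [(obj, "yes") for obj in present] + [(obj, "no") for obj in absent]

        correct = 0
        false_yes = 0
        for obj, expected in queries:
            raw = ask(image, f"Is there a {obj} in the image?").strip().lower()
            answer = "yes" if raw.startswith("yes") else "no"
            if answer == expected:
                correct += 1
            elif expected == "no":   # the model asserts an object that is not there
                false_yes += 1

        accuracy = correct / max(len(queries), 1)
        hallucination_rate = false_yes / max(len(absent), 1)
        return accuracy, hallucination_rate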

Circularity Check

0 steps flagged

No significant circularity in this literature survey

full rationale

This is a survey paper that reviews external literature on LVLM hallucinations without advancing original derivations, equations, fitted parameters, or testable predictions. The central claim is to provide an overview of symptoms, benchmarks, causes, and mitigation methods drawn from cited prior work. No load-bearing step reduces by construction to the paper's own inputs or self-citations; definitions and classifications are presented as syntheses of existing research rather than self-referential constructs. The paper explicitly positions itself as a review to facilitate future work, with no self-contained formal chain that could exhibit circularity.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

As a survey paper, no new free parameters, axioms, or invented entities are introduced; the content reviews concepts, benchmarks, and methods from the existing literature on large vision-language models.

pith-pipeline@v0.9.0 · 5491 in / 1004 out tokens · 36857 ms · 2026-05-13T22:05:38.517168+00:00 · methodology

discussion (0)


Forward citations

Cited by 22 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. LLM-as-Judge Framework for Evaluating Tone-Induced Hallucination in Vision-Language Models

    cs.CV 2026-04 unverdicted novelty 7.0

    Ghost-100 benchmark shows prompt tone drives hallucination rates and intensities in VLMs, with non-monotonic peaks at intermediate pressure and task-specific differences that aggregate metrics hide.

  2. Mitigating Action-Relation Hallucinations in LVLMs via Relation-aware Visual Enhancement

    cs.CV 2026-05 unverdicted novelty 6.0

    A new attention-enhancement method using ARS scores and RVE reduces action-relation hallucinations in LVLMs while generalizing to spatial and object hallucinations.

  3. Vocabulary Hijacking in LVLMs: Unveiling Critical Attention Heads by Excluding Inert Tokens to Mitigate Hallucination

    cs.MM 2026-05 unverdicted novelty 6.0

    LVLMs show vocabulary hijacking by inert tokens that decode to hijacking anchors; HABI locates them, NHAR finds resilient heads, and HAVAE boosts those heads to cut hallucinations.

  4. Through the Lens of Character: Resolving Modality-Role Interference in Multimodal Role-Playing Agent

    cs.CV 2026-05 unverdicted novelty 6.0

    CAVI framework uses character-guided token pruning, orthogonal feature modulation, and modality-adaptive role steering to resolve modality-role interference in multimodal RPAs.

  5. CAST: Mitigating Object Hallucination in Large Vision-Language Models via Caption-Guided Visual Attention Steering

    cs.CV 2026-05 unverdicted novelty 6.0

    CAST reduces object hallucination in LVLMs by 6.03% on average across five models and five benchmarks by identifying caption-sensitive attention heads and applying optimized steering directions to their outputs, with ...

  6. Mitigating Multimodal LLMs Hallucinations via Relevance Propagation at Inference Time

    cs.LG 2026-05 unverdicted novelty 6.0

    LIME reduces hallucinations in multimodal LLMs by using LRP to boost perceptual modality contributions through inference-time KV updates.

  7. Persistent Visual Memory: Sustaining Perception for Deep Generation in LVLMs

    cs.CV 2026-05 unverdicted novelty 6.0

    PVM adds a parallel branch to LVLMs that directly supplies visual embeddings to prevent attention decay over long generated sequences, yielding accuracy gains on reasoning tasks with minimal overhead.

  8. Online Self-Calibration Against Hallucination in Vision-Language Models

    cs.CV 2026-05 unverdicted novelty 6.0

    OSCAR exploits the generative-discriminative gap in LVLMs to build online preference data with MCTS and dual-granularity rewards for DPO-based calibration, claiming SOTA hallucination reduction and improved multimodal...

  9. SceneCritic: A Symbolic Evaluator for 3D Indoor Scene Synthesis

    cs.CV 2026-04 unverdicted novelty 6.0

    SceneCritic is a symbolic, ontology-grounded evaluator for floor-plan layouts that identifies specific semantic, orientation, and geometric violations and aligns better with human judgments than VLM-based scorers.

  10. HTDC: Hesitation-Triggered Differential Calibration for Mitigating Hallucination in Large Vision-Language Models

    cs.CV 2026-04 unverdicted novelty 6.0

    HTDC mitigates hallucinations in LVLMs by triggering calibration only at hesitation-prone decoding steps via contrasts with visual-nullification and semantic-nullification probes.

  11. VLMaterial: Vision-Language Model-Based Camera-Radar Fusion for Physics-Grounded Material Identification

    eess.SP 2026-04 unverdicted novelty 6.0

    VLMaterial fuses VLMs and physics-based radar analysis via PRCA extraction and context-augmented generation to reach 96.08% material identification accuracy on 41 everyday objects without task-specific training.

  12. Not Blind but Silenced: Rebalancing Vision and Language via Adversarial Counter-Commonsense Equilibrium

    cs.CV 2026-05 unverdicted novelty 5.0

    ACE uses adversarial counter-commonsense perturbations on image tokens during decoding to suppress hallucinated linguistic priors while preserving stable visual signals in MLLMs.

  13. Illusion-Aware Visual Preprocessing and Anti-Illusion Prompting for Classic Illusion Understanding in Vision-Language Models

    cs.CV 2026-05 conditional novelty 5.0

    A combination of illusion-specific image transformations, anti-illusion prompts, and majority voting lets VLMs reach 90.48% accuracy on a 630-image illusion benchmark without any model training.

  14. Perceptual Flow Network for Visually Grounded Reasoning

    cs.CV 2026-05 unverdicted novelty 5.0

    PFlowNet decouples perception from reasoning, integrates multi-dimensional rewards with vicinal geometric shaping via variational RL, and reports new SOTA results on V* Bench (90.6%) and MME-RealWorld-lite (67.0%).

  15. Persistent Visual Memory: Sustaining Perception for Deep Generation in LVLMs

    cs.CV 2026-05 unverdicted novelty 5.0

    PVM adds a parallel learnable branch to LVLMs that supplies visual embeddings on demand to structurally prevent attention decay and visual signal dilution during deep autoregressive generation.

  16. Mitigating Hallucinations in Large Vision-Language Models without Performance Degradation

    cs.CV 2026-04 unverdicted novelty 5.0

    MPD reduces hallucinations in LVLMs by 23.4% while retaining 97.4% of general capability through semantic disentanglement and selective parameter updates.

  17. Mitigating Entangled Steering in Large Vision-Language Models for Hallucination Reduction

    cs.CV 2026-04 unverdicted novelty 5.0

    MESA reduces hallucinations in LVLMs via controlled selective latent intervention that preserves the original token distribution.

  18. Steering the Verifiability of Multimodal AI Hallucinations

    cs.AI 2026-04 unverdicted novelty 5.0

    Researchers create a human-labeled dataset of obvious and elusive multimodal hallucinations and use learned activation-space probes to control their verifiability in MLLMs.

  19. WSVD: Weighted Low-Rank Approximation for Fast and Efficient Execution of Low-Precision Vision-Language Models

    cs.CV 2026-04 unverdicted novelty 5.0

    WSVD delivers over 1.8x faster VLM decoding via weighted low-rank approximation at fine granularity plus quantization, without accuracy loss.

  20. Hallucination of Multimodal Large Language Models: A Survey

    cs.CV 2024-04 accept novelty 5.0

    The survey organizes causes of hallucinations in MLLMs, reviews evaluation benchmarks and metrics, and outlines mitigation approaches plus open questions.

  21. A Survey on Hallucination in Large Language Models: Principles, Taxonomy, Challenges, and Open Questions

    cs.CL 2023-11 unverdicted novelty 5.0

    The paper surveys hallucination in LLMs with an innovative taxonomy, factors, detection methods, benchmarks, mitigation strategies, and open research directions.

  22. SmoGVLM: A Small, Graph-enhanced Vision-Language Model

    cs.CV 2026-04 unverdicted novelty 4.0

    A graph-enhanced 1.3B-parameter VLM achieves up to 16.24% gains and outperforms larger VLMs by integrating structured knowledge via GNNs.

Reference graph

Works this paper leans on

56 extracted references · 56 canonical work pages · cited by 21 Pith papers · 13 internal anchors

  1. [1]

    Flamingo: a visual language model for few-shot learning

    [Alayrac et al., 2022] Jean-Baptiste Alayrac, Jeff Donahue, Pauline Luc, et al. Flamingo: a visual language model for few-shot learning. In NeurIPS, volume 35,

  2. [2]

    Qwen-VL: A Versatile Vision-Language Model for Understanding, Localization, Text Reading, and Beyond

    [Bai et al., 2023] Jinze Bai, Shuai Bai, Shusheng Yang, et al. Qwen-vl: A frontier large vision-language model with versatile abilities. arXiv preprint arXiv:2308.12966,

  3. [3]

    Position-enhanced visual instruction tuning for multimodal large language models

    [Chen et al., 2023a] Chi Chen, Ruoyu Qin, Fuwen Luo, et al. Position-enhanced visual instruction tuning for multimodal large language models. arXiv preprint arXiv:2308.13437,

  4. [4]

    Minigpt-v2: large language model as a unified interface for vision-language multi-task learning

    [Chen et al., 2023b] Jun Chen, Deyao Zhu, Xiaoqian Shen, et al. Minigpt-v2: large language model as a unified interface for vision-language multi-task learning. arXiv preprint arXiv:2310.09478,

  5. [5]

    InternVL: Scaling up Vision Foundation Models and Aligning for Generic Visual-Linguistic Tasks

    [Chen et al., 2023c] Zhe Chen, Jiannan Wu, Wenhai Wang, et al. Internvl: Scaling up vision foundation models and aligning for generic visual-linguistic tasks. arXiv preprint arXiv:2312.14238,

  6. [6]

    Can we edit multimodal large language models?

    [Cheng et al., 2023] Siyuan Cheng, Bozhong Tian, Qingbin Liu, et al. Can we edit multimodal large language models? In EMNLP,

  7. [7]

    Fine-grained image captioning with CLIP reward

    [Cho et al., 2022] Jaemin Cho, Seunghyun Yoon, Ajinkya Kale, et al. Fine-grained image captioning with CLIP reward. In Findings of NAACL,

  8. [8]

    Dola: Decoding by contrasting layers improves factuality in large language models

    [Chuang et al., 2023] Yung-Sung Chuang, Yujia Xie, Hongyin Luo, et al. Dola: Decoding by contrasting layers improves factuality in large language models. arXiv preprint arXiv:2309.03883,

  9. [9]

    InstructBLIP: Towards General-purpose Vision-Language Models with Instruction Tuning

    [Dai et al., 2023a] Wenliang Dai, Junnan Li, Dongxu Li, et al. Instructblip: Towards general-purpose vision-language models with instruction tuning. arXiv preprint arXiv:2305.06500,

  10. [10]

    Neural path hunter: Reducing hallucination in dialogue systems via path grounding

    [Dziri et al., 2021] Nouha Dziri, Andrea Madotto, Osmar R Zaiane, et al. Neural path hunter: Reducing hallucination in dialogue systems via path grounding. In EMNLP,

  11. [11]

    LLaMA-Adapter V2: Parameter-Efficient Visual Instruction Model

    [Gao et al., 2023] Peng Gao, Jiaming Han, Renrui Zhang, et al. Llama-adapter v2: Parameter-efficient visual instruction model. arXiv preprint arXiv:2304.15010,

  12. [12]

    Imagebind one embedding space to bind them all

    [Girdhar et al., 2023] Rohit Girdhar, Alaaeldin El-Nouby, Zhuang Liu, et al. Imagebind one embedding space to bind them all. In CVPR,

  13. [13]

    Detecting and preventing hallucinations in large vision language models

    [Gunjal et al., 2023] Anisha Gunjal, Jihan Yin, and Erhan Bas. Detecting and preventing hallucinations in large vision language models. arXiv preprint arXiv:2308.06394,

  14. [14]

    ImageBind-LLM: Multi-modality Instruction Tuning

    [Han et al., 2023] Jiaming Han, Renrui Zhang, Wenqi Shao, et al. Imagebind-llm: Multi-modality instruction tuning. arXiv preprint arXiv:2309.03905,

  15. [15]

    The curious case of neural text degeneration

    [Holtzman et al., 2020] Ari Holtzman, Jan Buys, Li Du, et al. The curious case of neural text degeneration. In ICLR,

  16. [16]

    LoRA: Low-Rank Adaptation of Large Language Models

    [Hu et al., 2022] Edward J. Hu, Yelong Shen, Phillip Wallis, et al. Lora: Low-rank adaptation of large language models. In ICLR,

  17. [17]

    Ciem: Contrastive instruction evaluation method for better instruction tuning

    [Hu et al., 2023] Hongyu Hu, Jiyuan Zhang, Minyi Zhao, et al. Ciem: Contrastive instruction evaluation method for better instruction tuning. In NeurIPS 2023 Workshop on Instruction Tuning and Instruction Following,

  18. [18]

    A Survey on Hallucination in Large Language Models: Principles, Taxonomy, Challenges, and Open Questions

    [Huang et al., 2023a] Lei Huang, Weijiang Yu, Weitao Ma, et al. A survey on hallucination in large language models: Principles, taxonomy, challenges, and open questions. arXiv preprint arXiv:2311.05232,

  19. [19]

    Opera: Alleviating hallucination in multi-modal large language models via over-trust penalty and retrospection-allocation

    [Huang et al., 2023b] Qidong Huang, Xiaoyi Dong, Pan Zhang, et al. Opera: Alleviating hallucination in multi-modal large language models via over-trust penalty and retrospection-allocation. arXiv preprint arXiv:2311.17911,

  20. [20]

    Vcoder: Versatile vision encoders for multimodal large language models

    [Jain et al., 2023] Jitesh Jain, Jianwei Yang, and Humphrey Shi. Vcoder: Versatile vision encoders for multimodal large language models. arXiv preprint arXiv:2312.14233,

  21. [21]

    Survey of hallucination in natural language generation

    [Ji et al., 2023] Ziwei Ji, Nayeon Lee, Rita Frieske, et al. Survey of hallucination in natural language generation. ACM Computing Surveys, 55(12),

  22. [22]

    Hallucination augmented contrastive learning for multimodal large language model

    [Jiang et al., 2023] Chaoya Jiang, Haiyang Xu, Mengfan Dong, et al. Hallucination augmented contrastive learning for multimodal large language model. arXiv preprint arXiv:2312.06968,

  23. [23]

    Faithscore: Evaluating hallucinations in large vision-language models

    [Jing et al., 2023] Liqiang Jing, Ruosen Li, Yunmo Chen, et al. Faithscore: Evaluating hallucinations in large vision-language models. arXiv preprint arXiv:2311.01477,

  24. [24]

    Volcano: Mitigating multimodal hallucination through self-feedback guided revision

    [Lee et al., 2023] Seongyun Lee, Sue Hyun Park, Yongrae Jo, et al. Volcano: Mitigating multimodal hallucination through self-feedback guided revision. arXiv preprint arXiv:2311.07362,

  25. [25]

    Mitigating object hallucinations in large vision-language models through visual contrastive decoding

    [Leng et al., 2023] Sicong Leng, Hang Zhang, Guanzheng Chen, et al. Mitigating object hallucinations in large vision-language models through visual contrastive decoding. arXiv preprint arXiv:2311.16922,

  26. [26]

    Monkey: Image resolution and text label are important things for large multi-modal models

    [Li et al., 2023c] Zhang Li, Biao Yang, Qiang Liu, et al. Monkey: Image resolution and text label are important things for large multi-modal models. arXiv preprint arXiv:2311.06607,

  27. [27]

    Video-LLaVA: Learning United Visual Representation by Alignment Before Projection

    [Lin et al., 2023] Bin Lin, Bin Zhu, Yang Ye, et al. Video-llava: Learning united visual representation by alignment before projection. arXiv preprint arXiv:2311.10122,

  28. [28]

    Mitigating Hallucination in Large Multi-Modal Models via Robust Instruction Tuning

    [Liu et al., 2023a] Fuxiao Liu, Kevin Lin, Linjie Li, et al. Mitigating hallucination in large multi-modal models via robust instruction tuning. arXiv preprint arXiv:2306.14565,

  29. [29]

    Improved Baselines with Visual Instruction Tuning

    [Liu et al., 2023b] Haotian Liu, Chunyuan Li, Yuheng Li, et al. Improved baselines with visual instruction tuning. arXiv preprint arXiv:2310.03744,

  30. [30]

    Llava-plus: Learning to use tools for creating multimodal agents

    [Liu et al., 2023d] Shilong Liu, Hao Cheng, Haotian Liu, et al. Llava-plus: Learning to use tools for creating multimodal agents. arXiv preprint arXiv:2311.05437,

  31. [31]

    Vision-and-language pretrained models: A survey

    [Long et al., 2022] Siqu Long, Feiqi Cao, Soyeon Caren Han, et al. Vision-and-language pretrained models: A survey. In IJCAI,

  32. [32]

    Negative object presence evaluation (nope) to measure object hallucination in vision-language models

    [Lovenia et al., 2023] Holy Lovenia, Wenliang Dai, Samuel Cahyawijaya, et al. Negative object presence evaluation (nope) to measure object hallucination in vision-language models. arXiv preprint arXiv:2310.05338,

  33. [33]

    Neural baby talk

    [Lu et al., 2018] Jiasen Lu, Jianwei Yang, Dhruv Batra, et al. Neural baby talk. In CVPR,

  34. [34]

    Evaluation and mitigation of agnosia in multimodal large language models

    [Lu et al., 2023] Jiaying Lu, Jinmeng Rao, Kezhen Chen, et al. Evaluation and mitigation of agnosia in multimodal large language models. arXiv preprint arXiv:2309.04041,

  35. [35]

    GPT-4 Technical Report

    [OpenAI, 2023] OpenAI. Gpt-4 technical report. arXiv preprint arXiv:2303.08774,

  36. [36]

    Learning transferable visual models from natural language supervision

    [Radford et al., 2021] Alec Radford, Jong Wook Kim, Chris Hallacy, et al. Learning transferable visual models from natural language supervision. In ICML,

  37. [37]

    Direct Preference Optimization: Your Language Model is Secretly a Reward Model

    [Rafailov et al., 2023] Rafael Rafailov, Archit Sharma, Eric Mitchell, et al. Direct preference optimization: Your language model is secretly a reward model. arXiv preprint arXiv:2305.18290,

  38. [38]

    Object hallucination in image captioning

    [Rohrbach et al., 2018] Anna Rohrbach, Lisa Anne Hendricks, Kaylee Burns, et al. Object hallucination in image captioning. In EMNLP,

  39. [39]

    Learning to summarize with human feedback

    [Stiennon et al., 2020] Nisan Stiennon, Long Ouyang, Jeffrey Wu, et al. Learning to summarize with human feedback. In NeurIPS, volume 33,

  40. [40]

    Aligning large multimodal models with factually augmented RLHF

    [Sun et al., 2023] Zhiqing Sun, Sheng Shen, Shengcao Cao, et al. Aligning large multimodal models with factually augmented rlhf. arXiv preprint arXiv:2309.14525,

  41. [41]

    LLaMA: Open and Efficient Foundation Language Models

    [Touvron et al., 2023a] Hugo Touvron, Thibaut Lavril, Gautier Izacard, et al. Llama: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971,

  42. [42]

    Llama 2: Open Foundation and Fine-Tuned Chat Models

    [Touvron et al., 2023b] Hugo Touvron, Louis Martin, Kevin Stone, et al. Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288,

  43. [43]

    Vigc: Visual instruction generation and correction

    [Wang et al., 2023a] Bin Wang, Fan Wu, Xiao Han, et al. Vigc: Visual instruction generation and correction. arXiv preprint arXiv:2308.12714,

  44. [44]

    An LLM-free multi-dimensional benchmark for MLLMs hallucination evaluation

    [Wang et al., 2023b] Junyang Wang, Yuhang Wang, Guohai Xu, et al. An llm-free multi-dimensional benchmark for mllms hallucination evaluation. arXiv preprint arXiv:2311.07397,

  45. [45]

    Evaluation and analysis of hallucination in large vision-language models

    [Wang et al., 2023c] Junyang Wang, Yiyang Zhou, Guohai Xu, et al. Evaluation and analysis of hallucination in large vision-language models. arXiv preprint arXiv:2308.15126,

  46. [46]

    A Survey on Multimodal Large Language Models

    [Yin et al., 2023a] Shukang Yin, Chaoyou Fu, Sirui Zhao, et al. A survey on multimodal large language models. arXiv preprint arXiv:2306.13549,

  47. [47]

    Woodpecker: Hallucination correction for multimodal large language models

    [Yin et al., 2023b] Shukang Yin, Chaoyou Fu, Sirui Zhao, et al. Woodpecker: Hallucination correction for multimodal large language models. arXiv preprint arXiv:2310.16045,

  48. [48]

    Ferret: Refer and ground anything anywhere at any granularity

    [You et al., 2023] Haoxuan You, Haotian Zhang, Zhe Gan, et al. Ferret: Refer and ground anything anywhere at any granularity. arXiv preprint arXiv:2310.07704,

  49. [49]

    RLHF-V: Towards trustworthy MLLMs via behavior alignment from fine-grained correctional human feedback

    [Yu et al., 2023] Tianyu Yu, Yuan Yao, Haoye Zhang, et al. Rlhf-v: Towards trustworthy mllms via behavior alignment from fine-grained correctional human feedback. arXiv preprint arXiv:2312.00849,

  50. [50]

    Halle-switch: Rethinking and controlling object existence hallucinations in large vision language models for detailed caption

    [Zhai et al., 2023] Bohan Zhai, Shijia Yang, Xiangchen Zhao, et al. Halle-switch: Rethinking and controlling object existence hallucinations in large vision language models for detailed caption. arXiv preprint arXiv:2310.01779,

  51. [51]

    Recognize anything: A strong image tagging model

    [Zhang et al., 2023a] Youcai Zhang, Xinyu Huang, Jinyu Ma, et al. Recognize anything: A strong image tagging model. arXiv preprint arXiv:2306.03514,

  52. [52]

    Siren's Song in the AI Ocean: A Survey on Hallucination in Large Language Models

    [Zhang et al., 2023b] Yue Zhang, Yafu Li, Leyang Cui, et al. Siren’s song in the ai ocean: A survey on hallucination in large language models. arXiv preprint arXiv:2309.01219,

  53. [53]

    Enhancing the spatial awareness capability of multi-modal large language model

    [Zhao et al., 2023a] Yongqiang Zhao, Zhenyu Li, Zhi Jin, et al. Enhancing the spatial awareness capability of multi-modal large language model. arXiv preprint arXiv:2310.20357,

  54. [54]

    Beyond hallucinations: Enhancing LVLMs through hallucination-aware direct preference optimization

    [Zhao et al., 2023b] Zhiyuan Zhao, Bin Wang, Linke Ouyang, et al. Beyond hallucinations: Enhancing lvlms through hallucination-aware direct preference optimization. arXiv preprint arXiv:2311.16839,

  55. [55]

    Analyzing and mitigating object hallucination in large vision-language models

    [Zhou et al., 2023] Yiyang Zhou, Chenhang Cui, Jaehong Yoon, et al. Analyzing and mitigating object hallucination in large vision-language models. In NeurIPS 2023 Workshop on Instruction Tuning and Instruction Following,

  56. [56]

    MiniGPT-4: Enhancing Vision-Language Understanding with Advanced Large Language Models

    [Zhu et al., 2023] Deyao Zhu, Jun Chen, Xiaoqian Shen, et al. Minigpt-4: Enhancing vision-language understanding with advanced large language models. arXiv preprint arXiv:2304.10592, 2023