Pith · machine review for the scientific record

arXiv: 2404.18930 · v2 · submitted 2024-04-29 · 💻 cs.CV

Recognition: 3 Lean theorem links

Hallucination of Multimodal Large Language Models: A Survey

Authors on Pith: no claims yet

Pith reviewed 2026-05-11 12:26 UTC · model grok-4.3

classification: 💻 cs.CV
keywords: hallucination · multimodal large language models · MLLMs · survey · causes · benchmarks · mitigation

The pith

Multimodal large language models often produce text that contradicts the images they process, and this survey organizes the causes, benchmarks, and mitigation approaches for the problem.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper reviews advances in detecting, measuring, and reducing hallucinations in multimodal large language models, systems that combine vision and language processing. It classifies the underlying causes of outputs that do not match visual content and surveys available evaluation benchmarks, metrics, and reduction strategies. A sympathetic reader would care because these inconsistencies block dependable use in tasks such as image captioning or visual reasoning. The survey also examines current limitations and poses open questions to steer future improvements in model trustworthiness.

Core claim

By drawing the granular classification and landscapes of hallucination causes, evaluation benchmarks, and mitigation methods, this survey deepens the understanding of hallucinations in MLLMs and inspires further advancements, while formulating open questions that delineate potential pathways for future research.

What carries the argument

The granular classification of hallucination causes, benchmarks, metrics, and mitigation strategies that structures the research landscape and highlights gaps.
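
To make the shape of such a classification concrete, here is a minimal sketch of the landscape as plain data. The category and example names are editorial assumptions loosely based on the abstract's causes/benchmarks/mitigation split, not the survey's actual headings.

```python
# Illustrative sketch of a survey-style taxonomy as plain data.
# Category and example names are editorial assumptions, not the
# survey's actual section headings.
taxonomy = {
    "causes": {
        "data": ["noisy image-text pairs", "co-occurrence bias"],
        "model": ["weak visual grounding", "over-reliance on language priors"],
        "training": ["imperfect alignment objectives"],
        "inference": ["error accumulation over long generations"],
    },
    "evaluation": {
        "benchmarks": ["object-level QA probes", "open-ended caption audits"],
        "metrics": ["hallucinated-object rates", "model-based judges"],
    },
    "mitigation": {
        "data": ["cleaned or counterfactual instruction-tuning sets"],
        "decoding": ["contrastive or penalty-based decoding"],
        "post-hoc": ["detect-then-revise pipelines"],
    },
}

def gaps(tax: dict) -> list[str]:
    """Return leaf categories with no surveyed entries: candidate gaps."""
    return [
        f"{top}/{sub}"
        for top, subs in tax.items()
        for sub, entries in subs.items()
        if not entries
    ]

print(gaps(taxonomy))  # [] here; a real audit would flag empty branches
```

Structuring the landscape this way is what lets a reader check coverage mechanically, which is the same gap the referee report below presses on.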

If this is right

  • Targeted fixes can be developed once specific causes are isolated through the classification.
  • New models can be tested against the compiled benchmarks and metrics for consistent evaluation (a sketch of one such metric follows this list).
  • Combining reviewed mitigation strategies may yield stronger overall reliability in deployed systems.
  • Open questions point to concrete next steps such as better detection during generation.
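
On the second point, here is a minimal sketch of one widely used metric family, a CHAIR-style object-hallucination rate (after Rohrbach et al., 2018). The object sets below are invented for illustration; a real evaluation would extract mentioned objects with a proper parser and take ground truth from the benchmark's annotations.

```python
# Minimal CHAIR-style object-hallucination metric (after Rohrbach et al., 2018).
# Inputs are hypothetical; real evaluations parse captions for object mentions
# and compare against the benchmark's ground-truth object annotations.

def chair_scores(samples):
    """samples: list of (mentioned_objects, ground_truth_objects) set pairs."""
    mentioned_total = 0
    hallucinated_total = 0
    captions_with_hallucination = 0
    for mentioned, truth in samples:
        hallucinated = mentioned - truth          # objects not in the image
        mentioned_total += len(mentioned)
        hallucinated_total += len(hallucinated)
        captions_with_hallucination += bool(hallucinated)
    chair_i = hallucinated_total / max(mentioned_total, 1)        # per-object
    chair_s = captions_with_hallucination / max(len(samples), 1)  # per-caption
    return chair_i, chair_s

samples = [
    ({"dog", "frisbee"}, {"dog", "frisbee", "grass"}),  # faithful caption
    ({"dog", "leash"}, {"dog", "grass"}),               # "leash" hallucinated
]
print(chair_scores(samples))  # (0.25, 0.5)
```

Running newer models through the same fixed scorer is what makes cross-model comparisons consistent.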

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same cause categories could be tested on text-only models to check whether visual input introduces distinct hallucination patterns.
  • Mitigation techniques listed might be integrated into training rather than applied only at inference time.
  • Re-running the benchmarks on models released after the survey could expose whether new hallucination forms have emerged.

Load-bearing premise

The reviewed research represents the full scope of the field and the proposed classifications capture the main categories without major omissions.

What would settle it

A later study that identifies a widespread hallucination cause absent from the survey's taxonomy or demonstrates that the reviewed benchmarks miss common real-world errors.

Original abstract

This survey presents a comprehensive analysis of the phenomenon of hallucination in multimodal large language models (MLLMs), also known as Large Vision-Language Models (LVLMs), which have demonstrated significant advancements and remarkable abilities in multimodal tasks. Despite these promising developments, MLLMs often generate outputs that are inconsistent with the visual content, a challenge known as hallucination, which poses substantial obstacles to their practical deployment and raises concerns regarding their reliability in real-world applications. This problem has attracted increasing attention, prompting efforts to detect and mitigate such inaccuracies. We review recent advances in identifying, evaluating, and mitigating these hallucinations, offering a detailed overview of the underlying causes, evaluation benchmarks, metrics, and strategies developed to address this issue. Additionally, we analyze the current challenges and limitations, formulating open questions that delineate potential pathways for future research. By drawing the granular classification and landscapes of hallucination causes, evaluation benchmarks, and mitigation methods, this survey aims to deepen the understanding of hallucinations in MLLMs and inspire further advancements in the field. Through our thorough and in-depth review, we contribute to the ongoing dialogue on enhancing the robustness and reliability of MLLMs, providing valuable insights and resources for researchers and practitioners alike. Resources are available at: https://github.com/showlab/Awesome-MLLM-Hallucination.

Editorial analysis

A structured set of objections, weighed in public.

Referee report, simulated authors' rebuttal, circularity audit, and an axiom ledger. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

2 major / 2 minor

Summary. This survey paper examines hallucinations in multimodal large language models (MLLMs/LVLMs). It reviews the underlying causes, presents evaluation benchmarks and metrics, surveys mitigation strategies, discusses current challenges and limitations, and outlines open questions for future work. The authors provide granular classifications of causes, benchmarks, and methods, and link to a GitHub repository of resources.

Significance. If the coverage is representative, the survey would be a useful organizing resource in a rapidly expanding area of multimodal AI. The structured taxonomies of causes, evaluations, and mitigations, together with the public GitHub repository, could help researchers locate relevant work and identify gaps. The paper's value rests on the completeness of its synthesis rather than on new derivations or experiments.

major comments (2)
  1. [Introduction] Introduction and overall survey scope: the manuscript states it offers a 'comprehensive analysis' and 'granular classification' of causes, benchmarks, and mitigation methods, yet provides no description of the literature search methodology (search date, keywords, databases, inclusion/exclusion criteria, or number of papers screened). In a field that expanded explosively after 2023, this omission leaves the representativeness of the taxonomy unverified and is load-bearing for the central claim.
  2. [Sections 3-5] Sections on causes, evaluation, and mitigation (throughout): without explicit selection criteria or a coverage audit, it is impossible to determine whether influential papers or entire sub-areas (e.g., specific recent benchmarks or mitigation techniques) have been omitted from the proposed classifications.
minor comments (2)
  1. [Abstract] The abstract and introduction could briefly note the approximate number of papers reviewed and the final search cutoff date to give readers an immediate sense of scope.
  2. [Figures] Figure captions for the taxonomy diagrams would benefit from more explicit legends indicating how categories were derived.

Simulated Authors' Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments and positive overall assessment of our survey. We address each major comment point by point below and will make revisions to improve transparency on literature selection and scope.

Point-by-point responses
  1. Referee: [Introduction] Introduction and overall survey scope: the manuscript states it offers a 'comprehensive analysis' and 'granular classification' of causes, benchmarks, and mitigation methods, yet provides no description of the literature search methodology (search date, keywords, databases, inclusion/exclusion criteria, or number of papers screened). In a field that expanded explosively after 2023, this omission leaves the representativeness of the taxonomy unverified and is load-bearing for the central claim.

    Authors: We acknowledge that an explicit description of the literature search process is absent from the current manuscript. Our survey was compiled by reviewing key papers on MLLM hallucinations from major sources up to the submission period, with emphasis on high-impact works that directly address causes, evaluations, or mitigations. To strengthen verifiability, we will add a dedicated paragraph in the revised Introduction detailing the methodology: search conducted through March 2024 using keywords including 'MLLM hallucination', 'LVLM hallucination', 'multimodal large language model hallucination', and 'vision-language model hallucination'; primary databases and venues comprising arXiv, Google Scholar, CVPR, ICCV, NeurIPS, ICLR, and ACL; and inclusion focused on papers contributing novel insights into hallucination phenomena, benchmarks, metrics, or mitigation strategies. This addition will clarify the basis for our taxonomies without altering the survey's core content (see the illustrative search sketch following these responses). revision: yes

  2. Referee: [Sections 3-5] Sections on causes, evaluation, and mitigation (throughout): without explicit selection criteria or a coverage audit, it is impossible to determine whether influential papers or entire sub-areas (e.g., specific recent benchmarks or mitigation techniques) have been omitted from the proposed classifications.

    Authors: We agree that greater transparency on selection would help readers evaluate potential gaps in the classifications across Sections 3-5. The methodology paragraph added to the Introduction (as described above) will explicitly outline our criteria and sources, enabling assessment of coverage for causes, benchmarks, and mitigations. Additionally, we will expand the 'Challenges and Limitations' section to note that the rapid post-2023 expansion of the field means some very recent sub-area developments may not be exhaustively covered, while highlighting our public GitHub repository (https://github.com/showlab/Awesome-MLLM-Hallucination) as a mechanism for community updates and identification of omissions. These changes address the concern directly while preserving the survey's focus on synthesis rather than exhaustive enumeration. revision: yes
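
The search methodology promised in the first response can be made concretely reproducible. Below is a hypothetical sketch of one keyword query against the public arXiv API, using the keywords and the March 2024 cutoff stated in the rebuttal; it is an editorial illustration rather than the authors' actual pipeline, and a full audit would also cover Google Scholar and the named venues.

```python
# Hypothetical reproducible-search sketch against the public arXiv API.
# Keywords and the March 2024 cutoff come from the simulated rebuttal;
# everything else (result cap, "all" field) is an illustrative assumption.
import urllib.parse
import urllib.request

KEYWORDS = [
    "MLLM hallucination",
    "LVLM hallucination",
    "multimodal large language model hallucination",
    "vision-language model hallucination",
]

def arxiv_query(keyword: str, max_results: int = 100) -> str:
    """Return the raw Atom feed for one keyword, newest submissions first."""
    params = urllib.parse.urlencode({
        "search_query": f'all:"{keyword}"',
        "sortBy": "submittedDate",
        "sortOrder": "descending",
        "max_results": max_results,
    })
    url = f"http://export.arxiv.org/api/query?{params}"
    with urllib.request.urlopen(url) as resp:
        return resp.read().decode("utf-8")

# A coverage audit would parse each feed, drop entries submitted after
# 2024-03-31, deduplicate across keywords, and log counts per screening stage.
feeds = {kw: arxiv_query(kw) for kw in KEYWORDS}
```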

Circularity Check

0 steps flagged

No circularity: pure literature synthesis without derivations or self-referential reductions

Full rationale

This is a survey paper whose central contribution is an overview and taxonomy drawn from external literature. No equations, fitted parameters, or original derivations appear anywhere in the manuscript. The classifications of causes, benchmarks, and mitigation strategies are presented as syntheses of cited prior work rather than as outputs derived from the paper's own inputs. Self-citations, if present, are not load-bearing for any claimed result; the paper does not invoke uniqueness theorems or ansatzes from its own prior publications to justify its structure. The assumption of comprehensive coverage is a methodological limitation of surveys in general but does not constitute circularity under the defined criteria, as there is no reduction of a prediction or claim to the paper's own fitted values or self-citations.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The survey relies on standard literature review assumptions without introducing new parameters or entities.

axioms (1)
  • domain assumption: The selected recent advances sufficiently represent the current state of hallucination research in MLLMs.
    Survey completeness depends on the authors' curation of key papers.

pith-pipeline@v0.9.0 · 5547 in / 988 out tokens · 102847 ms · 2026-05-11T12:26:16.676278+00:00 · methodology


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

  • Cost Jcost_nonneg · unclear

    Relation between the paper passage and the cited Recognition theorem:

    "We review recent advances in identifying, evaluating, and mitigating these hallucinations, offering a detailed overview of the underlying causes, evaluation benchmarks, metrics, and strategies developed to address this issue."

  • PhiForcing phi_unique_pos · unclear

    Relation between the paper passage and the cited Recognition theorem:

    "By drawing the granular classification and landscapes of hallucination causes, evaluation benchmarks, and mitigation methods"

  • DimensionForcing dimension_forced · unclear

    Relation between the paper passage and the cited Recognition theorem:

    "Hallucination of Multimodal Large Language Models: A Survey"

What do these tags mean?

  • matches: The paper's claim is directly supported by a theorem in the formal canon.
  • supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
  • extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
  • uses: The paper appears to rely on the theorem as machinery.
  • contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
  • unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 41 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. TraceAV-Bench: Benchmarking Multi-Hop Trajectory Reasoning over Long Audio-Visual Videos

    cs.CV 2026-05 unverdicted novelty 8.0

    TraceAV-Bench is the first benchmark for multi-hop trajectory reasoning over long audio-visual videos, showing top models reach only 51-68% accuracy with substantial room for improvement.

  2. VISTA: Video Interaction Spatio-Temporal Analysis Benchmark

    cs.CV 2026-05 unverdicted novelty 8.0

    VISTA is the first large-scale interaction-aware benchmark that decomposes videos into entities, actions, and relations to diagnose spatio-temporal biases in vision-language models.

  3. 3D-VCD: Hallucination Mitigation in 3D-LLM Embodied Agents through Visual Contrastive Decoding

    cs.CV 2026-04 unverdicted novelty 8.0

    3D-VCD reduces hallucinations in 3D-LLM embodied agents by contrasting predictions from original and distorted 3D scene representations at inference time.

  4. Decoding by Perturbation: Mitigating MLLM Hallucinations via Dynamic Textual Perturbation

    cs.CL 2026-04 unverdicted novelty 7.0

    DeP mitigates MLLM hallucinations by dynamically perturbing text prompts to identify and reinforce stable visual evidence regions while counteracting language prior biases using attention variance and logit statistics.

  5. DetailVerifyBench: A Benchmark for Dense Hallucination Localization in Long Image Captions

    cs.CV 2026-04 unverdicted novelty 7.0

    DetailVerifyBench supplies 1,000 images and densely annotated long captions to evaluate precise hallucination localization in multimodal large language models.

  6. Instruction Lens Score: Your Instruction Contributes a Powerful Object Hallucination Detector for Multimodal Large Language Models

    cs.LG 2026-05 unverdicted novelty 6.0

    Instruction token embeddings encode visual information that can be leveraged to detect object hallucinations in MLLMs via a new combined score outperforming prior detectors.

  7. Mitigating Action-Relation Hallucinations in LVLMs via Relation-aware Visual Enhancement

    cs.CV 2026-05 unverdicted novelty 6.0

    A new attention-enhancement method using ARS scores and RVE reduces action-relation hallucinations in LVLMs while generalizing to spatial and object hallucinations.

  8. When Looking Is Not Enough: Visual Attention Structure Reveals Hallucination in MLLMs

    cs.CV 2026-05 unverdicted novelty 6.0

    Layer-wise Laplacian energy of visual attention reveals hallucination emergence in MLLMs and enables LaSCD, a closed-form logit remapping strategy that mitigates hallucinations while preserving general performance.

  9. Vocabulary Hijacking in LVLMs: Unveiling Critical Attention Heads by Excluding Inert Tokens to Mitigate Hallucination

    cs.MM 2026-05 unverdicted novelty 6.0

    LVLMs show vocabulary hijacking by inert tokens that decode to hijacking anchors; HABI locates them, NHAR finds resilient heads, and HAVAE boosts those heads to cut hallucinations.

  10. Common-agency Games for Multi-Objective Test-Time Alignment

    cs.GT 2026-05 unverdicted novelty 6.0

    CAGE uses common-agency games and an EPEC algorithm to compute equilibrium policies that balance multiple conflicting objectives for test-time LLM alignment.

  11. Persistent Visual Memory: Sustaining Perception for Deep Generation in LVLMs

    cs.CV 2026-05 unverdicted novelty 6.0

    PVM adds a parallel branch to LVLMs that directly supplies visual embeddings to prevent attention decay over long generated sequences, yielding accuracy gains on reasoning tasks with minimal overhead.

  12. Online Self-Calibration Against Hallucination in Vision-Language Models

    cs.CV 2026-05 unverdicted novelty 6.0

    OSCAR exploits the generative-discriminative gap in LVLMs to build online preference data with MCTS and dual-granularity rewards for DPO-based calibration, claiming SOTA hallucination reduction and improved multimodal...

  13. Exploring Audio Hallucination in Egocentric Video Understanding

    cs.CV 2026-04 unverdicted novelty 6.0

    AV-LLMs hallucinate audio from visuals in egocentric videos, scoring only 27.3% accuracy on foreground sounds and 39.5% on background sounds in a 1000-question evaluation.

  14. SoccerRef-Agents: Multi-Agent System for Automated Soccer Refereeing

    cs.AI 2026-04 unverdicted novelty 6.0

    SoccerRef-Agents is a multi-agent framework using MLLMs, cross-modal RAG, and a custom knowledge base that outperforms general MLLMs on soccer foul decisions and explanations.

  15. When Prompts Override Vision: Prompt-Induced Hallucinations in LVLMs

    cs.CV 2026-04 unverdicted novelty 6.0

    Hallucinations in LVLMs largely arise from textual priors in prompts, and can be reduced by fine-tuning with preference optimization on grounded vs. hallucinated response pairs.

  16. HypEHR: Hyperbolic Modeling of Electronic Health Records for Efficient Question Answering

    cs.AI 2026-04 unverdicted novelty 6.0

    HypEHR is a hyperbolic embedding model for EHR data that uses Lorentzian geometry and hierarchy-aware pretraining to answer clinical questions nearly as well as large language models but with much smaller size.

  17. Evian: Towards Explainable Visual Instruction-tuning Data Auditing

    cs.CV 2026-04 unverdicted novelty 6.0

    EVian decomposes vision-language model responses into three cognitive components and audits them along consistency, coherence, and accuracy axes, showing that a small curated subset outperforms much larger training sets.

  18. Mitigating Multimodal Hallucination via Phase-wise Self-reward

    cs.CV 2026-04 unverdicted novelty 6.0

    PSRD mitigates visual hallucinations in LVLMs via phase-wise self-reward decoding, cutting rates by 50% on LLaVA-1.5-7B and outperforming prior methods on five benchmarks.

  19. Beyond Text-Dominance: Understanding Modality Preference of Omni-modal Large Language Models

    cs.AI 2026-04 unverdicted novelty 6.0

    Omni-modal LLMs exhibit visual preference that emerges in mid-to-late layers, enabling hallucination detection without task-specific training.

  20. SceneCritic: A Symbolic Evaluator for 3D Indoor Scene Synthesis

    cs.CV 2026-04 unverdicted novelty 6.0

    SceneCritic is a symbolic, ontology-grounded evaluator for floor-plan layouts that identifies specific semantic, orientation, and geometric violations and aligns better with human judgments than VLM-based scorers.

  21. HTDC: Hesitation-Triggered Differential Calibration for Mitigating Hallucination in Large Vision-Language Models

    cs.CV 2026-04 unverdicted novelty 6.0

    HTDC mitigates hallucinations in LVLMs by triggering calibration only at hesitation-prone decoding steps via contrasts with visual-nullification and semantic-nullification probes.

  22. Visual Enhanced Depth Scaling for Multimodal Latent Reasoning

    cs.CV 2026-04 unverdicted novelty 6.0

    Visual replay module and adaptive depth scaling improve multimodal latent reasoning, reaching SOTA benchmarks with faster inference than explicit chain-of-thought methods.

  23. Visually-Guided Policy Optimization for Multimodal Reasoning

    cs.CV 2026-04 unverdicted novelty 6.0

    VGPO introduces visual attention compensation and dual-grained advantage re-weighting to reinforce visual focus in VLMs, yielding better activation and performance on multimodal reasoning tasks.

  24. RASR: Retrieval-Augmented Semantic Reasoning for Fake News Video Detection

    cs.CV 2026-04 unverdicted novelty 6.0

    RASR retrieves cross-instance semantic evidence and uses domain priors to drive multimodal LLM reasoning for improved fake news video detection on FakeSV and FakeTT datasets.

  25. CharTool: Tool-Integrated Visual Reasoning for Chart Understanding

    cs.AI 2026-04 unverdicted novelty 6.0

    CharTool equips MLLMs with cropping and code tools plus agentic RL on DuoChart data to raise chart-reasoning accuracy by up to 9.78 percent on benchmarks.

  26. Not Blind but Silenced: Rebalancing Vision and Language via Adversarial Counter-Commonsense Equilibrium

    cs.CV 2026-05 unverdicted novelty 5.0

    ACE uses adversarial counter-commonsense perturbations on image tokens during decoding to suppress hallucinated linguistic priors while preserving stable visual signals in MLLMs.

  27. Persistent Visual Memory: Sustaining Perception for Deep Generation in LVLMs

    cs.CV 2026-05 unverdicted novelty 5.0

    PVM adds a parallel learnable branch to LVLMs that supplies visual embeddings on demand to structurally prevent attention decay and visual signal dilution during deep autoregressive generation.

  28. Measure Twice, Click Once: Co-evolving Proposer and Visual Critic via Reinforcement Learning for GUI Grounding

    cs.LG 2026-04 unverdicted novelty 5.0

    A co-evolving proposer-critic RL framework improves GUI grounding accuracy by letting the model critique its own proposals rendered on screenshots.

  29. Mitigating Hallucinations in Large Vision-Language Models without Performance Degradation

    cs.CV 2026-04 unverdicted novelty 5.0

    MPD reduces hallucinations in LVLMs by 23.4% while retaining 97.4% of general capability through semantic disentanglement and selective parameter updates.

  30. HalluClear: Diagnosing, Evaluating and Mitigating Hallucinations in GUI Agents

    cs.AI 2026-04 unverdicted novelty 5.0

    HalluClear supplies a taxonomy, calibrated evaluation, and lightweight post-training mitigation that reduces hallucinations in GUI agents using only 9K samples.

  31. Distorted or Fabricated? A Survey on Hallucination in Video LLMs

    cs.CV 2026-04 unverdicted novelty 5.0

    The survey organizes hallucinations in Vid-LLMs into dynamic distortion and content fabrication, reviews evaluation benchmarks and mitigation methods, and traces root causes to weak temporal modeling and visual grounding.

  32. Visual Enhanced Depth Scaling for Multimodal Latent Reasoning

    cs.CV 2026-04 unverdicted novelty 5.0

    A visual replay module combined with adaptive depth scaling improves multimodal latent reasoning, delivering state-of-the-art benchmark results and faster inference than explicit chain-of-thought methods.

  33. Visual Enhanced Depth Scaling for Multimodal Latent Reasoning

    cs.CV 2026-04 unverdicted novelty 5.0

    Visual replay and depth scaling in latent reasoning produce state-of-the-art multimodal results with faster inference than explicit CoT.

  34. Cognitive Pivot Points and Visual Anchoring: Unveiling and Rectifying Hallucinations in Multimodal Reasoning Models

    cs.AI 2026-04 unverdicted novelty 5.0

    Multimodal reasoning models hallucinate at high-entropy cognitive bifurcation points due to loss of visual semantic anchoring, and the V-STAR training paradigm with HVAR rewards and FRM reflection mitigates this by re...

  35. Spotlight and Shadow: Attention-Guided Dual-Anchor Introspective Decoding for MLLM Hallucination Mitigation

    cs.CV 2026-04 unverdicted novelty 5.0

    DaID mitigates MLLM hallucinations by attention-guided selection of dual layers that calibrate token generation using internal perceptual discrepancies.

  36. Mitigating Entangled Steering in Large Vision-Language Models for Hallucination Reduction

    cs.CV 2026-04 unverdicted novelty 5.0

    MESA reduces hallucinations in LVLMs via controlled selective latent intervention that preserves the original token distribution.

  37. Steering the Verifiability of Multimodal AI Hallucinations

    cs.AI 2026-04 unverdicted novelty 5.0

    Researchers create a human-labeled dataset of obvious and elusive multimodal hallucinations and use learned activation-space probes to control their verifiability in MLLMs.

  38. Show-o: One Single Transformer to Unify Multimodal Understanding and Generation

    cs.CV 2024-08 unverdicted novelty 5.0

    Show-o unifies autoregressive and discrete diffusion modeling inside one transformer to support multimodal understanding and generation tasks with competitive benchmark performance.

  39. Valley3: Scaling Omni Foundation Models for E-commerce

    cs.AI 2026-05 unverdicted novelty 4.0

    Valley3 is an omni MLLM for e-commerce that uses a four-stage pre-training pipeline plus post-training for controllable reasoning and agentic search, outperforming baselines on e-commerce benchmarks while staying comp...

  40. A Brief Overview: Agentic Reinforcement Learning In Large Language Models

    cs.AI 2026-04 unverdicted novelty 2.0

    This review synthesizes conceptual foundations, methods, challenges, and future directions for agentic reinforcement learning in large language models.

  41. A Brief Overview: Agentic Reinforcement Learning In Large Language Models

    cs.AI 2026-04 unverdicted novelty 2.0

    The paper surveys the conceptual foundations, methodological innovations, challenges, and future directions of agentic reinforcement learning frameworks that embed cognitive capabilities like meta-reasoning and self-r...

Reference graph

Works this paper leans on

237 extracted references · 237 canonical work pages · cited by 37 Pith papers · 34 internal anchors
