Object Hallucination in Image Captioning

Anna Rohrbach, Lisa Anne Hendricks, Kaylee Burns, Trevor Darrell, Kate Saenko

classification cs.CL cs.CV

keywords hallucination image models captioning object metrics standard assess

Despite continuously improving performance, contemporary image captioning models are prone to "hallucinating" objects that are not actually in a scene. One problem is that standard metrics only measure similarity to ground truth captions and may not fully capture image relevance. In this work, we propose a new image relevance metric to evaluate current models with veridical visual labels and assess their rate of object hallucination. We analyze how captioning model architectures and learning objectives contribute to object hallucination, explore when hallucination is likely due to image misclassification or language priors, and assess how well current sentence metrics capture object hallucination. We investigate these questions on the standard image captioning benchmark, MSCOCO, using a diverse set of models. Our analysis yields several interesting findings, including that models which score best on standard sentence metrics do not always have lower hallucination and that models which hallucinate more tend to make errors driven by language priors.
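
The abstract does not spell out how the rate of object hallucination is computed, so the following is only a minimal sketch of one plausible way to measure it. It assumes that object mentions have already been extracted from each generated caption and that each image comes with a set of ground-truth object labels. The function name `hallucination_rates` and both input structures are hypothetical and are not taken from the paper.

```python
from typing import List, Sequence, Set, Tuple


def hallucination_rates(
    mentioned: Sequence[Sequence[str]],
    present: Sequence[Set[str]],
) -> Tuple[float, float]:
    """Return (per-object rate, per-caption rate) of hallucinated objects.

    mentioned[i]: object names mentioned in the caption for image i (assumed
                  to be pre-extracted and mapped to the label vocabulary).
    present[i]:   object names that are actually in image i (ground truth).
    """
    if len(mentioned) != len(present):
        raise ValueError("need one ground-truth label set per caption")

    total_mentions = 0
    hallucinated_mentions = 0
    captions_with_hallucination = 0

    for objects, truth in zip(mentioned, present):
        # A mention counts as hallucinated if the object is not in the image.
        wrong: List[str] = [obj for obj in objects if obj not in truth]
        total_mentions += len(objects)
        hallucinated_mentions += len(wrong)
        if wrong:
            captions_with_hallucination += 1

    # Guard against empty input so the function never divides by zero.
    per_object = hallucinated_mentions / total_mentions if total_mentions else 0.0
    per_caption = captions_with_hallucination / len(mentioned) if mentioned else 0.0
    return per_object, per_caption


# Toy data, invented for illustration only.
captions = [["dog", "frisbee"], ["cat", "couch", "remote"], ["person"]]
labels = [{"dog", "frisbee", "person"}, {"cat", "couch"}, {"person", "bench"}]
per_object_rate, per_caption_rate = hallucination_rates(captions, labels)
```

In this toy input, one of six object mentions ("remote") is absent from its image, and one of three captions contains a hallucinated object. Reporting both rates separates how often individual mentions are wrong from how many captions are affected at all.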

Forward citations

Cited by 16 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Prefill-Time Intervention for Mitigating Hallucination in Large Vision-Language Models
cs.CV 2026-04 conditional novelty 7.0

Prefill-Time Intervention (PTI) reduces hallucinations in large vision-language models by applying a one-time modality-aware steering correction to the initial KV cache at the prefill stage rather than during autoregr...
Mitigating Action-Relation Hallucinations in LVLMs via Relation-aware Visual Enhancement
cs.CV 2026-05 unverdicted novelty 6.0

A new attention-enhancement method using ARS scores and RVE reduces action-relation hallucinations in LVLMs while generalizing to spatial and object hallucinations.
CAST: Mitigating Object Hallucination in Large Vision-Language Models via Caption-Guided Visual Attention Steering
cs.CV 2026-05 unverdicted novelty 6.0

CAST reduces object hallucination in LVLMs by 6.03% on average across five models and five benchmarks by identifying caption-sensitive attention heads and applying optimized steering directions to their outputs, with ...
Online Self-Calibration Against Hallucination in Vision-Language Models
cs.CV 2026-05 unverdicted novelty 6.0

OSCAR exploits the generative-discriminative gap in LVLMs to build online preference data with MCTS and dual-granularity rewards for DPO-based calibration, claiming SOTA hallucination reduction and improved multimodal...
Mitigating Multimodal Hallucination via Phase-wise Self-reward
cs.CV 2026-04 unverdicted novelty 6.0

PSRD mitigates visual hallucinations in LVLMs via phase-wise self-reward decoding, cutting rates by 50% on LLaVA-1.5-7B and outperforming prior methods on five benchmarks.
Revis: Sparse Latent Steering to Mitigate Object Hallucination in Large Vision-Language Models
cs.AI 2026-02 unverdicted novelty 6.0

REVIS reduces object hallucination in large vision-language models by about 19% via sparse orthogonal projection in latent space at suppression depths while keeping reasoning intact.
Contextualized Visual Personalization in Vision-Language Models
cs.CV 2026-02 unverdicted novelty 6.0

CoViP is a unified framework that improves vision-language models' personalized image captioning and downstream tasks through RL-based post-training while introducing diagnostics to confirm visual context usage.
AMBER: An LLM-free Multi-dimensional Benchmark for MLLMs Hallucination Evaluation
cs.CL 2023-11 unverdicted novelty 6.0

AMBER is an LLM-free multi-dimensional benchmark for evaluating hallucinations in MLLMs across generative and discriminative tasks.
Aligning Large Multimodal Models with Factually Augmented RLHF
cs.CV 2023-09 conditional novelty 6.0

Factually Augmented RLHF aligns large multimodal models to reduce hallucinations, reaching 94% of GPT-4's performance level on LLaVA-Bench and a 60% improvement on the new MMHAL-BENCH.
Mitigating Hallucination in Large Multi-Modal Models via Robust Instruction Tuning
cs.CV 2023-06 accept novelty 6.0

A new dataset of 400k visual instructions including negative examples at three semantic levels reduces hallucinations in models like MiniGPT-4 when used for fine-tuning while improving benchmark performance.
MiniGPT-4: Enhancing Vision-Language Understanding with Advanced Large Language Models
cs.CV 2023-04 conditional novelty 6.0

MiniGPT-4 shows that aligning a frozen vision encoder to Vicuna via one projection layer plus a second-stage detailed-description fine-tune produces GPT-4-like vision-language abilities including detailed captions, cr...
Mitigating Hallucinations in Large Vision-Language Models without Performance Degradation
cs.CV 2026-04 unverdicted novelty 5.0

MPD reduces hallucinations in LVLMs by 23.4% while retaining 97.4% of general capability through semantic disentanglement and selective parameter updates.
Spotlight and Shadow: Attention-Guided Dual-Anchor Introspective Decoding for MLLM Hallucination Mitigation
cs.CV 2026-04 unverdicted novelty 5.0

DaID mitigates MLLM hallucinations by attention-guided selection of dual layers that calibrate token generation using internal perceptual discrepancies.
MiniCPM-V: A GPT-4V Level MLLM on Your Phone
cs.CV 2024-08 conditional novelty 5.0

MiniCPM-Llama3-V 2.5 delivers GPT-4V-level multimodal performance on phones through architecture, pretraining, and alignment optimizations.
Hallucination of Multimodal Large Language Models: A Survey
cs.CV 2024-04 accept novelty 5.0

The survey organizes causes of hallucinations in MLLMs, reviews evaluation benchmarks and metrics, and outlines mitigation approaches plus open questions.
MiniGPT-v2: large language model as a unified interface for vision-language multi-task learning
cs.CV 2023-10 unverdicted novelty 5.0

MiniGPT-v2 adds unique task identifiers to a large language model so one system can perform image description, visual question answering, and visual grounding after three-stage training.