TRIVIA+ is a new long-context RAG hallucination benchmark with four noisy label variants that shows current detectors have substantial room for improvement and are hindered by label noise.
hub
Analyzing and Mitigating Object Hallucination in Large Vision-Language Models
22 Pith papers cite this work. Polarity classification is still indexing.
abstract
Large vision-language models (LVLMs) have shown remarkable abilities in understanding visual information with human languages. However, LVLMs still suffer from object hallucination, which is the problem of generating descriptions that include objects that do not actually exist in the images. This can negatively impact many vision-language tasks, such as visual summarization and reasoning. To address this issue, we propose a simple yet powerful algorithm, LVLM Hallucination Revisor (LURE), to post-hoc rectify object hallucination in LVLMs by reconstructing less hallucinatory descriptions. LURE is grounded in a rigorous statistical analysis of the key factors underlying object hallucination, including co-occurrence (the frequent appearance of certain objects alongside others in images), uncertainty (objects with higher uncertainty during LVLM decoding), and object position (hallucination often appears in the later part of the generated text). LURE can also be seamlessly integrated with any LVLMs. We evaluate LURE on six open-source LVLMs, achieving a 23% improvement in general object hallucination evaluation metrics over the previous best approach. In both GPT and human evaluations, LURE consistently ranks at the top. Our data and code are available at https://github.com/YiyangZhou/LURE.
hub tools
citation-role summary
citation-polarity summary
roles
background 3polarities
background 3representative citing papers
DO-Bench is a controlled benchmark that attributes VLM object hallucination errors to textual prior pressure, perceptual limits, or their interaction via two diagnostic dimensions and metrics.
Presents Med-HallMark benchmark, MediHall Score metric, and MediHallDetector model for hallucination detection and evaluation in medical LVLMs.
Reweighting training emphasis toward image-negative tokens and filtering hallucinated data reduces object hallucination in LVLMs across three model variants.
Natural language descriptions generated via a closed-loop pipeline with digital twins capture the selectivity of most neurons in macaque V1 and V4, with synthesized images driving 96% of V4 neurons into the top or bottom 5% of natural-image response distributions.
CAVI framework uses character-guided token pruning, orthogonal feature modulation, and modality-adaptive role steering to resolve modality-role interference in multimodal RPAs.
CAST reduces object hallucination in LVLMs by 6.03% on average across five models and five benchmarks by identifying caption-sensitive attention heads and applying optimized steering directions to their outputs, with negligible added inference cost.
LIME reduces hallucinations in multimodal LLMs by using LRP to boost perceptual modality contributions through inference-time KV updates.
HypEHR is a hyperbolic embedding model for EHR data that uses Lorentzian geometry and hierarchy-aware pretraining to answer clinical questions nearly as well as large language models but with much smaller size.
Decoder-side Temporal Rebalancing (DTR) reduces hallucinations in Video-LLMs by mitigating over-dominance of a single anchor frame during inference without training or auxiliary models.
ReflectCAP distills model-specific hallucination and oversight patterns into Structured Reflection Notes that steer LVLMs toward more factual and complete image captions, reaching the Pareto frontier on factuality-coverage trade-offs.
SENTINEL reduces MLLM object hallucinations by over 90% via sentence-level early intervention with detector-bootstrapped preference data and C-DPO loss, outperforming prior SOTA on hallucination and capability benchmarks.
UE-DPO quantifies epistemic uncertainty from grounding failures to direct more learning pressure on hard visual tokens in preferred samples while easing penalties on dispreferred ones.
MPD reduces hallucinations in LVLMs by 23.4% while retaining 97.4% of general capability through semantic disentanglement and selective parameter updates.
VCE mitigates object hallucination in LVLMs by decomposing activation patterns from contrastive visual inputs via SVD to suppress hallucination subspaces through targeted parameter edits.
ReVisiT refines LVLM output distributions during decoding by projecting selected vision tokens into text space via context-aware constrained divergence minimization.
CAAC mitigates hallucinations in LVLMs via Visual-Token Calibration and Adaptive Attention Re-Scaling guided by model confidence, showing gains on CHAIR, AMBER, and POPE especially in long-form generation.
The survey organizes causes of hallucinations in MLLMs, reviews evaluation benchmarks and metrics, and outlines mitigation approaches plus open questions.
POVID generates AI-created preference data to fine-tune vision-language models with DPO, reducing hallucinations and improving benchmark scores.
The paper defines Agent AI as interactive multimodal systems that perceive grounded data and generate embodied actions, arguing this approach can mitigate hallucinations in foundation models.
This survey organizes the architectures, training strategies, data, evaluation methods, extensions, and challenges of Multimodal Large Language Models.
The paper reviews the background, technology, applications, limitations, and future directions of OpenAI's Sora text-to-video generative model based on public information.
citing papers explorer
-
Rethinking Evaluation for LLM Hallucination Detection: A Desiderata, A New RAG-based Benchmark, New Insights
TRIVIA+ is a new long-context RAG hallucination benchmark with four noisy label variants that shows current detectors have substantial room for improvement and are hindered by label noise.
-
DO-Bench: An Attributable Benchmark for Diagnosing Object Hallucination in Vision-Language Models
DO-Bench is a controlled benchmark that attributes VLM object hallucination errors to textual prior pressure, perceptual limits, or their interaction via two diagnostic dimensions and metrics.
-
Detecting and Evaluating Medical Hallucinations in Large Vision Language Models
Presents Med-HallMark benchmark, MediHall Score metric, and MediHallDetector model for hallucination detection and evaluation in medical LVLMs.
-
Reducing Object Hallucination in LVLMs via Emphasizing Image-negative Tokens
Reweighting training emphasis toward image-negative tokens and filtering hallucinated data reduces object hallucination in LVLMs across three model variants.
-
Letting the neural code speak: Automated characterization of monkey visual neurons through human language
Natural language descriptions generated via a closed-loop pipeline with digital twins capture the selectivity of most neurons in macaque V1 and V4, with synthesized images driving 96% of V4 neurons into the top or bottom 5% of natural-image response distributions.
-
Through the Lens of Character: Resolving Modality-Role Interference in Multimodal Role-Playing Agent
CAVI framework uses character-guided token pruning, orthogonal feature modulation, and modality-adaptive role steering to resolve modality-role interference in multimodal RPAs.
-
CAST: Mitigating Object Hallucination in Large Vision-Language Models via Caption-Guided Visual Attention Steering
CAST reduces object hallucination in LVLMs by 6.03% on average across five models and five benchmarks by identifying caption-sensitive attention heads and applying optimized steering directions to their outputs, with negligible added inference cost.
-
Mitigating Multimodal LLMs Hallucinations via Relevance Propagation at Inference Time
LIME reduces hallucinations in multimodal LLMs by using LRP to boost perceptual modality contributions through inference-time KV updates.
-
HypEHR: Hyperbolic Modeling of Electronic Health Records for Efficient Question Answering
HypEHR is a hyperbolic embedding model for EHR data that uses Lorentzian geometry and hierarchy-aware pretraining to answer clinical questions nearly as well as large language models but with much smaller size.
-
Relaxing Anchor-Frame Dominance for Mitigating Hallucinations in Video Large Language Models
Decoder-side Temporal Rebalancing (DTR) reduces hallucinations in Video-LLMs by mitigating over-dominance of a single anchor frame during inference without training or auxiliary models.
-
ReflectCAP: Detailed Image Captioning with Reflective Memory
ReflectCAP distills model-specific hallucination and oversight patterns into Structured Reflection Notes that steer LVLMs toward more factual and complete image captions, reaching the Pareto frontier on factuality-coverage trade-offs.
-
Mitigating Object Hallucinations via Sentence-Level Early Intervention
SENTINEL reduces MLLM object hallucinations by over 90% via sentence-level early intervention with detector-bootstrapped preference data and C-DPO loss, outperforming prior SOTA on hallucination and capability benchmarks.
-
Uncertainty-Aware Exploratory Direct Preference Optimization for Multimodal Large Language Models
UE-DPO quantifies epistemic uncertainty from grounding failures to direct more learning pressure on hard visual tokens in preferred samples while easing penalties on dispreferred ones.
-
Mitigating Hallucinations in Large Vision-Language Models without Performance Degradation
MPD reduces hallucinations in LVLMs by 23.4% while retaining 97.4% of general capability through semantic disentanglement and selective parameter updates.
-
VCE: A zero-cost hallucination mitigation method of LVLMs via visual contrastive editing
VCE mitigates object hallucination in LVLMs by decomposing activation patterns from contrastive visual inputs via SVD to suppress hallucination subspaces through targeted parameter edits.
-
Revisit What You See: Revealing Visual Semantics in Vision Tokens to Guide LVLM Decoding
ReVisiT refines LVLM output distributions during decoding by projecting selected vision tokens into text space via context-aware constrained divergence minimization.
-
Mitigating Hallucination in Large Vision-Language Models via Adaptive Attention Calibration
CAAC mitigates hallucinations in LVLMs via Visual-Token Calibration and Adaptive Attention Re-Scaling guided by model confidence, showing gains on CHAIR, AMBER, and POPE especially in long-form generation.
-
Hallucination of Multimodal Large Language Models: A Survey
The survey organizes causes of hallucinations in MLLMs, reviews evaluation benchmarks and metrics, and outlines mitigation approaches plus open questions.
-
Aligning Modalities in Vision Large Language Models via Preference Fine-tuning
POVID generates AI-created preference data to fine-tune vision-language models with DPO, reducing hallucinations and improving benchmark scores.
-
Agent AI: Surveying the Horizons of Multimodal Interaction
The paper defines Agent AI as interactive multimodal systems that perceive grounded data and generate embodied actions, arguing this approach can mitigate hallucinations in foundation models.
-
A Survey on Multimodal Large Language Models
This survey organizes the architectures, training strategies, data, evaluation methods, extensions, and challenges of Multimodal Large Language Models.
-
Sora: A Review on Background, Technology, Limitations, and Opportunities of Large Vision Models
The paper reviews the background, technology, applications, limitations, and future directions of OpenAI's Sora text-to-video generative model based on public information.