hub Mixed citations

AMBER: An LLM-free Multi-dimensional Benchmark for MLLMs Hallucination Evaluation

Junyang Wang, Yuhang Wang, Guohai Xu, Jing Zhang, Yukai Gu, Haitao Jia · 2023 · cs.CL · arXiv 2311.07397

Mixed citation behavior. Most common role is background (67%).

57 Pith papers citing it

Background 67% of classified citations

open full Pith review browse 57 citing papers arXiv PDF

abstract

Despite making significant progress in multi-modal tasks, current Multi-modal Large Language Models (MLLMs) encounter the significant challenge of hallucinations, which may lead to harmful consequences. Therefore, evaluating MLLMs' hallucinations is becoming increasingly important in model improvement and practical application deployment. Previous works are limited in high evaluation costs (e.g., relying on humans or advanced LLMs) and insufficient evaluation dimensions (e.g., types of tasks and hallucinations). In this paper, we propose an LLM-free multi-dimensional benchmark AMBER, which can be used to evaluate both generative task and discriminative task including existence, attribute and relation hallucination. Based on AMBER, we design a low-cost and efficient evaluation pipeline. Additionally, we conduct a comprehensive evaluation and detailed analysis of mainstream MLLMs including GPT-4V(ision), and also give guideline suggestions for mitigating hallucinations. The data and code of AMBER are available at https://github.com/junyangwang0410/AMBER.

hub tools

JSON dossier citing papers JSON arXiv source

citation-role summary

background 4 baseline 1 dataset 1

citation-polarity summary

background 4 baseline 1 use dataset 1

representative citing papers

DataComp-VLM: Improved Open Datasets for Vision-Language Models

cs.CV · 2026-06-26 · conditional · novelty 8.0 · 2 refs

DataComp-VLM benchmark shows instruction-heavy data mixing outperforms filtering for VLM training, with DCVLM-Baseline achieving 63.6% on 33 tasks for 8B models (+5.4pp over FineVision).

AVID: A Benchmark for Omni-Modal Audio-Visual Inconsistency Understanding via Agent-Driven Construction

cs.MM · 2026-04-15 · unverdicted · novelty 8.0

AVID is the first large-scale benchmark for audio-visual inconsistency detection, grounding, classification, and reasoning in long videos, constructed via agent-driven methods and showing that state-of-the-art models struggle while a fine-tuned baseline improves performance.

MoHallBench: A Benchmark for Motion Hallucination in Video Large Language Models

cs.CV · 2026-07-01 · unverdicted · novelty 7.0

MoHallBench is a new benchmark evaluating motion hallucination in VideoLLMs from co-occurrence priors, sequential inference, and similarity confusion, revealing decoupling from action recognition performance.

Mind the Heads: Topological Representation Alignment for Multimodal LLMs

cs.CV · 2026-06-22 · conditional · novelty 7.0

HeRA aligns least-aligned attention heads in MLLMs using an MKNN-based contrastive objective to preserve cross-modal topological structure, yielding gains on vision-centric tasks and reduced hallucinations across 18 benchmarks.

P$^2$-DPO: Grounding Hallucination in Perceptual Processing via Calibration Direct Preference Optimization

cs.CV · 2026-06-02 · unverdicted · novelty 7.0

P²-DPO generates on-policy preference pairs targeting focus-and-enhance perception and visual robustness, combined with a calibration loss, to reduce hallucinations in LVLMs more effectively than human-feedback baselines.

MM-Snowball: Evaluating and Mitigating Hallucination Snowballing in Multimodal Multi-Turn Dialogue

cs.CV · 2026-05-30 · unverdicted · novelty 7.0

MM-Snowball benchmark diagnoses hallucination snowballing in multi-turn MLLM dialogues; CAVR mitigates it via dual visual rectification at representation and logit levels.

YARD: Y-Architecture Register Decoding for Efficient Hallucination Mitigation in Large Vision-Language Models

cs.CV · 2026-05-29 · unverdicted · novelty 7.0

YARD is a training-free method using Y-shaped decoder architecture and register tokens to improve contrastive decoding for hallucination reduction in LVLMs with lower latency.

The Abstraction Gap in Vision-Language Causal Reasoning

cs.CL · 2026-05-27 · unverdicted · novelty 7.0

Introduces Abstraction Gap metric and CAGE benchmark showing seven of eight VLMs have large gaps between text plausibility and chain-based causal reasoning, with one model succeeding.

Rethinking Visual Neglect: Steering via Context-Preference for MLLM Hallucination Mitigation

cs.CL · 2026-05-27 · unverdicted · novelty 7.0

CAS mitigates object hallucinations in MLLMs by extracting two context preference vectors from designed conflict samples and applying signed residual injection at mid-early MLP layers without retraining or added latency.

Do We Really Need External Tools to Mitigate Hallucinations? SIRA: Shared-Prefix Internal Reconstruction of Attribution

cs.CV · 2026-05-14 · conditional · novelty 7.0

SIRA mitigates hallucinations in LVLMs by internally contrasting full visual access against a masked late-layer branch that retains shared context but lacks fine-grained visual evidence.

OxyEcomBench: Benchmarking Multimodal Foundation Models across E-Commerce Ecosystems

cs.DB · 2026-05-13 · conditional · novelty 7.0

OxyEcomBench is a unified multimodal benchmark covering 6 capability areas and 29 tasks with authentic e-commerce data to measure how well foundation models handle real platform, merchant, and customer challenges.

DO-Bench: An Attributable Benchmark for Diagnosing Object Hallucination in Vision-Language Models

cs.CV · 2026-04-18 · unverdicted · novelty 7.0

DO-Bench is a controlled benchmark that attributes VLM object hallucination errors to textual prior pressure, perceptual limits, or their interaction via two diagnostic dimensions and metrics.

ZINA: Multimodal Fine-grained Hallucination Detection and Editing

cs.CV · 2025-06-16 · unverdicted · novelty 7.0

ZINA detects fine-grained hallucinations in MLLM outputs, classifies errors into six types, and proposes edits, outperforming GPT-4o and Llama-3.2 on the new VisionHall dataset of annotated and synthetic samples.

VidHal: Benchmarking Temporal Hallucinations in Vision LLMs

cs.CV · 2024-11-25 · unverdicted · novelty 7.0

VidHal is a new benchmark that evaluates VLLM temporal hallucinations through a caption ordering task on videos with varying hallucination levels.

ADAPT: Attention Dynamics Alignment with Preference Tuning for Faithful MLLMs

cs.CV · 2026-06-30 · unverdicted · novelty 6.0

ADAPT reduces MLLM hallucinations 40-60% by aligning cross-attention dynamics via visual anchors, supervised inference, and preference tuning while preserving general capabilities.

Clearer Sight, Fewer Lies: Oriented Pickup Preference Optimization for Multimodal Hallucination Mitigation

cs.CV · 2026-06-29 · unverdicted · novelty 6.0 · 2 refs

OPPO is an evidence-aware preference optimization objective that contrasts faithful responses under varying visual evidence strengths to reduce hallucinations in MLLMs.

Detecting Clinical Hallucinations in LVLMs via Counterfactual Visual Grounding Uncertainty

cs.CV · 2026-06-26 · unverdicted · novelty 6.0

A counterfactual visual grounding uncertainty method detects hallucinations in LVLMs on medical images, improving over baselines with interpretable evidence and cross-model transfer.

GAVEL: Grounded Caption Error Verification and Localization

cs.CL · 2026-06-25 · unverdicted · novelty 6.0

GAVEL introduces a joint task, dataset, and benchmark for verifying, explaining, and localizing caption-image misalignments, with a supervised baseline that improves grounding and explanation metrics over strong closed-source models.

Vision-driven Preference Synthesis for Mitigating Hallucinations in VLMs

cs.CV · 2026-06-24 · unverdicted · novelty 6.0

ViPSy constructs policy-aligned and visually grounded preference pairs for VLMs via visual cues from image variants, yielding SOTA hallucination reductions of 35.7% on AMBER and 24.5% on Object HalBench.

Spectral Query-Key Product Weight Steering for Training-Free VLM Hallucination Mitigation

cs.CV · 2026-06-18 · unverdicted · novelty 6.0

QK Product Steering suppresses dominant singular modes in the per-head QK product of selected middle layers via a closed-form query-only update, yielding 4.0% average relative CHAIR_s reduction on three GQA VLMs.

OmniHalluc-L: Counterfactual Benchmarking and Modality-Perturbation Reliability Calibration for Long-Form Omni Hallucination

cs.MM · 2026-06-02 · unverdicted · novelty 6.0

OmniHalluc-L benchmark shows open-weight omni models at 32-41% strict-pair accuracy on long-form hallucination, raised to 36-51% by Modality-Perturbation Reliability Calibration that fuses audio-negative probe shifts with native confidence.

Steer Where It Matters: Token-Level Visual-Sensitivity Steering for LVLMs Hallucination Mitigation

cs.CV · 2026-06-02 · unverdicted · novelty 6.0

TLVS mitigates hallucinations in LVLMs via token-level extraction and visual-sensitivity-adaptive steering applied only at critical decoding steps.

Learning from Fine-Grained Visual Discrepancies: Mitigating Multimodal Hallucinations via In-Context Visual Contrastive Optimization

cs.CV · 2026-05-29 · unverdicted · novelty 6.0

IC-VCO places contrastive images in one context for a consistent DPO-style objective, adds Visual Contrast Distillation, and uses semantic perturbation for hard negatives, reporting best results on five benchmarks.

Reasoning Matters: Mitigate Hallucination in Multimodal Large Reasoning Models via Reasoning-Conditioned Preference Optimization

cs.AI · 2026-05-27 · unverdicted · novelty 6.0

RC-DPO adds a CoT-conditioned preference term to DPO and pairs it with MCTS-based positive CoT generation plus attention-guided pruning for negatives, yielding lower hallucination rates on multimodal benchmarks.

citing papers explorer

Showing 3 of 3 citing papers after filters.

Reasoning Matters: Mitigate Hallucination in Multimodal Large Reasoning Models via Reasoning-Conditioned Preference Optimization cs.AI · 2026-05-27 · unverdicted · none · ref 4 · internal anchor
RC-DPO adds a CoT-conditioned preference term to DPO and pairs it with MCTS-based positive CoT generation plus attention-guided pruning for negatives, yielding lower hallucination rates on multimodal benchmarks.
How do Humans Process AI-generated Hallucination Contents: a Neuroimaging Study cs.AI · 2026-05-16 · unverdicted · none · ref 11 · 2 links · internal anchor
EEG study reveals distinct ERP patterns for AI hallucinations, with misjudged ones failing to trigger standard neurocognitive verification pathways.
Steering the Verifiability of Multimodal AI Hallucinations cs.AI · 2026-04-08 · unverdicted · none · ref 35 · internal anchor
Researchers create a human-labeled dataset of obvious and elusive multimodal hallucinations and use learned activation-space probes to control their verifiability in MLLMs.

AMBER: An LLM-free Multi-dimensional Benchmark for MLLMs Hallucination Evaluation

hub tools

citation-role summary

citation-polarity summary

fields

years

verdicts

roles

polarities

representative citing papers

citing papers explorer