DataComp-VLM benchmark shows instruction-heavy data mixing outperforms filtering for VLM training, with DCVLM-Baseline achieving 63.6% on 33 tasks for 8B models (+5.4pp over FineVision).
AMBER: An LLM-free Multi-dimensional Benchmark for MLLMs Hallucination Evaluation
Mixed citation behavior. Most common role is background (67%).
abstract
Despite making significant progress in multi-modal tasks, current Multi-modal Large Language Models (MLLMs) encounter the significant challenge of hallucinations, which may lead to harmful consequences. Therefore, evaluating MLLMs' hallucinations is becoming increasingly important in model improvement and practical application deployment. Previous works are limited in high evaluation costs (e.g., relying on humans or advanced LLMs) and insufficient evaluation dimensions (e.g., types of tasks and hallucinations). In this paper, we propose an LLM-free multi-dimensional benchmark AMBER, which can be used to evaluate both generative task and discriminative task including existence, attribute and relation hallucination. Based on AMBER, we design a low-cost and efficient evaluation pipeline. Additionally, we conduct a comprehensive evaluation and detailed analysis of mainstream MLLMs including GPT-4V(ision), and also give guideline suggestions for mitigating hallucinations. The data and code of AMBER are available at https://github.com/junyangwang0410/AMBER.
representative citing papers
AVID is the first large-scale benchmark for audio-visual inconsistency detection, grounding, classification, and reasoning in long videos, constructed via agent-driven methods and showing that state-of-the-art models struggle while a fine-tuned baseline improves performance.
MoHallBench is a new benchmark evaluating motion hallucination in VideoLLMs from co-occurrence priors, sequential inference, and similarity confusion, revealing decoupling from action recognition performance.
P²-DPO generates on-policy preference pairs targeting focus-and-enhance perception and visual robustness, combined with a calibration loss, to reduce hallucinations in LVLMs more effectively than human-feedback baselines.
MM-Snowball benchmark diagnoses hallucination snowballing in multi-turn MLLM dialogues; CAVR mitigates it via dual visual rectification at representation and logit levels.
YARD is a training-free method using Y-shaped decoder architecture and register tokens to improve contrastive decoding for hallucination reduction in LVLMs with lower latency.
Introduces Abstraction Gap metric and CAGE benchmark showing seven of eight VLMs have large gaps between text plausibility and chain-based causal reasoning, with one model succeeding.
CAS mitigates object hallucinations in MLLMs by extracting two context preference vectors from designed conflict samples and applying signed residual injection at mid-early MLP layers without retraining or added latency.
SIRA mitigates hallucinations in LVLMs by internally contrasting full visual access against a masked late-layer branch that retains shared context but lacks fine-grained visual evidence.
OxyEcomBench is a unified multimodal benchmark covering 6 capability areas and 29 tasks with authentic e-commerce data to measure how well foundation models handle real platform, merchant, and customer challenges.
DO-Bench is a controlled benchmark that attributes VLM object hallucination errors to textual prior pressure, perceptual limits, or their interaction via two diagnostic dimensions and metrics.
ZINA detects fine-grained hallucinations in MLLM outputs, classifies errors into six types, and proposes edits, outperforming GPT-4o and Llama-3.2 on the new VisionHall dataset of annotated and synthetic samples.
VidHal is a new benchmark that evaluates VLLM temporal hallucinations through a caption ordering task on videos with varying hallucination levels.
ADAPT reduces MLLM hallucinations 40-60% by aligning cross-attention dynamics via visual anchors, supervised inference, and preference tuning while preserving general capabilities.
OPPO is an evidence-aware preference optimization objective that contrasts faithful responses under varying visual evidence strengths to reduce hallucinations in MLLMs.
A counterfactual visual grounding uncertainty method detects hallucinations in LVLMs on medical images, improving over baselines with interpretable evidence and cross-model transfer.
ViPSy constructs policy-aligned and visually grounded preference pairs for VLMs via visual cues from image variants, yielding SOTA hallucination reductions of 35.7% on AMBER and 24.5% on Object HalBench.
QK Product Steering suppresses dominant singular modes in the per-head QK product of selected middle layers via a closed-form query-only update, yielding 4.0% average relative CHAIR_s reduction on three GQA VLMs.
OmniHalluc-L benchmark shows open-weight omni models at 32-41% strict-pair accuracy on long-form hallucination, raised to 36-51% by Modality-Perturbation Reliability Calibration that fuses audio-negative probe shifts with native confidence.
TLVS mitigates hallucinations in LVLMs via token-level extraction and visual-sensitivity-adaptive steering applied only at critical decoding steps.
IC-VCO places contrastive images in one context for a consistent DPO-style objective, adds Visual Contrast Distillation, and uses semantic perturbation for hard negatives, reporting best results on five benchmarks.
RC-DPO adds a CoT-conditioned preference term to DPO and pairs it with MCTS-based positive CoT generation plus attention-guided pruning for negatives, yielding lower hallucination rates on multimodal benchmarks.
New benchmark DRBench and four-stage supervision framework DRScaffold improve dense-scene reasoning in lightweight VLMs, with a 3B model surpassing a frozen 32B model on the benchmark while maintaining general performance.
AOD isolates hallucination signals in LVLM representations with an adversarial minimax objective and uses dual-forward contrastive decoding to reduce hallucinations while preserving utility.
citing papers explorer
- Reasoning Matters: Mitigate Hallucination in Multimodal Large Reasoning Models via Reasoning-Conditioned Preference Optimization
  RC-DPO adds a CoT-conditioned preference term to DPO and pairs it with MCTS-based positive CoT generation plus attention-guided pruning for negatives, yielding lower hallucination rates on multimodal benchmarks.
- How do Humans Process AI-generated Hallucination Contents: a Neuroimaging Study
  EEG study reveals distinct ERP patterns for AI hallucinations, with misjudged ones failing to trigger standard neurocognitive verification pathways.
- Steering the Verifiability of Multimodal AI Hallucinations
  Researchers create a human-labeled dataset of obvious and elusive multimodal hallucinations and use learned activation-space probes to control their verifiability in MLLMs.