pith. machine review for the scientific record.

arxiv: 2505.21472 · v2 · submitted 2025-05-27 · 💻 cs.CV · cs.CL · cs.MM

Recognition: unknown

Mitigating Hallucination in Large Vision-Language Models via Adaptive Attention Calibration

Authors on Pith: no claims yet
classification 💻 cs.CV · cs.CL · cs.MM
keywords: attention, visual, CAAC, calibration, hallucination, across, adaptive, bias
read the original abstract

Large vision-language models (LVLMs) achieve impressive performance on multimodal tasks but often suffer from hallucination, confidently describing objects or attributes not present in the image. Current training-free interventions struggle to maintain accuracy in open-ended and long-form generation scenarios. We introduce the Confidence-Aware Attention Calibration (CAAC) framework to address this challenge by targeting two key biases: spatial perception bias, which distributes attention disproportionately across image tokens, and modality bias, which shifts focus from visual to textual inputs over time. CAAC employs a two-step approach: Visual-Token Calibration (VTC) to balance attention across visual tokens, and Adaptive Attention Re-Scaling (AAR) to reinforce visual grounding guided by the model's confidence. This confidence-driven adjustment ensures consistent visual alignment during generation. Experiments on the CHAIR, AMBER, and POPE benchmarks demonstrate that CAAC outperforms baselines, particularly in long-form generation, effectively reducing hallucination.
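The abstract's two-step idea can be illustrated with a minimal sketch. This is not the authors' implementation; the blending weight, the `alpha` parameter, and the function name `calibrate_attention` are all hypothetical, chosen only to show the shape of the calibration: first smooth attention across visual tokens (the VTC step), then up-weight visual tokens when model confidence is low (the AAR step).

```python
import numpy as np

def calibrate_attention(attn, visual_idx, confidence, alpha=1.0):
    """Hypothetical sketch of CAAC-style calibration (not the authors' code).

    attn:       1-D attention distribution over all input tokens (sums to 1).
    visual_idx: indices of the image tokens within that distribution.
    confidence: model confidence in [0, 1]; lower confidence triggers
                stronger re-scaling toward visual tokens.
    alpha:      assumed strength of the confidence-driven re-scaling.
    """
    attn = np.asarray(attn, dtype=float)
    out = attn.copy()
    visual_idx = np.asarray(visual_idx)

    # Step 1 (VTC-like): counter spatial perception bias by blending each
    # visual token's weight halfway toward the visual-token mean.
    visual_mass = attn[visual_idx].sum()
    mean_weight = visual_mass / len(visual_idx)
    out[visual_idx] = 0.5 * attn[visual_idx] + 0.5 * mean_weight

    # Step 2 (AAR-like): counter modality bias by scaling visual attention
    # up as confidence drops, reinforcing visual grounding.
    scale = 1.0 + alpha * (1.0 - confidence)
    out[visual_idx] *= scale

    # Renormalize so the result is a valid attention distribution.
    return out / out.sum()
```

With low confidence the visual tokens end up holding a larger share of the total attention mass than before calibration, which is the qualitative effect the paper claims for long-form generation, where attention drifts toward previously generated text.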

This paper has not been read by Pith yet.

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read Pith papers without signing in.

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Combating Visual Neglect and Semantic Drift in Large Multimodal Models for Enhanced Cross-Modal Retrieval

    cs.CV · 2026-04 · unverdicted · novelty 5.0

    SSA-ME uses saliency-aware modeling to reduce visual neglect and semantic drift, achieving SOTA results on the MMEB benchmark for multimodal retrieval.