pith. machine review for the scientific record.

arXiv: 2601.13707 · v2 · submitted 2026-01-20 · cs.CV · cs.AI · cs.LG

Recognition: unknown

Attention-space Contrastive Guidance for Efficient Hallucination Mitigation in LVLMs

Authors on Pith: no claims yet
classification: cs.CV · cs.AI · cs.LG
keywords: contrastive · guidance · visually · attention-space · efficient · grounded · hallucination · lvlms
Abstract

Hallucinations in large vision--language models (LVLMs) often arise when language priors dominate over visual evidence, leading to object misidentification and visually inconsistent descriptions. We address this problem by framing hallucination mitigation as contrastive guidance that steers generation toward visually grounded and semantically faithful text. We propose Attention-space Contrastive Guidance (ACG), a training-free, single-pass method that operates directly in self-attention layers, where hallucination-inducing cross-modal biases emerge. ACG constructs both image-conditioned and approximate text-only attention paths within a single forward pass, enabling efficient guidance before errors accumulate at the output layer. Because this masking-based surrogate can introduce approximation bias, we further apply a lightweight orthogonal projection that suppresses components aligned with the text-only path, yielding a more visually grounded correction. Experiments on CHAIR and POPE show that ACG improves faithfulness over existing training-free baselines while maintaining caption quality, reducing latency by up to $2\times$ compared to multi-pass contrastive decoding methods.
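The abstract describes ACG's correction step: take the contrastive direction between the image-conditioned and approximate text-only attention paths, then orthogonally project out the component aligned with the text-only path before applying the guidance. The paper's actual implementation is not shown here; the sketch below is a hypothetical NumPy illustration of that projection idea, with the function name `acg_correction`, the guidance scale `gamma`, and the single-vector formulation all being assumptions for clarity.

```python
import numpy as np

def acg_correction(attn_img, attn_txt, gamma=1.0):
    """Hypothetical sketch of ACG-style guidance on one attention output.

    attn_img: attention output from the image-conditioned path
    attn_txt: attention output from the masked, approximate text-only path
    gamma:    assumed guidance strength (not specified in the abstract)
    """
    # Contrastive direction: push away from the text-only surrogate.
    delta = attn_img - attn_txt
    # Orthogonal projection: remove the component of the correction that is
    # aligned with the text-only path, suppressing the surrogate's
    # approximation bias (the small epsilon guards against a zero vector).
    txt_norm_sq = np.dot(attn_txt, attn_txt) + 1e-8
    delta_perp = delta - (np.dot(delta, attn_txt) / txt_norm_sq) * attn_txt
    # Apply the visually grounded correction to the image-conditioned path.
    return attn_img + gamma * delta_perp
```

Because both paths come from the same forward pass, a step like this adds only a few vector operations per layer, which is consistent with the abstract's claim of avoiding the extra full passes used by multi-pass contrastive decoding.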

This paper has not been read by Pith yet.

discussion (0)


Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Decoding by Perturbation: Mitigating MLLM Hallucinations via Dynamic Textual Perturbation

    cs.CL · 2026-04 · unverdicted · novelty 7.0

    DeP mitigates MLLM hallucinations by dynamically perturbing text prompts to identify and reinforce stable visual evidence regions while counteracting language prior biases using attention variance and logit statistics.