Global Context or Local Detail? Adaptive Visual Grounding for Hallucination Mitigation

2026 · cs.CV · arXiv 2604.24396

1 Pith paper cites this work. Polarity classification is still being indexed.


abstract

Vision-Language Models (VLMs) are frequently undermined by object hallucination (generating content that contradicts visual reality) due to an over-reliance on linguistic priors. We introduce Positive-and-Negative Decoding (PND), a training-free inference framework that intervenes directly in the decoding process to enforce visual fidelity. PND is motivated by our key finding of a critical attention deficit in VLMs, where visual features are empirically under-weighted. Our framework corrects this via a dual-path contrast. The positive path amplifies salient visual evidence using multi-layer attention to encourage faithful descriptions, directly counteracting the attention deficit. Simultaneously, the negative path identifies and degrades the core object's features to create a strong counterfactual, which penalizes ungrounded, prior-dominant generation. By contrasting the model's outputs from these two paths at each decoding step, PND steers generation towards text that is not just linguistically probable but also visually factual. Extensive experiments on benchmarks such as POPE, MME, and CHAIR show that PND achieves state-of-the-art performance with up to 6.5% accuracy improvement, substantially reducing object hallucination while also enhancing descriptive detail, all without requiring any model retraining. The method generalizes effectively across diverse VLM architectures including LLaVA, InstructBLIP, InternVL, and Qwen-VL.
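
The abstract does not state the exact rule PND uses to combine the two paths, so the following is only a minimal sketch of one dual-path contrast step. It assumes the widely used contrastive-decoding form, (1 + alpha) * log p_pos - alpha * log p_neg, together with a plausibility cutoff on the positive path. The function name `contrast_step` and the parameters `alpha` and `beta` are hypothetical and are not taken from the paper.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over a list of floats."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]


def contrast_step(pos_logits, neg_logits, alpha=1.0, beta=0.1):
    """Pick the next token index from positive-path and negative-path logits.

    Assumed combination rule (not confirmed by the source):
        score = (1 + alpha) * log p_pos - alpha * log p_neg
    Tokens whose positive-path probability is below beta * max probability
    are masked, so the contrast cannot promote implausible tokens.
    """
    pos = log_softmax(pos_logits)
    neg = log_softmax(neg_logits)
    cutoff = max(pos) + math.log(beta)  # plausibility threshold in log space
    scores = [
        (1 + alpha) * p - alpha * n if p >= cutoff else float("-inf")
        for p, n in zip(pos, neg)
    ]
    return max(range(len(scores)), key=scores.__getitem__)


# Toy vocabulary: index 0 = "cat", 1 = "dog", 2 = "table".
# "dog" stays likely even when the core object is degraded (negative path),
# which marks it as prior-driven; the contrast therefore favors "cat".
vocab = ["cat", "dog", "table"]
pos_logits = [2.0, 2.2, 0.0]   # logits with amplified visual evidence
neg_logits = [0.5, 2.5, 0.0]   # logits with the core object degraded
chosen = contrast_step(pos_logits, neg_logits)
print(vocab[chosen])
```

In this toy case the positive path alone would pick "dog", while the contrasted score picks "cat", because "dog" remains probable even after the visual evidence for the object has been removed. Setting `alpha=0` recovers plain greedy decoding on the positive path.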

representative citing papers

Seeing Before Agreeing: Aligning Multi-Agent Consensus with Visual Evidence

cs.CV · 2026-05-29 · unverdicted · novelty 7.0

EAGLE is a new evidence-aligned framework that improves multi-agent VQA by enforcing consistency in visual grounding across agents, achieving best average performance on six benchmarks.

