pith. machine review for the scientific record.

arxiv: 2605.01733 · v1 · submitted 2026-05-03 · 💻 cs.CV · cs.AI

Recognition: unknown

GEASS: Training-Free Caption Steering for Hallucination Mitigation in Vision-Language Models

Authors on Pith · no claims yet

Pith reviewed 2026-05-10 15:30 UTC · model grok-4.3

classification 💻 cs.CV cs.AI
keywords hallucination mitigation · vision-language models · training-free methods · caption steering · object hallucination · POPE benchmark · HallusionBench · inference-time correction

The pith

GEASS lets vision-language models decide per query how much of a self-generated caption to trust, cutting hallucinations.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Vision-language models generate captions that can anchor their answers and reasoning paths, but caption errors are asymmetric: omissions vastly outnumber fabrications, yet each fabrication does outsized harm. The paper shows that treating every caption as helpful is counterproductive and that usefulness is a per-query property rather than a uniform one. GEASS therefore gates caption consumption by the clean path's confidence, weights it by the entropy reduction it causes, and tightens the evidence requirement when the two paths disagree. This training-free intervention improves accuracy over both standard inference and contrastive decoding on POPE and HallusionBench across four different VLMs while requiring only two extra forward passes per query.

Core claim

GEASS is a training-free module that decides on each query how much of the caption the model consumes: it gates the caption by the clean path's confidence, weights it by the entropy reduction it produces, and raises the evidence bar when the two pathways disagree. Experiments on POPE and HallusionBench across four VLMs show that GEASS consistently improves over vanilla inference and contrastive decoding, with only two extra forward passes per query.

What carries the argument

Gated Evidence-Aware Selective Steering (GEASS), a per-query mechanism that combines clean-path confidence gating, entropy-reduction weighting, and disagreement-based evidence raising to control how much caption information enters the model's reasoning.
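As a concrete illustration of that mechanism, the three signals can be combined in a few lines. The paper's exact equations are not reproduced on this page, so the thresholds (`tau_conf`, `margin`), the entropy-ratio weight, and the linear logit interpolation below are illustrative assumptions, not GEASS's actual formulas:

```python
import math

def softmax(logits):
    """Convert a logit vector to probabilities."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def entropy(probs):
    """Shannon entropy in nats."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def geass_fuse(z_clean, z_cap, tau_conf=0.9, margin=0.5):
    """Illustrative per-query fusion of clean and caption-conditioned
    answer logits. Placeholder functional forms, not the paper's equations."""
    p_clean, p_cap = softmax(z_clean), softmax(z_cap)

    # 1. Confidence gate: a confident clean path ignores the caption.
    if max(p_clean) >= tau_conf:
        return list(z_clean)

    # 2. Entropy-reduction weight: trust the caption in proportion to
    #    how much it sharpens the answer distribution (clipped at 0).
    h_clean, h_cap = entropy(p_clean), entropy(p_cap)
    w = max(0.0, (h_clean - h_cap) / (h_clean + 1e-9))

    # 3. Raised evidence bar: shrink the weight when the two paths
    #    disagree on the argmax answer.
    if p_clean.index(max(p_clean)) != p_cap.index(max(p_cap)):
        w = max(0.0, w - margin)

    return [zc + w * (zp - zc) for zc, zp in zip(z_clean, z_cap)]
```

On this toy version, a confident clean path returns its own logits untouched, an uncertain clean path with a sharpening, agreeing caption shifts toward the caption-conditioned logits, and disagreement suppresses the shift; the real module operates on full VLM output distributions.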

If this is right

  • Caption usefulness must be treated as a query-specific rather than corpus-wide property.
  • Hallucination mitigation is possible at inference time without any model training or additional data.
  • Only two extra forward passes are needed to achieve measurable gains on POPE and HallusionBench.
  • The method applies across multiple vision-language models without architecture-specific changes.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same gating logic could be applied to other self-generated intermediates such as chain-of-thought steps to limit error propagation.
  • Inference-time selection mechanisms may offer a lightweight complement to existing decoding strategies in multimodal settings.
  • Testing the approach on open-ended generation tasks beyond the two benchmarks would clarify whether the per-query adaptation generalizes.

Load-bearing premise

The combination of clean-path confidence and entropy reduction can reliably identify when and how much caption content is useful without discarding beneficial information or introducing new selection bias on a per-query basis.

What would settle it

On a new VLM or benchmark, if GEASS produces lower accuracy than vanilla inference or contrastive decoding, the claim that selective steering consistently mitigates hallucinations would be refuted.

Figures

Figures reproduced from arXiv: 2605.01733 by Jiashen Ding, Shuoyang Zhang, Zeshang Li.

Figure 1
Figure 1. Caption anchoring effect observed on Qwen2.5-VL-3B with chain-of-thought reasoning. The model’s output (left, in red) closely mirrors the phrasing of the embedded caption (right, in red), demonstrating that captions reshape not only the final answer but the model’s entire reasoning trajectory.
Figure 2
Figure 2. Left: Objects contained in the caption generated by InternVL2-8B. Right: Salient objects that are clearly visible in the image but not mentioned in the caption.
Figure 3
Figure 3. Top: With a correct caption mentioning “a dog sitting on the beach,” both Qwen2.5-VL-3B and InternVL2-8B revise their initially incorrect answers from No to Yes, demonstrating confidence grounding. Bottom: With a wrong caption mentioning “a cat sitting on the beach,” both models similarly flip to Yes and fabricate supporting details, demonstrating hallucination amplification. The same anchoring mechanism …
Figure 4
Figure 4. Asymmetric per-instance impact of caption errors on Qwen2.5-VL-3B (100 image–question instances): fabrication shifts predictions sharply (∆p = 0.64, 87% flips), while omission is mild on average (∆p = 0.13) but its long tail still flips 11% of answers. ∆p is the caption-induced shift toward the wrong answer; answers flip above the shaded threshold (∆p > 0.4). Inner boxes mark median and IQR; diamonds mark …
Figure 5
Figure 5. Overview of the GEASS pipeline. Given an image I and a question Q, the model first generates a caption C via self-captioning (Stage 1). Two parallel forward passes through the same VLM with shared parameters produce logit vectors z_clean (conditioned on I, Q) and z_cap (conditioned on I, Q, C) (Stage 2). The adaptive fusion module (Stage 3) computes a confidence gate α that assesses whether the model needs h…
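The ∆p analysis behind Figure 4 is easy to reproduce given paired model outputs. A minimal sketch, assuming `p_correct_clean` and `p_correct_cap` are the model's probabilities of the correct answer without and with the caption; the function names are hypothetical, and the 0.4 flip threshold follows the figure caption:

```python
def delta_p(p_correct_clean, p_correct_cap):
    """Caption-induced shift toward the wrong answer (Figure 4's ∆p)."""
    return p_correct_clean - p_correct_cap

def flip_rate(pairs, threshold=0.4):
    """Fraction of paired instances whose shift exceeds the flip threshold."""
    flips = sum(1 for clean, cap in pairs if delta_p(clean, cap) > threshold)
    return flips / len(pairs)
```

Averaging `delta_p` over fabrication-affected versus omission-affected instances would reproduce the asymmetry the figure reports (0.64 vs. 0.13).
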
original abstract

Vision-Language Models (VLMs) excel at grounded reasoning but remain prone to object hallucination. Recent work treats self-generated captions as a uniformly positive resource, yet we find that naively embedding one can degrade rather than help--dropping Qwen2.5-VL-3B accuracy on HallusionBench by nearly 10 points. Two structural properties explain this. First, captions anchor not only the model's final answer but also its reasoning trajectory and lexical choices. Second, caption errors are asymmetric: omissions vastly outnumber fabrications, yet each fabrication carries a much larger per-instance impact. A caption's usefulness is therefore a per-query property, not a per-corpus one. We propose GEASS (Gated Evidence-Aware Selective Steering), a training-free module that decides on each query how much of the caption the model consumes: it gates the caption by the clean path's confidence, weights it by the entropy reduction it produces, and raises the evidence bar when the two pathways disagree. Experiments on POPE and HallusionBench across four VLMs show that GEASS consistently improves over vanilla inference and contrastive decoding, with only two extra forward passes per query.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper claims that naively embedding self-generated captions can degrade VLM performance on tasks like HallusionBench due to anchoring effects and asymmetric error distributions (omissions vs. fabrications), and introduces GEASS, a training-free module that gates and weights caption consumption per query using the clean inference path's confidence, entropy reduction, and disagreement signals. Experiments across POPE and HallusionBench on four VLMs report consistent gains over vanilla inference and contrastive decoding at the cost of two extra forward passes.

Significance. If the per-query selection signals prove reliable, GEASS would offer a lightweight, training-free intervention for hallucination mitigation that avoids the pitfalls of uniform caption use, with potential for easy integration into existing VLMs without retraining or extra data.

major comments (2)
  1. [Abstract] The load-bearing assumption that clean-path confidence combined with entropy reduction reliably decides caption usefulness per query is not demonstrated for cases where the clean path itself hallucinates, despite the abstract's discussion of error asymmetry; this leaves the gating mechanism vulnerable to the exact per-query bias it seeks to avoid.
  2. [Experiments] The claims of consistent improvements lack error bars, statistical significance tests, and details on query selection and caption generation, preventing full verification of the empirical results on POPE and HallusionBench across the four models.
minor comments (1)
  1. Provide more precise implementation details on how the gating and weighting are applied during inference to aid reproducibility.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We address each major comment below and describe the revisions we will incorporate.

point-by-point responses
  1. Referee: [Abstract] The load-bearing assumption that clean-path confidence combined with entropy reduction reliably decides caption usefulness per query is not demonstrated for cases where the clean path itself hallucinates, despite the abstract's discussion of error asymmetry; this leaves the gating mechanism vulnerable to the exact per-query bias it seeks to avoid.

    Authors: We agree that the abstract's discussion of error asymmetry does not explicitly demonstrate the gating behavior on queries where the clean path hallucinates. The current design raises the evidence bar on pathway disagreement, which is intended to protect against clean-path errors, but we acknowledge this requires direct validation. In the revised manuscript we will add a dedicated analysis subsection that isolates queries where the clean inference hallucinates (identified via ground-truth mismatch on POPE and HallusionBench), reports the distribution of gating decisions, and quantifies how the disagreement signal alters caption consumption in those cases. revision: yes

  2. Referee: [Experiments] The claims of consistent improvements lack error bars, statistical significance tests, and details on query selection and caption generation, preventing full verification of the empirical results on POPE and HallusionBench across the four models.

    Authors: We accept that the reported results would be more verifiable with error bars, statistical tests, and fuller experimental details. The revised version will include standard-error bars on all POPE and HallusionBench tables and figures, report paired statistical significance tests (e.g., McNemar or Wilcoxon signed-rank) for each claimed improvement, and expand the experimental setup with explicit descriptions of query sampling, caption-generation prompts, decoding parameters, and the exact procedure used to obtain the clean and caption-augmented paths. revision: yes
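The paired exact McNemar test promised above needs only the two discordant counts per benchmark. A minimal self-contained sketch of the standard exact-binomial form (not code from the paper), where `b` counts queries GEASS gets right and the baseline gets wrong, and `c` the reverse:

```python
from math import comb

def mcnemar_exact(b, c):
    """Two-sided exact McNemar test for paired accuracy comparisons.

    Under H0 (no difference between methods), each discordant pair is
    equally likely to flip in either direction, so min(b, c) follows
    Binomial(b + c, 0.5); the p-value doubles the smaller tail.
    """
    n = b + c
    if n == 0:
        return 1.0  # no discordant pairs: nothing to test
    k = min(b, c)
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)
```

For example, 10 queries fixed by GEASS against 2 it newly breaks gives p ≈ 0.039, a nominally significant paired difference even on a small discordant set.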

Circularity Check

0 steps flagged

No significant circularity: GEASS rules defined directly from model outputs

full rationale

The paper presents GEASS as a training-free module whose per-query gating, weighting, and evidence-bar decisions are computed on-the-fly from the VLM's own clean-path confidence and entropy reduction signals. No parameters are fitted to the target benchmarks, no equations reduce the steering logic to its own inputs by construction, and no load-bearing claims rest on self-citations whose validity is presupposed. The central mechanism is therefore an independent, externally falsifiable heuristic whose correctness can be tested against POPE and HallusionBench without circular reference to the method itself. This is the normal, non-circular case for an empirical, output-driven inference-time intervention.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central claim rests on two domain assumptions about caption error asymmetry and the proxy quality of confidence and entropy; no free parameters or new entities are introduced.

axioms (2)
  • domain assumption Caption errors are asymmetric: omissions vastly outnumber fabrications, yet each fabrication carries a much larger per-instance impact.
    Explicitly stated in the abstract as the reason naive caption embedding can degrade performance.
  • ad hoc to paper Clean-path confidence and entropy reduction are valid per-query signals for deciding caption consumption.
    These quantities form the core gating and weighting logic of GEASS.

pith-pipeline@v0.9.0 · 5510 in / 1275 out tokens · 51142 ms · 2026-05-10T15:30:55.142595+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

14 extracted references · 8 canonical work pages · 5 internal anchors

  1. [1]

    Qwen2.5-VL Technical Report

Bai, S., Chen, K., Liu, X., Wang, J., Ge, W., Song, S., Dang, K., Wang, P., Wang, S., Tang, J., et al. Qwen2.5-VL technical report. arXiv preprint arXiv:2502.13923.

  2. [2]

    Expanding Performance Boundaries of Open-Source Multimodal Models with Model, Data, and Test-Time Scaling

Chen, L., Li, J., Dong, X., Zhang, P., He, C., Wang, J., Zhao, F., and Lin, D. ShareGPT4V: Improving large multi-modal models with better captions. In European Conference on Computer Vision, pp. 370–387. Springer, 2024. Chen, Z., Wang, W., Cao, Y., Liu, Y., Gao, Z., Cui, E., Zhu, J., Ye, S., Tian, H., …

  3. [3]

Retrieval Visual Contrastive Decoding to Mitigate Object Hallucinations in Large Vision-Language Models

Lee, J. and Song, M. Retrieval visual contrastive decoding to mitigate object hallucinations in large vision-language models. In Findings of the Association for Computational Linguistics: ACL 2025, pp. 8200–8219.

  4. [4]

    Evaluating object hallucination in large vision-language models

Li, Y., Du, Y., Zhou, K., Wang, J., Zhao, X., and Wen, J.-R. Evaluating object hallucination in large vision-language models. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pp. 292–305.

  5. [5]

The Hidden Life of Tokens: Reducing Hallucination of Large Vision-Language Models via Visual Information Steering

Li, Z., Shi, H., Gao, Y., Liu, D., Wang, Z., Chen, Y., Liu, T., Zhao, L., Wang, H., and Metaxas, D. N. The hidden life of tokens: Reducing hallucination of large vision-language models via visual information steering. arXiv preprint arXiv:2502.03628.

  6. [6]

    Mitigating Hallucination in Large Multi-Modal Models via Robust Instruction Tuning

Liu, F., Lin, K., Li, L., Wang, J., Yacoob, Y., and Wang, L. Mitigating hallucination in large multi-modal models via robust instruction tuning. arXiv preprint arXiv:2306.14565.

  7. [7]

Second: Mitigating Perceptual Hallucination in Vision-Language Models via Selective and Contrastive Decoding

Park, W., Kim, W., Kim, J., and Do, J. Second: Mitigating perceptual hallucination in vision-language models via selective and contrastive decoding. arXiv preprint arXiv:2506.08391.

  8. [8]

Object Hallucination in Image Captioning

Rohrbach, A., Hendricks, L. A., Burns, K., Darrell, T., and Saenko, K. Object hallucination in image captioning. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pp. 4035–4045.

  9. [9]

Mitigating Hallucinations in Vision-Language Models through Image-Guided Head Suppression

Sarkar, S., Che, Y., Gavin, A., Beerel, P. A., and Kundu, S. Mitigating hallucinations in vision-language models through image-guided head suppression. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pp. 12492–12511.

  10. [10]

Aligning Large Multi-Modal Models with Factually Augmented RLHF

Sun, Z., Shen, S., Cao, S., Liu, H., Li, C., Shen, Y., Gan, C., Gui, L., Wang, Y.-X., Yang, Y., et al. Aligning large multi-modal models with factually augmented RLHF. In Findings of the Association for Computational Linguistics: ACL 2024, pp. 13088–13110.

  11. [11]

Mitigating Hallucinations in Large Vision-Language Models with Internal Fact-Based Contrastive Decoding

Wang, C., Zhou, X., Fu, W., and Zhou, Y. Mitigating hallucinations in large vision-language models with internal fact-based contrastive decoding. arXiv preprint arXiv:2502.01056.

  12. [12]

    Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution

Wang, P., Bai, S., Tan, S., Wang, S., Fan, Z., Bai, J., Chen, K., Liu, X., Wang, J., Ge, W., et al. Qwen2-VL: Enhancing vision-language model’s perception of the world at any resolution. arXiv preprint arXiv:2409.12191.

  13. [13]

    InternVL3: Exploring Advanced Training and Test-Time Recipes for Open-Source Multimodal Models

Zhu, J., Wang, W., Chen, Z., Liu, Z., Ye, S., Gu, L., Tian, H., Duan, Y., Su, W., Shao, J., et al. InternVL3: Exploring advanced training and test-time recipes for open-source multimodal models. arXiv preprint arXiv:2504.10479.
