GEASS: Training-Free Caption Steering for Hallucination Mitigation in Vision-Language Models
Pith reviewed 2026-05-10 15:30 UTC · model grok-4.3
The pith
GEASS lets vision-language models decide per query how much of a self-generated caption to trust, cutting hallucinations.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
GEASS is a training-free module that decides on each query how much of the caption the model consumes: it gates the caption by the clean path's confidence, weights it by the entropy reduction it produces, and raises the evidence bar when the two pathways disagree. Experiments on POPE and HallusionBench across four VLMs show that GEASS consistently improves over vanilla inference and contrastive decoding, with only two extra forward passes per query.
What carries the argument
Gated Evidence-Aware Selective Steering (GEASS), a per-query mechanism that combines clean-path confidence gating, entropy-reduction weighting, and disagreement-based evidence raising to control how much caption information enters the model's reasoning.
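To make the mechanism concrete, here is a minimal sketch in Python of what such a per-query decision could look like. The thresholds, the linear mixture, and the damping factor are illustrative assumptions made for this review, not the paper's equations; `geass_fuse` is a hypothetical name.

```python
import numpy as np

def softmax(logits):
    z = np.asarray(logits, float)
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def entropy(p):
    """Shannon entropy (nats) of a probability vector."""
    p = np.clip(p, 1e-12, 1.0)
    return float(-(p * np.log(p)).sum())

def geass_fuse(clean_logits, caption_logits, tau_conf=0.9, tau_margin=0.15):
    """Per-query fusion: gate, weight, and raise the evidence bar.

    clean_logits   : answer logits from the caption-free forward pass
    caption_logits : answer logits from the caption-conditioned pass
    """
    p_clean, p_cap = softmax(clean_logits), softmax(caption_logits)

    # 1) Confidence gate: a confident clean path consumes no caption.
    if p_clean.max() >= tau_conf:
        return p_clean

    # 2) Entropy-reduction weight: trust the caption in proportion to
    #    how much it sharpens the answer distribution (clamped to [0, 1]).
    h_clean, h_cap = entropy(p_clean), entropy(p_cap)
    weight = max(0.0, h_clean - h_cap) / max(h_clean, 1e-12)

    # 3) Evidence bar: if the two pathways pick different answers, the
    #    caption path must win by a clear confidence margin; otherwise
    #    its influence is damped.
    if p_clean.argmax() != p_cap.argmax():
        if p_cap.max() - p_clean.max() < tau_margin:
            weight *= 0.5

    return (1.0 - weight) * p_clean + weight * p_cap

# Toy binary query (index 0 = "yes", 1 = "no"): an unsure clean path and a
# sharply peaked caption path yield a mixture pulled toward the caption.
print(geass_fuse(np.array([0.2, 0.3]), np.array([2.0, -1.0])))
```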
If this is right
- Caption usefulness must be treated as a query-specific rather than corpus-wide property.
- Hallucination mitigation is possible at inference time without any model training or additional data.
- Only two extra forward passes are needed to achieve measurable gains on POPE and HallusionBench.
- The method applies across multiple vision-language models without architecture-specific changes.
Where Pith is reading between the lines
- The same gating logic could be applied to other self-generated intermediates such as chain-of-thought steps to limit error propagation.
- Inference-time selection mechanisms may offer a lightweight complement to existing decoding strategies in multimodal settings.
- Testing the approach on open-ended generation tasks beyond the two benchmarks would clarify whether the per-query adaptation generalizes.
Load-bearing premise
On a per-query basis, the combination of clean-path confidence and entropy reduction can reliably identify when and how much caption content is useful, without discarding beneficial information or introducing a new selection bias.
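As a reading aid, here is a toy numerical illustration of the signals this premise relies on, under invented probabilities: a confidently wrong caption path (say, one whose caption omitted or fabricated the queried object) produces a large entropy reduction while disagreeing with the clean path, which is exactly the case the disagreement-based evidence bar is meant to catch.

```python
import numpy as np

def entropy(p):
    p = np.clip(np.asarray(p, float), 1e-12, 1.0)
    return float(-(p * np.log(p)).sum())

# Hypothetical POPE-style query "Is there a dog in the image?" (yes, no).
p_clean   = np.array([0.55, 0.45])  # clean path: weakly leans "yes"
p_caption = np.array([0.10, 0.90])  # caption path: confident "no", e.g. the caption omitted the dog

entropy_reduction = entropy(p_clean) - entropy(p_caption)  # ~0.36 nats: caption looks "useful"
disagree = p_clean.argmax() != p_caption.argmax()          # True: pathways pick different answers

print(f"clean confidence   = {p_clean.max():.2f}")
print(f"entropy reduction  = {entropy_reduction:.2f} nats")
print(f"pathways disagree  = {disagree}")
```

Entropy reduction alone would endorse the caption here; whether the evidence bar catches such cases is precisely the premise being tested.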
What would settle it
On a new VLM or benchmark, if GEASS produces lower accuracy than vanilla inference or contrastive decoding, the claim that selective steering consistently mitigates hallucinations would be refuted.
Original abstract
Vision-Language Models (VLMs) excel at grounded reasoning but remain prone to object hallucination. Recent work treats self-generated captions as a uniformly positive resource, yet we find that naively embedding one can degrade rather than help--dropping Qwen2.5-VL-3B accuracy on HallusionBench by nearly 10 points. Two structural properties explain this. First, captions anchor not only the model's final answer but also its reasoning trajectory and lexical choices. Second, caption errors are asymmetric: omissions vastly outnumber fabrications, yet each fabrication carries a much larger per-instance impact. A caption's usefulness is therefore a per-query property, not a per-corpus one. We propose GEASS (Gated Evidence-Aware Selective Steering), a training-free module that decides on each query how much of the caption the model consumes: it gates the caption by the clean path's confidence, weights it by the entropy reduction it produces, and raises the evidence bar when the two pathways disagree. Experiments on POPE and HallusionBench across four VLMs show that GEASS consistently improves over vanilla inference and contrastive decoding, with only two extra forward passes per query.
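To make the cost claim concrete, here is a minimal sketch of where the two extra forward passes would sit in an inference loop, assuming hypothetical `vlm_caption` and `vlm_answer_logits` wrappers around the evaluated model; the gated fusion sketched earlier in this review would plug in as `fuse`, which here is only a placeholder average.

```python
import numpy as np

def softmax(logits):
    z = np.asarray(logits, float)
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def answer_with_caption_steering(image, question, vlm_caption, vlm_answer_logits,
                                 fuse=lambda p_clean, p_cap: 0.5 * p_clean + 0.5 * p_cap):
    # Pass needed anyway: clean answer logits for the bare question.
    clean_logits = vlm_answer_logits(image, question)

    # Extra pass 1: self-generate a caption for this image.
    caption = vlm_caption(image)

    # Extra pass 2: answer logits with the caption placed in the prompt.
    caption_logits = vlm_answer_logits(image, f"Caption: {caption}\n{question}")

    # Per-query fusion decides how much of the caption the model consumes.
    return fuse(softmax(clean_logits), softmax(caption_logits))

# Tiny stand-in "model" for demonstration: fixed caption, fixed logits.
demo = answer_with_caption_steering(
    image=None, question="Is there a dog?",
    vlm_caption=lambda img: "A dog on a couch.",
    vlm_answer_logits=lambda img, prompt: np.array([1.0, -1.0]))
print(demo)
```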
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper claims that naively embedding self-generated captions can degrade VLM performance on tasks like HallusionBench due to anchoring effects and asymmetric error distributions (omissions vs. fabrications), and introduces GEASS, a training-free module that gates and weights caption consumption per query using the clean inference path's confidence, entropy reduction, and disagreement signals. Experiments across POPE and HallusionBench on four VLMs report consistent gains over vanilla inference and contrastive decoding at the cost of two extra forward passes.
Significance. If the per-query selection signals prove reliable, GEASS would offer a lightweight, training-free intervention for hallucination mitigation that avoids the pitfalls of uniform caption use, with potential for easy integration into existing VLMs without retraining or extra data.
major comments (2)
- [Abstract] The load-bearing assumption that clean-path confidence combined with entropy reduction reliably decides caption usefulness per query is not demonstrated for cases where the clean path itself hallucinates, despite the abstract's discussion of error asymmetry; this leaves the gating mechanism vulnerable to exactly the per-query bias it seeks to avoid.
- [Experiments] The claimed consistent improvements (as summarized in the abstract) lack error bars, statistical significance tests, and details on query selection and caption generation, preventing full verification of the empirical results on POPE and HallusionBench across the four models.
minor comments (1)
- Provide more precise implementation details on how the gating and weighting are applied during inference to aid reproducibility.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback. We address each major comment below and describe the revisions we will incorporate.
Point-by-point responses
- Referee: [Abstract] The load-bearing assumption that clean-path confidence combined with entropy reduction reliably decides caption usefulness per query is not demonstrated for cases where the clean path itself hallucinates, despite the abstract's discussion of error asymmetry; this leaves the gating mechanism vulnerable to exactly the per-query bias it seeks to avoid.
  Authors: We agree that the abstract's discussion of error asymmetry does not explicitly demonstrate the gating behavior on queries where the clean path hallucinates. The current design raises the evidence bar on pathway disagreement, which is intended to protect against clean-path errors, but we acknowledge this requires direct validation. In the revised manuscript we will add a dedicated analysis subsection that isolates queries where the clean inference hallucinates (identified via ground-truth mismatch on POPE and HallusionBench), reports the distribution of gating decisions, and quantifies how the disagreement signal alters caption consumption in those cases (see the sketch after these responses). revision: yes
- Referee: [Experiments] The claimed consistent improvements (as summarized in the abstract) lack error bars, statistical significance tests, and details on query selection and caption generation, preventing full verification of the empirical results on POPE and HallusionBench across the four models.
  Authors: We accept that the reported results would be more verifiable with error bars, statistical tests, and fuller experimental details. The revised version will include standard-error bars on all POPE and HallusionBench tables and figures, report paired statistical significance tests (e.g., McNemar or Wilcoxon signed-rank) for each claimed improvement, and expand the experimental setup with explicit descriptions of query sampling, caption-generation prompts, decoding parameters, and the exact procedure used to obtain the clean and caption-augmented paths. revision: yes
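For the paired test mentioned in the second response, here is a minimal sketch of an exact McNemar test on per-query correctness; the two 0/1 arrays are hypothetical placeholders for real per-query results.

```python
import numpy as np
from scipy.stats import binomtest

vanilla_correct = np.array([1, 0, 1, 1, 0, 1, 0, 1, 1, 0])
geass_correct   = np.array([1, 1, 1, 1, 0, 0, 1, 1, 1, 1])

# Discordant pairs: queries where exactly one of the two methods is correct.
b = int(((vanilla_correct == 1) & (geass_correct == 0)).sum())  # vanilla only
c = int(((vanilla_correct == 0) & (geass_correct == 1)).sum())  # GEASS only

# Exact McNemar test = two-sided binomial test on the discordant pairs.
result = binomtest(b, n=b + c, p=0.5, alternative="two-sided")
print(f"discordant pairs: b={b}, c={c}, p-value={result.pvalue:.3f}")
```

And for the analysis promised in the first response, a sketch of how gating decisions could be summarized on the subset where the clean path hallucinates, assuming a hypothetical per-query log with these column names (the toy rows are invented for illustration).

```python
import pandas as pd

log = pd.DataFrame({
    "clean_correct":  [1, 1, 0, 0, 1, 0],   # clean path matches ground truth?
    "disagree":       [0, 1, 1, 1, 0, 0],   # did the two pathways disagree?
    "gate_open":      [0, 1, 1, 1, 0, 1],   # did the gate admit the caption?
    "caption_weight": [0.0, 0.2, 0.6, 0.1, 0.0, 0.7],
})

# Gating behavior on queries where the clean path hallucinates vs. not.
print(log.groupby("clean_correct")[["gate_open", "caption_weight"]].mean())

# Within the clean-path-hallucination subset, how disagreement changes
# how much caption is consumed.
subset = log[log["clean_correct"] == 0]
print(subset.groupby("disagree")["caption_weight"].mean())
```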
Circularity Check
No significant circularity: GEASS's rules are defined directly from model outputs.
Full rationale
The paper presents GEASS as a training-free module whose per-query gating, weighting, and evidence-bar decisions are computed on-the-fly from the VLM's own clean-path confidence and entropy reduction signals. No parameters are fitted to the target benchmarks, no equations reduce the steering logic to its own inputs by construction, and no load-bearing claims rest on self-citations whose validity is presupposed. The central mechanism is therefore an independent, externally falsifiable heuristic whose correctness can be tested against POPE and HallusionBench without circular reference to the method itself. This is the normal, non-circular case for an empirical, output-driven inference-time intervention.
Axiom & Free-Parameter Ledger
axioms (2)
- domain assumption: Caption errors are asymmetric: omissions vastly outnumber fabrications, yet each fabrication carries a much larger per-instance impact.
- ad hoc to paper: Clean-path confidence and entropy reduction are valid per-query signals for deciding caption consumption.
Reference graph
Works this paper leans on
- [1] Bai, S., Chen, K., Liu, X., Wang, J., Ge, W., Song, S., Dang, K., Wang, P., Wang, S., Tang, J., et al. Qwen2.5-VL technical report. arXiv preprint arXiv:2502.13923, 2025.
- [2] Chen, L., Li, J., Dong, X., Zhang, P., He, C., Wang, J., Zhao, F., and Lin, D. ShareGPT4V: Improving large multi-modal models with better captions. In European Conference on Computer Vision, pp. 370–387. Springer, 2024.
- [3] Lee, J. and Song, M. Retrieval visual contrastive decoding to mitigate object hallucinations in large vision-language models. In Findings of the Association for Computational Linguistics: ACL 2025, pp. 8200–8219, 2025.
- [4] Li, Y., Du, Y., Zhou, K., Wang, J., Zhao, X., and Wen, J.-R. Evaluating object hallucination in large vision-language models. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pp. 292–305, 2023.
- [5]
- [6] Liu, F., Lin, K., Li, L., Wang, J., Yacoob, Y., and Wang, L. Mitigating hallucination in large multi-modal models via robust instruction tuning. arXiv preprint arXiv:2306.14565, 2023.
- [7] Park, W., Kim, W., Kim, J., and Do, J. SECOND: Mitigating perceptual hallucination in vision-language models via selective and contrastive decoding. arXiv preprint arXiv:2506.08391, 2025.
- [8] Rohrbach, A., Hendricks, L. A., Burns, K., Darrell, T., and Saenko, K. Object hallucination in image captioning. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pp. 4035–4045, 2018.
- [9] Sarkar, S., Che, Y., Gavin, A., Beerel, P. A., and Kundu, S. Mitigating hallucinations in vision-language models through image-guided head suppression. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pp. 12492–12511, 2025.
- [10] Sun, Z., Shen, S., Cao, S., Liu, H., Li, C., Shen, Y., Gan, C., Gui, L., Wang, Y.-X., Yang, Y., et al. Aligning large multi-modal models with factually augmented RLHF. In Findings of the Association for Computational Linguistics: ACL 2024, pp. 13088–13110, 2024.
- [11] Wang, C., Zhou, X., Fu, W., and Zhou, Y. Mitigating hallucinations in large vision-language models with internal fact-based contrastive decoding. arXiv preprint arXiv:2502.01056, 2025.
- [12] Wang, P., Bai, S., Tan, S., Wang, S., Fan, Z., Bai, J., Chen, K., Liu, X., Wang, J., Ge, W., et al. Qwen2-VL: Enhancing vision-language model's perception of the world at any resolution. arXiv preprint arXiv:2409.12191, 2024.
- [13] Zhu, J., Wang, W., Chen, Z., Liu, Z., Ye, S., Gu, L., Tian, H., Duan, Y., Su, W., Shao, J., et al. InternVL3: Exploring advanced training and test-time recipes for open-source multimodal models. arXiv preprint arXiv:2504.10479, 2025.