DO-Bench: An Attributable Benchmark for Diagnosing Object Hallucination in Vision-Language Models
Pith reviewed 2026-05-10 06:30 UTC · model grok-4.3
The pith
Object hallucinations in vision-language models stem from two separable sources, textual-prior influence and perceptual weakness, which a controlled benchmark can measure independently.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
DO-Bench isolates the sources of object hallucination through Prior Override, which strengthens textual priors with constant visuals, and Perception-Limited, which increases visual evidence from full scenes to localized crops. It defines PriorRobust and PerceptionAbility metrics to quantify these behaviors consistently across open- and closed-source models, revealing that object hallucination reflects heterogeneous, mechanism-dependent failure patterns beyond aggregate accuracy.
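The paper's exact formulas for the two metrics are not reproduced here. As a rough illustration only, one plausible reading treats PriorRobust as the share of prior-free accuracy that survives under the strongest textual prior, and PerceptionAbility as the accuracy gained as visual evidence sharpens from full scene to object crop. The function names and definitions below are assumptions, not the paper's:

```python
def prior_robust(acc_by_prior_level):
    """Hypothetical PriorRobust: fraction of prior-free accuracy that
    survives at the strongest textual-prior level. The input list is
    ordered from weakest to strongest prior."""
    baseline, strongest = acc_by_prior_level[0], acc_by_prior_level[-1]
    return strongest / baseline if baseline > 0 else 0.0

def perception_ability(acc_by_visual_level):
    """Hypothetical PerceptionAbility: accuracy gained as visual
    evidence improves. The input list is ordered from full-scene
    context to the tightest object crop."""
    return acc_by_visual_level[-1] - acc_by_visual_level[0]
```

Under this reading, a model that keeps most of its accuracy under strong priors but gains little from crops would profile as prior-robust yet perception-limited, which is exactly the kind of distinction aggregate accuracy hides.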
What carries the argument
DO-Bench's paired Prior Override and Perception-Limited dimensions, which hold one factor constant while varying the other to attribute errors to prior suppression, perceptual insufficiency, or their interaction.
If this is right
- Different models show distinct profiles of prior sensitivity and perceptual reliability.
- Errors can be attributed to specific causes rather than reported only as overall accuracy.
- Aggregate benchmarks alone cannot distinguish the underlying mechanisms of hallucination.
- Targeted fixes become possible once the dominant failure mode for a given model is identified.
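The paired design lends itself to a standard two-factor decomposition. As a sketch under assumed condition labels (weak/strong prior crossed with scene/crop visuals, not the paper's actual procedure), per-condition accuracy can be split into a prior effect, a perception effect, and their interaction:

```python
def decompose(acc):
    """Toy 2x2 factorial decomposition of accuracy under the paired
    interventions. `acc` maps (prior, visual) -> accuracy, with prior
    in {'weak', 'strong'} and visual in {'scene', 'crop'}; the labels
    are illustrative, not taken from the paper."""
    # Accuracy lost to stronger priors, averaged over visual conditions.
    prior_cost = (acc[('weak', 'scene')] - acc[('strong', 'scene')]
                  + acc[('weak', 'crop')] - acc[('strong', 'crop')]) / 2
    # Accuracy gained from tighter crops, averaged over prior conditions.
    percep_gain = (acc[('weak', 'crop')] - acc[('weak', 'scene')]
                   + acc[('strong', 'crop')] - acc[('strong', 'scene')]) / 2
    # Does better visual evidence shrink the prior's damage?
    interaction = ((acc[('weak', 'scene')] - acc[('strong', 'scene')])
                   - (acc[('weak', 'crop')] - acc[('strong', 'crop')]))
    return {'prior_cost': prior_cost,
            'percep_gain': percep_gain,
            'interaction': interaction}
```

A nonzero interaction term is the case the metrics must handle with care: it means the two failure modes are not additive for that model.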
Where Pith is reading between the lines
- Developers could select or fine-tune models based on their measured profiles for tasks where one failure mode dominates.
- The paired-intervention design could be adapted to diagnose other hallucination types such as attributes or spatial relations.
- Training data mixtures might be adjusted to strengthen perceptual grounding in models that score low on that metric.
Load-bearing premise
The structured multimodal interventions in the two dimensions truly isolate prior influence from perceptual grounding without introducing new confounds or altering model behavior in unintended ways.
What would settle it
A finding that models' error rates do not shift as expected when textual priors are strengthened or when visual evidence is improved from full scenes to object crops, or that the two dimensions interact in ways the metrics cannot separate.
Original abstract
Object-level hallucination remains a central reliability challenge for vision-language models (VLMs), particularly in binary object existence verification. Existing benchmarks emphasize aggregate accuracy but rarely disentangle whether errors stem from perceptual limitations or from the influence of contextual textual priors, leaving underlying failure mechanisms ambiguous. We introduce DO-Bench, a controlled diagnostic benchmark that isolates these sources through structured multimodal interventions. Rather than evaluating models in unconstrained settings, DO-Bench probes two complementary dimensions: the Prior Override dimension progressively strengthens contextual textual priors while holding visual evidence constant to assess resistance to prior pressure, and the Perception-Limited dimension incrementally enhances visual evidence from full-scene context to localized object crops to measure perceptual grounding strength. This paired design enables attribution of errors to prior suppression, perceptual insufficiency, or their interaction. We further define two diagnostic metrics, PriorRobust and PerceptionAbility, to quantify these behaviors consistently. Evaluations across diverse open- and closed-source VLMs reveal systematic differences in prior sensitivity and perceptual reliability, demonstrating that object hallucination reflects heterogeneous, mechanism-dependent failure patterns beyond aggregate accuracy.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces DO-Bench, a controlled diagnostic benchmark for object hallucination in vision-language models. It uses two complementary dimensions of structured multimodal interventions: Prior Override, which progressively strengthens contextual textual priors while holding the image fixed, and Perception-Limited, which varies visual input from full scenes to localized object crops. This paired design is intended to attribute errors to prior suppression, perceptual insufficiency, or their interaction. Two new metrics, PriorRobust and PerceptionAbility, are defined to quantify these behaviors, and evaluations across diverse open- and closed-source VLMs are reported to demonstrate heterogeneous, mechanism-dependent failure patterns beyond aggregate accuracy.
Significance. If the interventions are shown to isolate the targeted factors without confounds, DO-Bench would advance the field by supplying a diagnostic tool that moves beyond aggregate accuracy to reveal specific mechanisms of object hallucination in VLMs. The explicit separation of prior pressure and perceptual grounding, together with the two new metrics, offers a reproducible framework for comparing model behaviors across architectures.
major comments (2)
- [Abstract and Benchmark Design sections] The central claim that the Prior Override and Perception-Limited interventions cleanly isolate prior influence from perceptual grounding lacks any validation data, control experiments, or error analysis demonstrating that the interventions achieve their intended separation. In joint vision-language models, appending textual priors can alter cross-attention and tokenization even when pixel values remain identical, while object crops can remove global scene context used for disambiguation; without evidence that these side effects are negligible, error attribution remains ambiguous. This issue is load-bearing for the diagnostic value of the benchmark.
- [Evaluations section] The evaluations are described as revealing systematic differences in prior sensitivity and perceptual reliability, yet the manuscript supplies no quantitative results, model-specific breakdowns, or ablation studies that tie observed errors directly to the isolated factors rather than to unintended effects of the interventions. This leaves the claim of 'mechanism dependent failure patterns' unsupported by the presented evidence.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback, which identifies key areas where additional validation would strengthen the diagnostic claims of DO-Bench. We address each major comment below and outline targeted revisions to provide the requested evidence for intervention isolation and mechanism attribution.
Point-by-point responses
-
Referee: [Abstract and Benchmark Design sections] The central claim that the Prior Override and Perception-Limited interventions cleanly isolate prior influence from perceptual grounding lacks any validation data, control experiments, or error analysis demonstrating that the interventions achieve their intended separation. In joint vision-language models, appending textual priors can alter cross-attention and tokenization even when pixel values remain identical, while object crops can remove global scene context used for disambiguation; without evidence that these side effects are negligible, error attribution remains ambiguous. This issue is load-bearing for the diagnostic value of the benchmark.
Authors: We agree that explicit validation of factor isolation is essential and currently absent from the manuscript. The design intends separation by holding the image fixed while strengthening textual priors (Prior Override) and by varying visual evidence granularity while holding text fixed (Perception-Limited), but we acknowledge potential side effects such as attention shifts or loss of scene context. In revision, we will add control experiments including neutral-text baselines, attention-map analyses to quantify unintended cross-attention changes, and error-pattern comparisons under minimal interventions. These will empirically demonstrate that side effects are limited and support the intended attribution. revision: yes
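A minimal version of the proposed neutral-text control could compare accuracy with no added text against accuracy with semantically empty filler of matched length; if the filler alone moves accuracy, later shifts under prior-laden text cannot be attributed to the prior's content. The function and the tolerance value below are arbitrary placeholders, not thresholds from the paper:

```python
def prior_intervention_is_clean(acc_no_text, acc_neutral_text, tol=0.02):
    """Hypothetical sanity check: appending neutral (prior-free) text
    should leave accuracy essentially unchanged; a shift beyond `tol`
    suggests the intervention itself, not the prior's content, is
    altering model behavior."""
    return abs(acc_neutral_text - acc_no_text) <= tol
```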
-
Referee: [Evaluations section] The evaluations are described as revealing systematic differences in prior sensitivity and perceptual reliability, yet the manuscript supplies no quantitative results, model-specific breakdowns, or ablation studies that tie observed errors directly to the isolated factors rather than to unintended effects of the interventions. This leaves the claim of 'mechanism dependent failure patterns' unsupported by the presented evidence.
Authors: The evaluations report performance across open- and closed-source VLMs on the two dimensions, indicating heterogeneous behaviors. However, we concur that the current presentation lacks the granular quantitative breakdowns and ablations needed to directly link errors to the targeted factors. We will revise the section to include per-model metric tables, error attribution breakdowns that map failures to prior versus perception dimensions, and ablation studies that systematically vary one intervention while holding the other constant. These additions will provide the quantitative support for mechanism-dependent patterns. revision: yes
Circularity Check
No circularity: benchmark proposal with externally validated interventions and metrics
full rationale
The paper introduces DO-Bench as a diagnostic benchmark using structured multimodal interventions (Prior Override and Perception-Limited dimensions) and defines two metrics (PriorRobust and PerceptionAbility). No mathematical derivations, equations, parameter fittings, or predictions are present that reduce to inputs by construction. The central claims rest on the design of the benchmark and on evaluations of external models rather than on self-referential steps or self-citation chains. The work's value therefore depends on whether the interventions isolate the intended factors, which is a correctness question, not a circularity.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: Structured changes to textual priors and visual crops isolate the intended failure mechanisms without side effects on model behavior.
invented entities (2)
- PriorRobust metric: no independent evidence
- PerceptionAbility metric: no independent evidence
Reference graph
Works this paper leans on
- [1] Bai, S., Chen, K., Liu, X., Wang, J., Ge, W., Song, S., Dang, K., Wang, P., Wang, S., Tang, J., Zhong, H., Zhu, Y., Yang, M., Li, Z., Wan, J., Wang, P., Ding, W., Fu, Z., Xu, Y., Ye, J., Zhang, X., Xie, T., Cheng, Z., Zhang, H., Yang, Z., Xu, H., Lin, J.: Qwen2.5-VL technical report. arXiv preprint arXiv:2502.13923 (2025)
- [2] Chen, J., Fang, Z., Yang, X., Yu, C., Yin, Z., Su, H.: Exploring the secondary risks of large language models. arXiv preprint arXiv:2506.12382 (2025)
- [3] Chen, J., Yang, X., Fang, Z., Tian, Y., Dong, Y., Yin, Z., Su, H.: AutoBreach: Universal and adaptive jailbreaking with efficient wordplay-guided optimization via multi-LLMs. In: Findings of the Association for Computational Linguistics: NAACL 2025. pp. 6792–6813. Association for Computational Linguistics (2025)
- [4] Chen, X., Ma, Z., Zhang, X., Xu, S., Qian, S., Yang, J., Fouhey, D., Chai, J.: Multi-object hallucination in vision language models. Advances in Neural Information Processing Systems 37, 44393–44418 (2024)
- [5] Chen, Z., Wang, W., Cao, Y., Liu, Y., Gao, Z., Cui, E., Zhu, J., Ye, S., Tian, H., Liu, Z., et al.: Expanding performance boundaries of open-source multimodal models with model, data, and test-time scaling. arXiv preprint arXiv:2412.05271 (2024)
- [6] Chuang, Y.S., Xie, Y., Luo, H., Kim, Y., Glass, J., He, P.: DoLa: Decoding by contrasting layers improves factuality in large language models. arXiv preprint arXiv:2309.03883 (2023)
- [7] Dai, D., Xu, L., Li, Y., Zhang, Y., Xia, S.: HumanVLM: Foundation for human-scene vision-language model. Information Fusion 123, 103271 (2025)
- [8] Dai, W., Li, J., Li, D., Tiong, A., Zhao, J., Wang, W., Li, B., Fung, P.N., Hoi, S.: InstructBLIP: Towards general-purpose vision-language models with instruction tuning. Advances in Neural Information Processing Systems 36, 49250–49267 (2023)
- [9] Dai, W., Liu, Z., Ji, Z., Su, D., Fung, P.: Plausible may not be faithful: Probing object hallucination in vision-language pre-training. In: Proceedings of the 17th Conference of the European Chapter of the Association for Computational Linguistics. pp. 2136–2148 (2023)
- [10] Danish, S., Sadeghi-Niaraki, A., Khan, S.U., Dang, L.M., Tightiz, L., Moon, H.: A comprehensive survey of vision-language models: Pretrained models, fine-tuning, prompt engineering, adapters, and benchmark datasets. Information Fusion, 103623 (2025)
- [11] Feng, Y., Liu, Y., Yang, S., Cai, W., Zhang, J., Zhan, Q., Huang, Z., Yan, H., Wan, Q., Liu, C., et al.: Vision-language model for object detection and segmentation: A review and evaluation. arXiv preprint arXiv:2504.09480 (2025)
- [12] Fu, C., Chen, P., Shen, Y., Qin, Y., Zhang, M., Lin, X., Yang, J., Zheng, X., Li, K., Sun, X., et al.: MME: A comprehensive evaluation benchmark for multimodal large language models. arXiv preprint arXiv:2306.13394 (2023)
- [13] Guan, T., Liu, F., Wu, X., Xian, R., Li, Z., Liu, X., Wang, X., Chen, L., Huang, F., Yacoob, Y., et al.: HallusionBench: An advanced diagnostic suite for entangled language hallucination and visual illusion in large vision-language models. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 14375–14385 (2024)
- [14] Gunjal, A., Yin, J., Bas, E.: Detecting and preventing hallucinations in large vision language models. In: Proceedings of the AAAI Conference on Artificial Intelligence. vol. 38, pp. 18135–18143 (2024)
- [15] Ho, H.T., Nguyen, L.V., Pham, M.T., Pham, Q.H., Tran, Q.D., Huy, D.N.M., Nguyen, T.H.: A review on vision-language-based approaches: Challenges and applications. Computers, Materials & Continua 82(2) (2025)
- [16] Hua, Z., He, J., Yao, Z., Han, T., Guo, H., Jia, Y., Fang, J.: Steering LVLMs via sparse autoencoder for hallucination mitigation. arXiv preprint arXiv:2505.16146 (2025)
- [17] Huang, Q., Dong, X., Zhang, P., Wang, B., He, C., Wang, J., Lin, D., Zhang, W., Yu, N.: OPERA: Alleviating hallucination in multi-modal large language models via over-trust penalty and retrospection-allocation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 13418–13427 (2024)
- [18] Jiang, Y., Yan, X., Ji, G.P., Fu, K., Sun, M., Xiong, H., Fan, D.P., Khan, F.S.: Effectiveness assessment of recent large vision-language models. Visual Intelligence 2(1), 17 (2024)
- [19] Kaul, P., Li, Z., Yang, H., Dukler, Y., Swaminathan, A., Taylor, C., Soatto, S.: THRONE: An object-based hallucination benchmark for the free-form generations of large vision-language models. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 27228–27238 (2024)
- [20] Leng, S., Zhang, H., Chen, G., Li, X., Lu, S., Miao, C., Bing, L.: Mitigating object hallucinations in large vision-language models through visual contrastive decoding. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 13872–13882 (2024)
- [21] Li, S., Sun, J., Zheng, G., Fan, X., Shen, Y., Lu, Y., Xi, Z., Yang, Y., Tan, W., Ji, T., et al.: Mitigating object hallucinations in MLLMs via multi-frequency perturbations. arXiv preprint arXiv:2503.14895 (2025)
- [22] Li, Y., Zhan, H., Chen, J., Gong, Y., Liu, Q., Lu, Y.: DeepScan: A training-free framework for visually grounded reasoning in large vision-language models. arXiv preprint arXiv:2603.03857 (2026)
- [23] Li, Y., Zhan, H., Chen, T., Liu, Q., Lu, Y.: Why 1+1 < 1 in visual token pruning: Beyond naïve integration via multi-objective balanced covering. arXiv preprint arXiv:2505.10118 (2025)
- [24] Li, Y., Du, Y., Zhou, K., Wang, J., Zhao, W.X., Wen, J.R.: Evaluating object hallucination in large vision-language models. In: Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing. pp. 292–305 (2023)
- [25] Lin, X., Wang, J., Fu, L., Yan, H., Ye, Q.: Instance-aware visual prompting helps multimodal models see better. Expert Systems with Applications, 129373 (2025)
- [26] Liu, H., Li, C., Wu, Q., Lee, Y.J.: Visual instruction tuning. Advances in Neural Information Processing Systems 36, 34892–34916 (2023)
- [27] Lovenia, H., Dai, W., Cahyawijaya, S., Ji, Z., Fung, P.: Negative Object Presence Evaluation (NOPE) to measure object hallucination in vision-language models. In: Proceedings of the 3rd Workshop on Advances in Language and Vision Research (ALVR). pp. 37–58 (2024)
- [28] Lu, Y., Zhang, Z., Yuan, C., Gao, J., Zhang, C., Qi, X., Li, B., Hu, W.: Mitigating hallucinations in large vision-language models by self-injecting hallucinations. arXiv preprint arXiv:2509.11287 (2025)
- [29] Parcalabescu, L., Cafagna, M., Muradjan, L., Frank, A., Calixto, I., Gatt, A.: VALSE: A task-independent benchmark for vision and language models centered on linguistic phenomena. In: Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). pp. 8253–8280 (2022)
- [30] Rohrbach, A., Hendricks, L.A., Burns, K., Darrell, T., Saenko, K.: Object hallucination in image captioning. In: Conference on Empirical Methods in Natural Language Processing (2018)
- [31] Seth, A., Manocha, D., Agarwal, C.: Hallucinogen: Benchmarking hallucination in implicit reasoning within large vision language models. In: Proceedings of the 2nd Workshop on Uncertainty-Aware NLP (UncertaiNLP 2025). pp. 89–102 (2025)
- [32] Shang, Y., Zeng, X., Zhu, Y., Yang, X., Fang, Z., Zhang, J., Chen, J., Liu, Z., Tian, Y.: From pixels to tokens: Revisiting object hallucinations in large vision-language models. In: Proceedings of the 33rd ACM International Conference on Multimedia. pp. 10496–10505. Association for Computing Machinery (2025)
- [33] Wang, F., Ding, L., Rao, J., Liu, Y., Shen, L., Ding, C.: Can linguistic knowledge improve multimodal alignment in vision-language pretraining? ACM Transactions on Multimedia Computing, Communications and Applications 20, 1–22 (2023)
- [34] Wang, J., Wang, Y., Xu, G., Zhang, J., Gu, Y., Jia, H., Wang, J., Xu, H., Yan, M., Zhang, J., et al.: AMBER: An LLM-free multi-dimensional benchmark for MLLMs hallucination evaluation. arXiv preprint arXiv:2311.07397 (2023)
- [35] Yang, Z., Wang, J., Ye, X., Tang, Y., Chen, K., Zhao, H., Torr, P.H.: Language-aware vision transformer for referring segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence 47(7), 5238–5255 (2024)
- [36] Ye, P., Xiao, G., Liu, J.: Multimodal features alignment for vision–language object tracking. Remote Sensing 16(7), 1168 (2024)
- [37] Yin, S., Fu, C., Zhao, S., Xu, T., Wang, H., Sui, D., Shen, Y., Li, K., Sun, X., Chen, E.: Woodpecker: Hallucination correction for multimodal large language models. Science China Information Sciences 67(12), 220105 (2024)
- [38] Yue, X., Ni, Y., Zhang, K., Zheng, T., Liu, R., Zhang, G., Stevens, S., Jiang, D., Ren, W., Sun, Y., et al.: MMMU: A massive multi-discipline multimodal understanding and reasoning benchmark for expert AGI. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 9556–9567 (2024)
- [39] Zeng, Y., Huang, Y., Zhang, J., Jie, Z., Chai, Z., Wang, L.: Investigating compositional challenges in vision-language models for visual grounding. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 14141–14151 (2024)
- [40] Zhai, B., Yang, S., Zhao, X., Xu, C., Shen, S., Zhao, D., Keutzer, K., Li, M., Yan, T., Fan, X.: HallE-Switch: Rethinking and controlling object existence hallucinations in large vision-language models for detailed caption (2023)
- [41] Zhang, J., Huang, J., Jin, S., Lu, S.: Vision-language models for vision tasks: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence 46(8), 5625–5644 (2024)
- [42] Zhao, F., Zhang, C., Zhang, R., Wang, T., Li, X.: Mitigating image captioning hallucinations in vision-language models. In: 2025 IEEE 8th International Conference on Multimedia Information Processing and Retrieval (MIPR). pp. 297–302. IEEE (2025)
- [43] Zhao, T., Zhang, T., Zhu, M., Shen, H., Lee, K., Lu, X., Yin, J.: VL-CheckList: Evaluating pre-trained vision-language models with objects, attributes and relations. arXiv preprint arXiv:2207.00221 (2022)
- [44] Zhou, Y., Cui, C., Yoon, J., Zhang, L., Deng, Z., Finn, C., Bansal, M., Yao, H.: Analyzing and mitigating object hallucination in large vision-language models. arXiv preprint arXiv:2310.00754 (2023)
- [45] Zhu, F., Liu, Z., Yao, N.X., Wu, H., Wang, W., Feng, F., Wang, C., Luan, H., Chua, T.S.: MMDocBench: Benchmarking large vision-language models for fine-grained visual document understanding and grounding. In: International Conference on Multimedia Modeling. pp. 74–88. Springer (2026)