pith. machine review for the scientific record.

arxiv: 2604.22822 · v1 · submitted 2026-04-18 · 💻 cs.CV · cs.AI


DO-Bench: An Attributable Benchmark for Diagnosing Object Hallucination in Vision-Language Models

Jiawei Chen, JiYang Wang, Mengqi Xiao, Yangfu Li, Yu Cheng, Zhaoxia Yin


Pith reviewed 2026-05-10 06:30 UTC · model grok-4.3

classification 💻 cs.CV cs.AI
keywords object hallucination · vision-language models · diagnostic benchmark · prior influence · perceptual grounding · error attribution

The pith

Object hallucinations in vision-language models stem from two distinct mechanisms, prior influence and perceptual weakness, that a controlled benchmark can measure separately.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces DO-Bench to diagnose why vision-language models wrongly claim objects exist in images. It creates two test dimensions: one that adds stronger text hints while keeping the image the same to see if models resist false suggestions, and another that shows clearer visual details to see if models can actually see the objects. By measuring resistance to priors and strength of perception separately, the benchmark reveals that different models fail in different ways rather than all having the same problem. This matters because aggregate accuracy scores hide the real causes, making it hard to fix hallucinations effectively.

Core claim

DO-Bench isolates the sources of object hallucination through Prior Override, which strengthens textual priors with constant visuals, and Perception-Limited, which increases visual evidence from full scenes to localized crops. It defines PriorRobust and PerceptionAbility metrics to quantify these behaviors consistently across open- and closed-source models, revealing that object hallucination reflects heterogeneous, mechanism-dependent failure patterns beyond aggregate accuracy.

What carries the argument

DO-Bench's paired Prior Override and Perception-Limited dimensions, which hold one factor constant while varying the other to attribute errors to prior suppression, perceptual insufficiency, or their interaction.
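The paper's exact metric formulas are not reproduced on this page, but the behaviors they are said to capture can be sketched. The following is a minimal, hypothetical operationalization consistent with the description above: PriorRobust as the fraction of neutral-prior-correct decisions that survive the strongest prior, and PerceptionAbility as the fraction of full-view errors recovered under the crop view. Function names and data layouts are illustrative, not from the released benchmark.

```python
# Hypothetical sketch of DO-Bench-style diagnostic scoring, assuming:
#  - PriorRobust penalizes correct answers overturned as textual priors strengthen
#  - PerceptionAbility rewards errors recovered as visual evidence is concentrated
# Neither formula is taken from the paper itself.

def prior_robust(correct_by_level):
    """correct_by_level: {prior_level: [bool per sample]}, level 0 = neutral.
    Fraction of level-0-correct samples still correct at the strongest level."""
    base = correct_by_level[0]
    strongest = correct_by_level[max(correct_by_level)]
    survived = sum(b and s for b, s in zip(base, strongest))
    denom = sum(base)
    return survived / denom if denom else 0.0

def perception_ability(correct_by_view):
    """correct_by_view: {'full': [bool per sample], 'crop': [bool per sample]}.
    Fraction of full-view errors recovered under the localized crop view."""
    errors = [not c for c in correct_by_view['full']]
    recovered = sum(e and c for e, c in zip(errors, correct_by_view['crop']))
    denom = sum(errors)
    return recovered / denom if denom else 1.0
```

On toy data, a model correct on samples {1,2,3} at the neutral prior but only on {1,3} at the strongest prior would score PriorRobust = 2/3, independently of how it fares on the perception dimension.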

If this is right

  • Different models show distinct profiles of prior sensitivity and perceptual reliability.
  • Errors can be attributed to specific causes rather than reported only as overall accuracy.
  • Aggregate benchmarks alone cannot distinguish the underlying mechanisms of hallucination.
  • Targeted fixes become possible once the dominant failure mode for a given model is identified.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Developers could select or fine-tune models based on their measured profiles for tasks where one failure mode dominates.
  • The paired-intervention design could be adapted to diagnose other hallucination types such as attributes or spatial relations.
  • Training data mixtures might be adjusted to strengthen perceptual grounding in models that score low on that metric.

Load-bearing premise

The structured multimodal interventions in the two dimensions truly isolate prior influence from perceptual grounding without introducing new confounds or altering model behavior in unintended ways.

What would settle it

A finding that models' error rates do not shift as expected when textual priors are strengthened or when visual evidence is improved from full scenes to object crops, or that the two dimensions interact in ways the metrics cannot separate.
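That falsification condition has a directly checkable shape. A minimal sketch, assuming the expected trends hold: error rates should not fall as priors strengthen (A0→A3) and should not rise as evidence is concentrated (Full→Cluster→Crop). The level names follow the paper's figures; the check itself is illustrative.

```python
# Hypothetical trend check: if either monotonicity fails for a model,
# the benchmark's expected error-rate shifts did not materialize.

def is_monotone(seq, increasing=True):
    pairs = list(zip(seq, seq[1:]))
    if increasing:
        return all(a <= b for a, b in pairs)
    return all(a >= b for a, b in pairs)

def trends_hold(err_by_prior, err_by_view):
    """err_by_prior: error rates at prior levels A0..A3 (same image);
    err_by_view: error rates for [Full, Cluster, Crop] under neutral priors."""
    return is_monotone(err_by_prior, increasing=True) and \
           is_monotone(err_by_view, increasing=False)
```

For example, `trends_hold([0.10, 0.15, 0.22, 0.30], [0.30, 0.18, 0.10])` matches the expected pattern, while a model whose errors fall as priors strengthen would fail the check, which is exactly the kind of finding that would undermine the metrics.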

Figures

Figures reproduced from arXiv: 2604.22822 by Jiawei Chen, JiYang Wang, Mengqi Xiao, Yangfu Li, Yu Cheng, Zhaoxia Yin.

Figure 1
Figure 1. Pilot study. Two minimal interventions on object existence verification: target-focused cropping (crop) strengthens visual evidence, while contextual prior strengthening (prior) increases prior pressure under the same image. (a) Cropping substantially recovers false denials (correctness gain), whereas stronger priors increase errors by overturning originally correct decisions, for both LLaVA-v1.5-7B and … view at source ↗
Figure 2
Figure 2. DO-Bench constructs a fixed 10-instance set from each scene via controlled interventions over two dimensions. Contextual prior strength is varied across four levels for both a present-but-anomalous A-object (A0–A3) and an absent-but-expected B-object (B0–B3) under the Full view. Under neutral contextual priors (A0), visual evidence for the A-object is concentrated through Cluster and Crop views. These in… view at source ↗
Figure 3
Figure 3. Object category distribution in DO-Bench. (a) Present A-objects (66 unique categories). (b) Absent B-objects (77 unique categories). Numbers indicate the number of scene groups. Across the 124 scenes, DO-Bench covers 66 unique A-object categories and 77 unique B-object categories, resulting in 1,240 total evaluation samples. view at source ↗
Figure 4
Figure 4. Trend-based behavioral analysis on DO-Bench. view at source ↗
Figure 5
Figure 5. Prior–evidence orthogonality in the InternVL2.5 family. view at source ↗
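The per-scene arithmetic in the Figure 2 and Figure 3 captions can be spelled out: four prior levels for the A-object plus four for the B-object under the Full view, then Cluster and Crop views under neutral priors, giving the fixed 10-instance set, and 124 scenes × 10 instances = 1,240 samples. The labels below are illustrative, not identifiers from the released benchmark.

```python
# Sketch of the fixed 10-instance set per scene described in the captions.
# Labels are hypothetical stand-ins for the benchmark's actual instance IDs.

def scene_instances():
    full_view = [f"A{i}" for i in range(4)] + [f"B{i}" for i in range(4)]  # A0–A3, B0–B3
    evidence_views = ["A0-Cluster", "A0-Crop"]  # concentrated evidence, neutral prior
    return full_view + evidence_views

instances = scene_instances()
assert len(instances) == 10           # fixed 10-instance set per scene
assert 124 * len(instances) == 1240   # matches the 1,240 evaluation samples
```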
read the original abstract

Object level hallucination remains a central reliability challenge for vision language models (VLMs), particularly in binary object existence verification. Existing benchmarks emphasize aggregate accuracy but rarely disentangle whether errors stem from perceptual limitations or from the influence of contextual textual priors, leaving underlying failure mechanisms ambiguous. We introduce DO-Bench, a controlled diagnostic benchmark that isolates these sources through structured multimodal interventions. Rather than evaluating models in unconstrained settings, DO-Bench probes two complementary dimensions: the Prior Override dimension progressively strengthens contextual textual priors while holding visual evidence constant to assess resistance to prior pressure, and the Perception-Limited dimension incrementally enhances visual evidence from full-scene context to localized object crops to measure perceptual grounding strength. This paired design enables attribution of errors to prior suppression, perceptual insufficiency, or their interaction. We further define two diagnostic metrics, PriorRobust and PerceptionAbility, to quantify these behaviors consistently. Evaluations across diverse open- and closed-source VLMs reveal systematic differences in prior sensitivity and perceptual reliability, demonstrating that object hallucination reflects heterogeneous, mechanism dependent failure patterns beyond aggregate accuracy.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 0 minor

Summary. The manuscript introduces DO-Bench, a controlled diagnostic benchmark for object hallucination in vision-language models. It uses two complementary dimensions of structured multimodal interventions: Prior Override, which progressively strengthens contextual textual priors while holding the image fixed, and Perception-Limited, which varies visual input from full scenes to localized object crops. This paired design is intended to attribute errors to prior suppression, perceptual insufficiency, or their interaction. Two new metrics, PriorRobust and PerceptionAbility, are defined to quantify these behaviors, and evaluations across diverse open- and closed-source VLMs are reported to demonstrate heterogeneous, mechanism-dependent failure patterns beyond aggregate accuracy.

Significance. If the interventions are shown to isolate the targeted factors without confounds, DO-Bench would advance the field by supplying a diagnostic tool that moves beyond aggregate accuracy to reveal specific mechanisms of object hallucination in VLMs. The explicit separation of prior pressure and perceptual grounding, together with the two new metrics, offers a reproducible framework for comparing model behaviors across architectures.

major comments (2)
  1. [Abstract and Benchmark Design sections] The central claim that the Prior Override and Perception-Limited interventions cleanly isolate prior influence from perceptual grounding lacks any validation data, control experiments, or error analysis demonstrating that the interventions achieve their intended separation. In joint vision-language models, appending textual priors can alter cross-attention and tokenization even when pixel values remain identical, while object crops can remove global scene context used for disambiguation; without evidence that these side effects are negligible, error attribution remains ambiguous. This issue is load-bearing for the diagnostic value of the benchmark.
  2. [Evaluations section] The evaluations are described as revealing systematic differences in prior sensitivity and perceptual reliability, yet the manuscript supplies no quantitative results, model-specific breakdowns, or ablation studies that tie observed errors directly to the isolated factors rather than to unintended effects of the interventions. This leaves the claim of 'mechanism dependent failure patterns' unsupported by the presented evidence.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback, which identifies key areas where additional validation would strengthen the diagnostic claims of DO-Bench. We address each major comment below and outline targeted revisions to provide the requested evidence for intervention isolation and mechanism attribution.

read point-by-point responses
  1. Referee: [Abstract and Benchmark Design sections] The central claim that the Prior Override and Perception-Limited interventions cleanly isolate prior influence from perceptual grounding lacks any validation data, control experiments, or error analysis demonstrating that the interventions achieve their intended separation. In joint vision-language models, appending textual priors can alter cross-attention and tokenization even when pixel values remain identical, while object crops can remove global scene context used for disambiguation; without evidence that these side effects are negligible, error attribution remains ambiguous. This issue is load-bearing for the diagnostic value of the benchmark.

    Authors: We agree that explicit validation of factor isolation is essential and currently absent from the manuscript. The design intends separation by holding the image fixed while strengthening textual priors (Prior Override) and by varying visual evidence granularity while holding text fixed (Perception-Limited), but we acknowledge potential side effects such as attention shifts or loss of scene context. In revision, we will add control experiments including neutral-text baselines, attention-map analyses to quantify unintended cross-attention changes, and error-pattern comparisons under minimal interventions. These will empirically demonstrate that side effects are limited and support the intended attribution. revision: yes

  2. Referee: [Evaluations section] The evaluations are described as revealing systematic differences in prior sensitivity and perceptual reliability, yet the manuscript supplies no quantitative results, model-specific breakdowns, or ablation studies that tie observed errors directly to the isolated factors rather than to unintended effects of the interventions. This leaves the claim of 'mechanism dependent failure patterns' unsupported by the presented evidence.

    Authors: The evaluations report performance across open- and closed-source VLMs on the two dimensions, indicating heterogeneous behaviors. However, we concur that the current presentation lacks the granular quantitative breakdowns and ablations needed to directly link errors to the targeted factors. We will revise the section to include per-model metric tables, error attribution breakdowns that map failures to prior versus perception dimensions, and ablation studies that systematically vary one intervention while holding the other constant. These additions will provide the quantitative support for mechanism-dependent patterns. revision: yes

Circularity Check

0 steps flagged

No circularity: a benchmark proposal whose interventions and metrics are evaluated against external models

full rationale

The paper introduces DO-Bench as a diagnostic benchmark using structured multimodal interventions (Prior Override and Perception-Limited dimensions) and defines two metrics (PriorRobust and PerceptionAbility). No mathematical derivations, equations, parameter fittings, or predictions are present that reduce to inputs by construction. The central claims rest on the design of the benchmark and on evaluations of external models rather than self-referential steps or self-citation chains. The work's value depends on whether the interventions isolate the intended factors, which is a correctness question, not a circularity.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 2 invented entities

The central claim depends on the untested premise that the controlled interventions cleanly separate prior and perceptual effects.

axioms (1)
  • domain assumption Structured changes to textual priors and visual crops isolate the intended failure mechanisms without side effects on model behavior.
    Invoked in the description of the Prior Override and Perception-Limited dimensions.
invented entities (2)
  • PriorRobust metric no independent evidence
    purpose: Quantify model resistance to increasing textual prior pressure
    Newly introduced diagnostic score with no external validation or prior literature reference in the abstract.
  • PerceptionAbility metric no independent evidence
    purpose: Quantify model perceptual grounding as visual evidence increases
    Newly introduced diagnostic score with no external validation or prior literature reference in the abstract.

pith-pipeline@v0.9.0 · 5498 in / 1241 out tokens · 27963 ms · 2026-05-10T06:30:06.387027+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

45 extracted references · 16 canonical work pages · 4 internal anchors

  1. [1]

    Qwen2.5-VL Technical Report

    Bai, S., Chen, K., Liu, X., Wang, J., Ge, W., Song, S., Dang, K., Wang, P., Wang, S., Tang, J., Zhong, H., Zhu, Y., Yang, M., Li, Z., Wan, J., Wang, P., Ding, W., Fu, Z., Xu, Y., Ye, J., Zhang, X., Xie, T., Cheng, Z., Zhang, H., Yang, Z., Xu, H., Lin, J.: Qwen2.5-VL technical report. arXiv abs/2502.13923 (2025), https://api.semanticscholar.org/CorpusID:276449796

  2. [2]

    Exploring the Secondary Risks of Large Language Models

    Chen, J., Fang, Z., Yang, X., Yu, C., Yin, Z., Su, H.: Exploring the secondary risks of large language models. arXiv abs/2506.12382 (2025), https://api.semanticscholar.org/CorpusID:279402629

  3. [3]

    AutoBreach: Universal and Adaptive Jailbreaking with Efficient Wordplay-Guided Optimization via Multi-LLMs

    Chen, J., Yang, X., Fang, Z., Tian, Y., Dong, Y., Yin, Z., Su, H.: AutoBreach: Universal and adaptive jailbreaking with efficient wordplay-guided optimization via multi-LLMs. In: Chiruzzo, L., Ritter, A., Wang, L. (eds.) Findings of the Association for Computational Linguistics: NAACL 2025. pp. 6792–6813. Association for Computational Linguistics, Alb...

  4. [4]

    Multi-Object Hallucination in Vision Language Models

    Chen, X., Ma, Z., Zhang, X., Xu, S., Qian, S., Yang, J., Fouhey, D., Chai, J.: Multi-object hallucination in vision language models. Advances in Neural Information Processing Systems 37, 44393–44418 (2024)

  5. [5]

    Expanding Performance Boundaries of Open-Source Multimodal Models with Model, Data, and Test-Time Scaling

    Chen, Z., Wang, W., Cao, Y., Liu, Y., Gao, Z., Cui, E., Zhu, J., Ye, S., Tian, H., Liu, Z., et al.: Expanding performance boundaries of open-source multimodal models with model, data, and test-time scaling. arXiv preprint arXiv:2412.05271 (2024)

  6. [6]

    DoLa: Decoding by Contrasting Layers Improves Factuality in Large Language Models

    Chuang, Y.S., Xie, Y., Luo, H., Kim, Y., Glass, J., He, P.: DoLa: Decoding by contrasting layers improves factuality in large language models. arXiv preprint arXiv:2309.03883 (2023)

  7. [7]

    HumanVLM: Foundation for Human-Scene Vision-Language Model

    Dai, D., Xu, L., Li, Y., Zhang, Y., Xia, S.: HumanVLM: Foundation for human-scene vision-language model. Information Fusion 123, 103271 (2025)

  8. [8]

    InstructBLIP: Towards General-Purpose Vision-Language Models with Instruction Tuning

    Dai, W., Li, J., Li, D., Tiong, A., Zhao, J., Wang, W., Li, B., Fung, P.N., Hoi, S.: InstructBLIP: Towards general-purpose vision-language models with instruction tuning. Advances in Neural Information Processing Systems 36, 49250–49267 (2023)

  9. [9]

    Plausible May Not Be Faithful: Probing Object Hallucination in Vision-Language Pre-training

    Dai, W., Liu, Z., Ji, Z., Su, D., Fung, P.: Plausible may not be faithful: Probing object hallucination in vision-language pre-training. In: Proceedings of the 17th Conference of the European Chapter of the Association for Computational Linguistics. pp. 2136–2148 (2023)

  10. [10]

    A Comprehensive Survey of Vision-Language Models: Pretrained Models, Fine-Tuning, Prompt Engineering, Adapters, and Benchmark Datasets

    Danish, S., Sadeghi-Niaraki, A., Khan, S.U., Dang, L.M., Tightiz, L., Moon, H.: A comprehensive survey of vision-language models: Pretrained models, fine-tuning, prompt engineering, adapters, and benchmark datasets. Information Fusion p. 103623 (2025)

  11. [11]

    Vision-Language Model for Object Detection and Segmentation: A Review and Evaluation

    Feng, Y., Liu, Y., Yang, S., Cai, W., Zhang, J., Zhan, Q., Huang, Z., Yan, H., Wan, Q., Liu, C., et al.: Vision-language model for object detection and segmentation: A review and evaluation. arXiv preprint arXiv:2504.09480 (2025)

  12. [12]

    MME: A Comprehensive Evaluation Benchmark for Multimodal Large Language Models

    Fu, C., Chen, P., Shen, Y., Qin, Y., Zhang, M., Lin, X., Yang, J., Zheng, X., Li, K., Sun, X., et al.: MME: A comprehensive evaluation benchmark for multimodal large language models. arXiv preprint arXiv:2306.13394 (2023)

  13. [13]

    HallusionBench: An Advanced Diagnostic Suite for Entangled Language Hallucination and Visual Illusion in Large Vision-Language Models

    Guan, T., Liu, F., Wu, X., Xian, R., Li, Z., Liu, X., Wang, X., Chen, L., Huang, F., Yacoob, Y., et al.: HallusionBench: An advanced diagnostic suite for entangled language hallucination and visual illusion in large vision-language models. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 14375–14385 (2024)

  14. [14]

    Detecting and Preventing Hallucinations in Large Vision Language Models

    Gunjal, A., Yin, J., Bas, E.: Detecting and preventing hallucinations in large vision language models. In: Proceedings of the AAAI Conference on Artificial Intelligence. vol. 38, pp. 18135–18143 (2024)

  15. [15]

    A Review on Vision-Language-Based Approaches: Challenges and Applications

    Ho, H.T., Nguyen, L.V., Pham, M.T., Pham, Q.H., Tran, Q.D., Huy, D.N.M., Nguyen, T.H.: A review on vision-language-based approaches: Challenges and applications. Computers, Materials & Continua 82(2) (2025)

  16. [16]

    Steering LVLMs via Sparse Autoencoder for Hallucination Mitigation

    Hua, Z., He, J., Yao, Z., Han, T., Guo, H., Jia, Y., Fang, J.: Steering LVLMs via sparse autoencoder for hallucination mitigation. arXiv preprint arXiv:2505.16146 (2025)

  17. [17]

    OPERA: Alleviating Hallucination in Multi-Modal Large Language Models via Over-Trust Penalty and Retrospection-Allocation

    Huang, Q., Dong, X., Zhang, P., Wang, B., He, C., Wang, J., Lin, D., Zhang, W., Yu, N.: OPERA: Alleviating hallucination in multi-modal large language models via over-trust penalty and retrospection-allocation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 13418–13427 (2024)

  18. [18]

    Effectiveness Assessment of Recent Large Vision-Language Models

    Jiang, Y., Yan, X., Ji, G.P., Fu, K., Sun, M., Xiong, H., Fan, D.P., Khan, F.S.: Effectiveness assessment of recent large vision-language models. Visual Intelligence 2(1), 17 (2024)

  19. [19]

    THRONE: An Object-Based Hallucination Benchmark for the Free-Form Generations of Large Vision-Language Models

    Kaul, P., Li, Z., Yang, H., Dukler, Y., Swaminathan, A., Taylor, C., Soatto, S.: THRONE: An object-based hallucination benchmark for the free-form generations of large vision-language models. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 27228–27238 (2024)

  20. [20]

    Mitigating Object Hallucinations in Large Vision-Language Models through Visual Contrastive Decoding

    Leng, S., Zhang, H., Chen, G., Li, X., Lu, S., Miao, C., Bing, L.: Mitigating object hallucinations in large vision-language models through visual contrastive decoding. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 13872–13882 (2024)

  21. [21]

    Mitigating Object Hallucinations in MLLMs via Multi-Frequency Perturbations

    Li, S., Sun, J., Zheng, G., Fan, X., Shen, Y., Lu, Y., Xi, Z., Yang, Y., Tan, W., Ji, T., et al.: Mitigating object hallucinations in MLLMs via multi-frequency perturbations. arXiv preprint arXiv:2503.14895 (2025)

  22. [22]

    DeepScan: A Training-Free Framework for Visually Grounded Reasoning in Large Vision-Language Models

    Li, Y., Zhan, H., Chen, J., Gong, Y., Liu, Q., Lu, Y.: DeepScan: A training-free framework for visually grounded reasoning in large vision-language models. arXiv preprint arXiv:2603.03857 (2026)

  23. [23]

    Why 1+1<1 in Visual Token Pruning: Beyond Naïve Integration via Multi-Objective Balanced Covering

    Li, Y., Zhan, H., Chen, T., Liu, Q., Lu, Y.: Why 1+1<1 in visual token pruning: Beyond naïve integration via multi-objective balanced covering. arXiv preprint arXiv:2505.10118 (2025)

  24. [24]

    Evaluating Object Hallucination in Large Vision-Language Models

    Li, Y., Du, Y., Zhou, K., Wang, J., Zhao, W.X., Wen, J.R.: Evaluating object hallucination in large vision-language models. In: Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing. pp. 292–305 (2023)

  25. [25]

    Instance-Aware Visual Prompting Helps Multimodal Models See Better

    Lin, X., Wang, J., Fu, L., Yan, H., Ye, Q.: Instance-aware visual prompting helps multimodal models see better. Expert Systems with Applications p. 129373 (2025)

  26. [26]

    Visual Instruction Tuning

    Liu, H., Li, C., Wu, Q., Lee, Y.J.: Visual instruction tuning. Advances in Neural Information Processing Systems 36, 34892–34916 (2023)

  27. [27]

    Negative Object Presence Evaluation (NOPE) to Measure Object Hallucination in Vision-Language Models

    Lovenia, H., Dai, W., Cahyawijaya, S., Ji, Z., Fung, P.: Negative object presence evaluation (NOPE) to measure object hallucination in vision-language models. In: Proceedings of the 3rd Workshop on Advances in Language and Vision Research (ALVR). pp. 37–58 (2024)

  28. [28]

    Mitigating Hallucinations in Large Vision-Language Models by Self-Injecting Hallucinations

    Lu, Y., Zhang, Z., Yuan, C., Gao, J., Zhang, C., Qi, X., Li, B., Hu, W.: Mitigating hallucinations in large vision-language models by self-injecting hallucinations. arXiv preprint arXiv:2509.11287 (2025)

  29. [29]

    VALSE: A Task-Independent Benchmark for Vision and Language Models Centered on Linguistic Phenomena

    Parcalabescu, L., Cafagna, M., Muradjan, L., Frank, A., Calixto, I., Gatt, A.: VALSE: A task-independent benchmark for vision and language models centered on linguistic phenomena. In: Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). pp. 8253–8280 (2022)

  30. [30]

    Object Hallucination in Image Captioning

    Rohrbach, A., Hendricks, L.A., Burns, K., Darrell, T., Saenko, K.: Object hallucination in image captioning. In: Conference on Empirical Methods in Natural Language Processing (2018), https://api.semanticscholar.org/CorpusID:52176506

  31. [31]

    Hallucinogen: Benchmarking Hallucination in Implicit Reasoning within Large Vision Language Models

    Seth, A., Manocha, D., Agarwal, C.: Hallucinogen: Benchmarking hallucination in implicit reasoning within large vision language models. In: Proceedings of the 2nd Workshop on Uncertainty-Aware NLP (UncertaiNLP 2025). pp. 89–102 (2025)

  32. [32]

    From Pixels to Tokens: Revisiting Object Hallucinations in Large Vision-Language Models

    Shang, Y., Zeng, X., Zhu, Y., Yang, X., Fang, Z., Zhang, J., Chen, J., Liu, Z., Tian, Y.: From pixels to tokens: Revisiting object hallucinations in large vision-language models. In: Proceedings of the 33rd ACM International Conference on Multimedia. pp. 10496–10505. MM '25, Association for Computing Machinery, New York, NY, USA (2025). https://doi.org/10....

  33. [33]

    Can Linguistic Knowledge Improve Multimodal Alignment in Vision-Language Pretraining?

    Wang, F., Ding, L., Rao, J., Liu, Y., Shen, L., Ding, C.: Can linguistic knowledge improve multimodal alignment in vision-language pretraining? ACM Transactions on Multimedia Computing, Communications and Applications 20, 1–22 (2023), https://api.semanticscholar.org/CorpusID:261101220

  34. [34]

    AMBER: An LLM-Free Multi-Dimensional Benchmark for MLLMs Hallucination Evaluation

    Wang, J., Wang, Y., Xu, G., Zhang, J., Gu, Y., Jia, H., Wang, J., Xu, H., Yan, M., Zhang, J., et al.: AMBER: An LLM-free multi-dimensional benchmark for MLLMs hallucination evaluation. arXiv preprint arXiv:2311.07397 (2023)

  35. [35]

    Language-Aware Vision Transformer for Referring Segmentation

    Yang, Z., Wang, J., Ye, X., Tang, Y., Chen, K., Zhao, H., Torr, P.H.: Language-aware vision transformer for referring segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence 47(7), 5238–5255 (2024)

  36. [36]

    Multimodal Features Alignment for Vision–Language Object Tracking

    Ye, P., Xiao, G., Liu, J.: Multimodal features alignment for vision–language object tracking. Remote Sensing 16(7), 1168 (2024)

  37. [37]

    Woodpecker: Hallucination Correction for Multimodal Large Language Models

    Yin, S., Fu, C., Zhao, S., Xu, T., Wang, H., Sui, D., Shen, Y., Li, K., Sun, X., Chen, E.: Woodpecker: Hallucination correction for multimodal large language models. Science China Information Sciences 67(12), 220105 (2024)

  38. [38]

    MMMU: A Massive Multi-Discipline Multimodal Understanding and Reasoning Benchmark for Expert AGI

    Yue, X., Ni, Y., Zhang, K., Zheng, T., Liu, R., Zhang, G., Stevens, S., Jiang, D., Ren, W., Sun, Y., et al.: MMMU: A massive multi-discipline multimodal understanding and reasoning benchmark for expert AGI. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 9556–9567 (2024)

  39. [39]

    Investigating Compositional Challenges in Vision-Language Models for Visual Grounding

    Zeng, Y., Huang, Y., Zhang, J., Jie, Z., Chai, Z., Wang, L.: Investigating compositional challenges in vision-language models for visual grounding. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 14141–14151 (2024)

  40. [40]

    HallE-Switch: Rethinking and Controlling Object Existence Hallucinations in Large Vision-Language Models for Detailed Caption

    Zhai, B., Yang, S., Zhao, X., Xu, C., Shen, S., Zhao, D., Keutzer, K., Li, M., Yan, T., Fan, X.: HallE-Switch: Rethinking and controlling object existence hallucinations in large vision-language models for detailed caption (2023)

  41. [41]

    Vision-Language Models for Vision Tasks: A Survey

    Zhang, J., Huang, J., Jin, S., Lu, S.: Vision-language models for vision tasks: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence 46(8), 5625–5644 (2024)

  42. [42]

    Mitigating Image Captioning Hallucinations in Vision-Language Models

    Zhao, F., Zhang, C., Zhang, R., Wang, T., Li, X.: Mitigating image captioning hallucinations in vision-language models. In: 2025 IEEE 8th International Conference on Multimedia Information Processing and Retrieval (MIPR). pp. 297–302. IEEE (2025)

  43. [43]

    VL-CheckList: Evaluating Pre-trained Vision-Language Models with Objects, Attributes and Relations

    Zhao, T., Zhang, T., Zhu, M., Shen, H., Lee, K., Lu, X., Yin, J.: VL-CheckList: Evaluating pre-trained vision-language models with objects, attributes and relations. arXiv preprint arXiv:2207.00221 (2022)

  44. [44]

    Analyzing and Mitigating Object Hallucination in Large Vision-Language Models

    Zhou, Y., Cui, C., Yoon, J., Zhang, L., Deng, Z., Finn, C., Bansal, M., Yao, H.: Analyzing and mitigating object hallucination in large vision-language models. arXiv preprint arXiv:2310.00754 (2023)

  45. [45]

    MMDocBench: Benchmarking Large Vision-Language Models for Fine-Grained Visual Document Understanding and Grounding

    Zhu, F., Liu, Z., Yao, N.X., Wu, H., Wang, W., Feng, F., Wang, C., Luan, H., Chua, T.S.: MMDocBench: Benchmarking large vision-language models for fine-grained visual document understanding and grounding. In: International Conference on Multimedia Modeling. pp. 74–88. Springer (2026)