DetailVerifyBench: A Benchmark for Dense Hallucination Localization in Long Image Captions
Pith reviewed 2026-05-10 18:34 UTC · model grok-4.3
The pith
DetailVerifyBench supplies 1,000 images with token-level hallucination annotations in captions averaging over 200 words.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The authors construct DetailVerifyBench from 1,000 high-quality images spanning five distinct domains, each paired with a caption averaging more than 200 words and equipped with dense, token-level annotations that identify multiple categories of hallucination. On this basis they position it as the most demanding test currently available for exact localization of errors in long image captions.
What carries the argument
DetailVerifyBench, a dataset of images and densely annotated long captions that supports token-by-token evaluation of hallucination localization.
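The token-level format suggests a natural evaluation: score a model's predicted error spans against the gold annotations token by token. A minimal sketch, assuming a hypothetical record layout (the field names and span convention are illustrative, not the benchmark's actual schema):

```python
# Hypothetical annotation record; field names and span convention are
# illustrative assumptions, not DetailVerifyBench's actual schema.
gold = {
    "caption": "A red sedan is parked beside three oak trees on a quiet street.",
    "hallucinations": [
        {"start_token": 1, "end_token": 2, "type": "attribute"},  # "red"
        {"start_token": 6, "end_token": 7, "type": "count"},      # "three"
    ],
}

def span_tokens(spans):
    """Flatten [start_token, end_token) spans into a set of token indices."""
    return {i for s in spans for i in range(s["start_token"], s["end_token"])}

def token_prf(gold_spans, pred_spans):
    """Token-level precision / recall / F1 for hallucination localization."""
    g, p = span_tokens(gold_spans), span_tokens(pred_spans)
    tp = len(g & p)
    prec = tp / len(p) if p else 0.0
    rec = tp / len(g) if g else 0.0
    f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
    return prec, rec, f1

# A model that flags only "red" localizes half of the gold error tokens.
pred = [{"start_token": 1, "end_token": 2, "type": "attribute"}]
print(token_prf(gold["hallucinations"], pred))  # (1.0, 0.5, ~0.667)
```

A stricter variant would additionally require the predicted hallucination type to match the gold type before counting a token as a true positive, which is the kind of granular scoring the bullets below describe.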
If this is right
- Models can now be scored on their ability to name the exact location and type of each error instead of receiving only a pass-fail verdict.
- Development of localization-aware training methods becomes measurable across a range of caption lengths and visual domains.
- Comparison of multimodal models gains a shared, granular reference that isolates failures at the word or phrase level.
- The five-domain coverage makes it possible to check whether localization performance holds outside narrow image categories.
Where Pith is reading between the lines
- Widespread use of the benchmark would likely push research toward methods that output explicit error locations rather than confidence scores alone.
- The dataset could serve as a training signal for models that learn to revise their own captions by first identifying hallucinated segments.
- Similar dense-annotation approaches may prove useful for evaluating long-form outputs in related tasks such as video narration or document description.
Load-bearing premise
The human annotations correctly and comprehensively mark every hallucination without systematic bias or omission.
What would settle it
Independent re-annotation of a random subset of the captions by separate annotators: substantially different hallucination spans or types would undermine the premise, while close agreement would support it.
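That re-annotation test could be quantified with a chance-corrected agreement statistic over per-token labels. A sketch, purely illustrative of the idea rather than the authors' protocol, computing Cohen's kappa between two annotators' binary hallucination masks for one caption:

```python
def cohens_kappa(a, b):
    """Cohen's kappa for two equal-length binary label sequences
    (1 = token marked as hallucinated, 0 = token marked as faithful)."""
    assert len(a) == len(b) and len(a) > 0
    n = len(a)
    observed = sum(x == y for x, y in zip(a, b)) / n
    # Chance agreement from each annotator's marginal labeling rates.
    pa, pb = sum(a) / n, sum(b) / n
    expected = pa * pb + (1 - pa) * (1 - pb)
    if expected == 1.0:  # both annotators are fully constant
        return 1.0
    return (observed - expected) / (1 - expected)

# Two annotators mark the same 10-token caption.
ann1 = [0, 1, 1, 0, 0, 0, 1, 0, 0, 0]
ann2 = [0, 1, 0, 0, 0, 0, 1, 0, 0, 1]
print(round(cohens_kappa(ann1, ann2), 3))  # → 0.524
```

Averaged over a re-annotated subset, a kappa near 1 would indicate the spans are reproducible; a low kappa on 200-word captions would be exactly the failure mode the premise rules out.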
Original abstract
Accurately detecting and localizing hallucinations is a critical task for ensuring high reliability of image captions. In the era of Multimodal Large Language Models (MLLMs), captions have evolved from brief sentences into comprehensive narratives, often spanning hundreds of words. This shift exponentially increases the challenge: models must now pinpoint specific erroneous spans or words within extensive contexts, rather than merely flag response-level inconsistencies. However, existing benchmarks lack the fine granularity and domain diversity required to evaluate this capability. To bridge this gap, we introduce DetailVerifyBench, a rigorous benchmark comprising 1,000 high-quality images across five distinct domains. With an average caption length of over 200 words and dense, token-level annotations of multiple hallucination types, it stands as the most challenging benchmark for precise hallucination localization in the field of long image captioning to date. Our benchmark is available at https://zyx-hhnkh.github.io/DetailVerifyBench/.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces DetailVerifyBench, a benchmark consisting of 1,000 high-quality images across five domains, with long captions (average length >200 words) and dense token-level annotations for multiple hallucination types in MLLM-generated image captions. It positions the resource as the most challenging benchmark to date for precise hallucination localization in long-form captioning and provides a public link for access.
Significance. If the annotations can be validated as accurate and consistent, the benchmark would address a clear gap in existing resources by enabling fine-grained evaluation of hallucination localization rather than coarse response-level detection. The scale, domain diversity, and public release represent strengths that could support reproducible progress in MLLM reliability research.
Major comments (2)
- [Abstract] The claim that DetailVerifyBench 'stands as the most challenging benchmark for precise hallucination localization... to date' rests on assertions of scale, caption length, annotation density, and domain diversity, yet the manuscript provides no details on annotation methodology, inter-annotator agreement scores, annotation guidelines, or any held-out validation subset. This information is load-bearing for the central claim of rigor and superiority.
- [Abstract] The weakest assumption underlying the benchmark's utility is that the human annotations are accurate, comprehensive, and free of systematic bias or omissions. Without reported agreement metrics or quality-control procedures, inconsistencies in localizing erroneous spans within >200-word narratives could undermine comparative difficulty claims and downstream model evaluations.
Simulated Author's Rebuttal
Thank you for the constructive feedback on our manuscript introducing DetailVerifyBench. We have carefully considered the referee's concerns regarding the documentation of the annotation process and the substantiation of our claims. We agree that additional details are necessary to fully support the benchmark's positioning and will incorporate them in the revised version. Our point-by-point responses follow.
Point-by-point responses
Referee: [Abstract] The claim that DetailVerifyBench 'stands as the most challenging benchmark for precise hallucination localization... to date' rests on assertions of scale, caption length, annotation density, and domain diversity, yet the manuscript provides no details on annotation methodology, inter-annotator agreement scores, annotation guidelines, or any held-out validation subset. This information is load-bearing for the central claim of rigor and superiority.
Authors: We agree that the current version of the manuscript lacks sufficient detail on the annotation methodology to fully substantiate the claim. The positioning as the most challenging benchmark is based on the described attributes: 1,000 images across five domains, average caption lengths exceeding 200 words, and dense token-level annotations for multiple hallucination types. These features differentiate it from existing benchmarks. In the revised manuscript, we will expand the abstract and add a dedicated methods section describing the annotation guidelines, the annotation process, inter-annotator agreement where measured, and validation procedures. This will provide the load-bearing information requested. Revision: yes.
Referee: [Abstract] The weakest assumption underlying the benchmark's utility is that the human annotations are accurate, comprehensive, and free of systematic bias or omissions. Without reported agreement metrics or quality-control procedures, inconsistencies in localizing erroneous spans within >200-word narratives could undermine comparative difficulty claims and downstream model evaluations.
Authors: We acknowledge this as a valid concern and recognize the importance of transparent quality assurance for long-form annotations. The annotations in DetailVerifyBench were created with the goal of high accuracy and comprehensiveness, using structured guidelines to identify hallucinations at a fine-grained level. The revised version will include a thorough description of the quality-control procedures implemented to reduce bias and omissions. Additionally, we will report any inter-annotator agreement metrics or other validation steps performed. These additions should alleviate concerns about potential inconsistencies affecting evaluations. Revision: partial.
- Not available for the revision: specific inter-annotator agreement scores and detailed validation-subset results, which were not computed or documented during the original benchmark creation process.
Circularity Check
No circularity: benchmark introduction is self-contained resource creation
Full rationale
The paper presents DetailVerifyBench as an externally constructed dataset (1,000 images across five domains, >200-word captions, dense token-level hallucination annotations) without any derivation chain, equations, fitted parameters, or predictions that reduce to self-defined inputs. No self-citation load-bearing steps, uniqueness theorems, or ansatzes appear in the provided text; the 'most challenging' claim rests on descriptive scale and granularity rather than reducing by construction to prior author work or fitted quantities. This matches the default expectation of no significant circularity for benchmark papers.