pith. machine review for the scientific record.

arxiv: 2604.05623 · v1 · submitted 2026-04-07 · 💻 cs.CV · cs.CL · cs.MM

Recognition: no theorem link

DetailVerifyBench: A Benchmark for Dense Hallucination Localization in Long Image Captions

Authors on Pith: no claims yet

Pith reviewed 2026-05-10 18:34 UTC · model grok-4.3

classification 💻 cs.CV · cs.CL · cs.MM
keywords hallucination localization · image captioning · multimodal large language models · benchmark dataset · dense annotations · long-form captions · error detection

The pith

DetailVerifyBench supplies 1,000 images with token-level hallucination annotations in captions averaging over 200 words.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces DetailVerifyBench to fill the gap left by existing benchmarks that only detect broad inconsistencies in image captions rather than pinpointing exact erroneous words or spans. Current multimodal models generate long, narrative-style captions, so evaluation must move to dense, fine-grained localization across hundreds of tokens and multiple hallucination types. The benchmark uses 1,000 images drawn from five domains, each paired with a lengthy human-written caption and exhaustive annotations marking where errors occur and what kind they are. This setup matters because reliable detailed captioning requires models to know precisely where they have invented or distorted visual content.

Core claim

The authors construct DetailVerifyBench as a collection of 1,000 high-quality images spanning five distinct domains, each accompanied by a caption longer than 200 words on average and equipped with dense, token-level annotations that identify multiple categories of hallucinations. They position the result as the most demanding test currently available for exact localization of errors in long image captions.

What carries the argument

DetailVerifyBench, a dataset of images and densely annotated long captions that supports token-by-token evaluation of hallucination localization.
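
To make the shape of such a resource concrete, here is a minimal sketch of what one record could look like, assuming a simple span-based format; the field names and the example hallucination categories are illustrative assumptions, not the paper's released schema.

```python
from dataclasses import dataclass, field

@dataclass
class HallucinationSpan:
    """One annotated error inside a caption (all fields are illustrative)."""
    start_token: int   # index of the first hallucinated token
    end_token: int     # index one past the last hallucinated token
    category: str      # hypothetical labels, e.g. "object", "attribute", "count"
    note: str = ""     # optional annotator comment

@dataclass
class AnnotatedCaption:
    """An image paired with a long caption and its dense annotations."""
    image_id: str
    domain: str                     # one of the benchmark's five domains
    tokens: list[str]               # the caption split into tokens
    spans: list[HallucinationSpan] = field(default_factory=list)

# Toy record: the word "three" is marked as a count hallucination.
example = AnnotatedCaption(
    image_id="beach_0042",
    domain="natural_scene",
    tokens="A child flies three red kites above the shoreline".split(),
    spans=[HallucinationSpan(start_token=3, end_token=4, category="count")],
)
```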

If this is right

  • Models can now be scored on their ability to name the exact location and type of each error instead of receiving only a pass-fail verdict (a scoring sketch follows this list).
  • Development of localization-aware training methods becomes measurable across a range of caption lengths and visual domains.
  • Comparison of multimodal models gains a shared, granular reference that isolates failures at the word or phrase level.
  • The five-domain coverage makes it possible to check whether localization performance holds outside narrow image categories.
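
As referenced in the first bullet above, here is a minimal sketch of span-level scoring under assumed conventions: exact-match precision, recall, and F1 over (start, end, category) triples. The exact-boundary matching rule is an assumption; the paper may define credit for partial overlaps differently.

```python
def span_scores(predicted, gold):
    """Exact-match precision/recall/F1 over (start_token, end_token, category) triples."""
    pred_set, gold_set = set(predicted), set(gold)
    true_pos = len(pred_set & gold_set)
    precision = true_pos / len(pred_set) if pred_set else 0.0
    recall = true_pos / len(gold_set) if gold_set else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# A model that finds the gold count error but also invents a spurious attribute error:
print(span_scores(predicted=[(3, 4, "count"), (7, 8, "attribute")],
                  gold=[(3, 4, "count")]))   # (0.5, 1.0, 0.666...)
```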

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Widespread use of the benchmark would likely push research toward methods that output explicit error locations rather than confidence scores alone.
  • The dataset could serve as a training signal for models that learn to revise their own captions by first identifying hallucinated segments.
  • Similar dense-annotation approaches may prove useful for evaluating long-form outputs in related tasks such as video narration or document description.

Load-bearing premise

The human annotations correctly and comprehensively mark every hallucination without systematic bias or omission.

What would settle it

Independent re-annotation of a random subset of the captions by separate annotators: substantially different hallucination spans or types would undermine the premise, while close agreement would support it.
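
One way such a re-annotation check could be quantified, sketched under assumptions: project each annotator's spans onto per-token binary labels and compute token-level Cohen's kappa; consistently low agreement on the re-annotated subset would undermine the load-bearing premise.

```python
def spans_to_token_labels(spans, num_tokens):
    """Binary label per token: 1 if the token falls inside any annotated span."""
    labels = [0] * num_tokens
    for start, end in spans:
        for i in range(start, min(end, num_tokens)):
            labels[i] = 1
    return labels

def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa between two equal-length binary label sequences."""
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    p_a, p_b = sum(labels_a) / n, sum(labels_b) / n
    expected = p_a * p_b + (1 - p_a) * (1 - p_b)
    return 1.0 if expected == 1.0 else (observed - expected) / (1 - expected)

# Two annotators labelling the same 12-token caption: they agree on one span,
# but the second annotator marks an extra span the first missed.
a = spans_to_token_labels([(3, 5)], num_tokens=12)
b = spans_to_token_labels([(3, 5), (9, 11)], num_tokens=12)
print(round(cohen_kappa(a, b), 3))   # moderate agreement, well below 1.0
```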

Figures

Figures reproduced from arXiv: 2604.05623 by Haolong Yan, Hongbing Li, Kongming Liang, Muxi Diao, Songyu Xu, Xiao Zhang, Xinran Wang, Yuxuan Zhang, Zhanyu Ma, Zhonghao Yan.

Figure 1: The difference between the task of hallucination detection and hallucination localization.
Figure 2: The pipeline for building the DetailVerifyBench.
Figure 3: Distribution of 10 hallucination dimensions across …
Figure 4: (a) Comparison of injection methods; (b) Hallucination counts & …
Figure 5: Visualization of a GUI image example with real …
read the original abstract

Accurately detecting and localizing hallucinations is a critical task for ensuring high reliability of image captions. In the era of Multimodal Large Language Models (MLLMs), captions have evolved from brief sentences into comprehensive narratives, often spanning hundreds of words. This shift exponentially increases the challenge: models must now pinpoint specific erroneous spans or words within extensive contexts, rather than merely flag response-level inconsistencies. However, existing benchmarks lack the fine granularity and domain diversity required to evaluate this capability. To bridge this gap, we introduce DetailVerifyBench, a rigorous benchmark comprising 1,000 high-quality images across five distinct domains. With an average caption length of over 200 words and dense, token-level annotations of multiple hallucination types, it stands as the most challenging benchmark for precise hallucination localization in the field of long image captioning to date. Our benchmark is available at https://zyx-hhnkh.github.io/DetailVerifyBench/.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

2 major / 0 minor

Summary. The paper introduces DetailVerifyBench, a benchmark consisting of 1,000 high-quality images across five domains, with long captions (average length >200 words) and dense token-level annotations for multiple hallucination types in MLLM-generated image captions. It positions the resource as the most challenging benchmark to date for precise hallucination localization in long-form captioning and provides a public link for access.

Significance. If the annotations can be validated as accurate and consistent, the benchmark would address a clear gap in existing resources by enabling fine-grained evaluation of hallucination localization rather than coarse response-level detection. The scale, domain diversity, and public release represent strengths that could support reproducible progress in MLLM reliability research.

major comments (2)
  1. [Abstract] The claim that DetailVerifyBench 'stands as the most challenging benchmark for precise hallucination localization... to date' rests on assertions of scale, caption length, annotation density, and domain diversity, yet the manuscript provides no details on annotation methodology, inter-annotator agreement scores, annotation guidelines, or any held-out validation subset. This information is load-bearing for the central claim of rigor and superiority.
  2. [Abstract] The weakest assumption underlying the benchmark's utility is that the human annotations are accurate, comprehensive, and free of systematic bias or omissions. Without reported agreement metrics or quality-control procedures, inconsistencies in localizing erroneous spans within >200-word narratives could undermine comparative difficulty claims and downstream model evaluations.

Simulated Author's Rebuttal

2 responses · 1 unresolved

Thank you for the constructive feedback on our manuscript introducing DetailVerifyBench. We have carefully considered the referee's concerns regarding the documentation of the annotation process and the substantiation of our claims. We agree that additional details are necessary to fully support the benchmark's positioning and will incorporate them in the revised version. Our point-by-point responses follow.

read point-by-point responses
  1. Referee: [Abstract] The claim that DetailVerifyBench 'stands as the most challenging benchmark for precise hallucination localization... to date' rests on assertions of scale, caption length, annotation density, and domain diversity, yet the manuscript provides no details on annotation methodology, inter-annotator agreement scores, annotation guidelines, or any held-out validation subset. This information is load-bearing for the central claim of rigor and superiority.

    Authors: We agree that the current version of the manuscript lacks sufficient details on the annotation methodology to fully substantiate the claim. The positioning as the most challenging benchmark is based on the described attributes: 1,000 images across five domains, average caption lengths exceeding 200 words, and dense token-level annotations for multiple hallucination types. These features differentiate it from existing benchmarks. In the revised manuscript, we will expand the abstract and add a dedicated methods section describing the annotation guidelines, process, inter-annotator agreement where measured, and validation procedures. This will provide the load-bearing information requested. revision: yes

  2. Referee: [Abstract] The weakest assumption underlying the benchmark's utility is that the human annotations are accurate, comprehensive, and free of systematic bias or omissions. Without reported agreement metrics or quality-control procedures, inconsistencies in localizing erroneous spans within >200-word narratives could undermine comparative difficulty claims and downstream model evaluations.

    Authors: We acknowledge this as a valid concern and agree on the importance of transparent quality assurance for long-form annotations. The annotations in DetailVerifyBench were created with the goal of high accuracy and comprehensiveness, using structured guidelines to identify hallucinations at a fine-grained level. The revised version will include a thorough description of the quality-control procedures implemented to reduce bias and omissions. Additionally, we will report any inter-annotator agreement metrics or other validation steps performed. These additions should alleviate concerns about potential inconsistencies affecting evaluations. revision: partial

standing simulated objections · not resolved
  • Specific inter-annotator agreement scores and detailed validation subset results, which were not computed or documented in the original benchmark creation process.

Circularity Check

0 steps flagged

No circularity: benchmark introduction is self-contained resource creation

full rationale

The paper presents DetailVerifyBench as an externally constructed dataset (1,000 images across five domains, >200-word captions, dense token-level hallucination annotations) without any derivation chain, equations, fitted parameters, or predictions that reduce to self-defined inputs. No self-citation load-bearing steps, uniqueness theorems, or ansatzes appear in the provided text; the 'most challenging' claim rests on descriptive scale and granularity rather than reducing by construction to prior author work or fitted quantities. This matches the default expectation of no significant circularity for benchmark papers.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The central claim rests on the assumption that the created annotations and image selection constitute a valid and challenging test, but the abstract introduces no explicit free parameters, mathematical axioms, or new invented entities.

pith-pipeline@v0.9.0 · 5495 in / 1028 out tokens · 64548 ms · 2026-05-10T18:34:55.156835+00:00 · methodology

