VCap: Hypergeometric Rewards for Weak-to-Strong Visual Captioning

Bin Wen; Changyi Liu; Chun Yuan; Fan Yang; Han Li; Haonan Fan; Jinpeng Wang; Kaiyu Jiang; Tianke Zhang; Tingting Gao

arxiv: 2605.28023 · v1 · pith:2NDS3QTKnew · submitted 2026-05-27 · 💻 cs.CV · cs.AI· cs.CL· cs.MM

VCap: Hypergeometric Rewards for Weak-to-Strong Visual Captioning

Xingyu Lu , Jinpeng Wang , Yi-Fan Zhang , Yankai Yang , Yancheng Long , Yiyang Fan , Xuanyu Zheng , Haonan Fan

show 8 more authors

Kaiyu Jiang Tianke Zhang Changyi Liu Bin Wen Fan Yang Tingting Gao Han Li Chun Yuan

This is my paper

Pith reviewed 2026-06-29 13:11 UTC · model grok-4.3

classification 💻 cs.CV cs.AIcs.CLcs.MM

keywords visual captioningreinforcement learninghypergeometric rewardweak-to-strong generalizationfactual consistencymultimodal modelsimage captioningvideo captioning

0 comments

The pith

VCap pairs reference captions with visual signals as a Witness-Adjudicator reward to deliver hypergeometric-precision verification of factual consistency, enabling an 8B model to outperform larger SOTA systems on captioning benchmarks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces VCap to improve reinforcement learning for visual captioning by creating a reward that checks whether generated captions match reference captions when both are grounded in the actual image or video content. Current reward designs give noisy or coarse signals that limit how well models can reduce hallucinations or omissions. VCap instead treats the reference as a witness and the visual input as an adjudicator to score consistency at a level of precision matching a hypergeometric distribution. This setup is intended to work even when references are imperfect, supporting training where a weaker model learns from its own outputs to reach stronger performance. Experiments show the resulting 8B model exceeds both open and closed source leaders on standard image and video captioning tests, with human raters confirming better factual alignment.

Core claim

VCap is a Witness-Adjudicator reward that explicitly verifies factual consistency between the reference and policy-generated captions grounded in the visual signal, delivering a reward signal with hypergeometric-distribution-level precision for caption quality verification. This design enables effective learning even from imperfect references, facilitating weak-to-strong generalization in RL training.

What carries the argument

The Witness-Adjudicator reward, which pairs the reference caption as witness with the visual signal as adjudicator to score factual consistency at hypergeometric precision.

If this is right

An 8B model trained with VCap outperforms open- and closed-source SOTA models on multiple image and video captioning benchmarks.
Human evaluation confirms strong alignment with factual correctness.
VCap improves MLLM perceptual capability and generalizes across tasks.
VCap surpasses best-of-N distillation, challenging prior assumptions about RLVR.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Similar witness-adjudicator rewards could be tested on other generation tasks where reference data is noisy but a grounding signal like video frames is available.
The approach may lower the data quality threshold needed for effective RL fine-tuning of captioners.
If the precision claim holds, it suggests that verification mechanisms grounded in raw input can substitute for some of the benefits usually attributed to larger model scale.

Load-bearing premise

The visual signal can serve as a reliable, unbiased adjudicator that verifies factual consistency between reference and generated captions at the claimed hypergeometric precision without introducing new biases.

What would settle it

Train an 8B model with VCap and a control model with a standard reward on the same data, then have human raters score factual correctness on a held-out captioning set; if the VCap model shows no statistically significant advantage in factual alignment, the central claim is falsified.

Figures

Figures reproduced from arXiv: 2605.28023 by Bin Wen, Changyi Liu, Chun Yuan, Fan Yang, Han Li, Haonan Fan, Jinpeng Wang, Kaiyu Jiang, Tianke Zhang, Tingting Gao, Xingyu Lu, Xuanyu Zheng, Yancheng Long, Yankai Yang, Yi-Fan Zhang, Yiyang Fan.

**Figure 2.** Figure 2: VCap overview. (a) VCap’s reward mechanism: reference (witness) and image (adjudicator) jointly produce Correctness, Completeness, and Text Quality scores. (b) For video, a global reward and a per-segment reward are combined. (c) Self-improvement: the policy model iteratively regenerates stronger references, which sharpen the reward signal to further refine the policy. model on the per-segment caption aga… view at source ↗

**Figure 3.** Figure 3: Human evaluation on the 500-image set. Left: per-model statistics. |M|/|I|: total Judgeproposed missing/inconsistent propositions. Mˆ / ˆI: human-confirmed true missing/inconsistent counts. r¯H/r˜H: mean and median per-image human rank, where 1 denotes the best model. r¯V: mean perimage rank under the VCap reward. w¯: average caption length in words. Right: per-image pairwise agreement (%) between the VC… view at source ↗

**Figure 4.** Figure 4: Reward-model instruction for image captioning when a reference caption is available. [PITH_FULL_IMAGE:figures/full_fig_p022_4.png] view at source ↗

**Figure 5.** Figure 5: Reward-model instruction for image captioning without a reference caption, used in the [PITH_FULL_IMAGE:figures/full_fig_p023_5.png] view at source ↗

**Figure 6.** Figure 6: Reward-model instruction for the global pass over a full video, scoring Reasonability, [PITH_FULL_IMAGE:figures/full_fig_p023_6.png] view at source ↗

**Figure 7.** Figure 7: Reward-model instruction for the per-segment local pass on a randomly sampled temporal [PITH_FULL_IMAGE:figures/full_fig_p024_7.png] view at source ↗

**Figure 8.** Figure 8: Case-study image. A flowering tree with reddish-brown leaves and pink blossoms dominates [PITH_FULL_IMAGE:figures/full_fig_p024_8.png] view at source ↗

**Figure 9.** Figure 9: Policy starting caption from the unmodified backbone, used as the reference [PITH_FULL_IMAGE:figures/full_fig_p025_9.png] view at source ↗

**Figure 10.** Figure 10: VCap (e1) caption: same backbone after one round of Witness–Adjudicator RL against [PITH_FULL_IMAGE:figures/full_fig_p026_10.png] view at source ↗

**Figure 11.** Figure 11: VCap (e2) caption: same backbone after a second round of RL with the reference [PITH_FULL_IMAGE:figures/full_fig_p027_11.png] view at source ↗

read the original abstract

Visual captioning requires models to capture visual content faithfully while minimizing both omission and hallucination. As the dominant paradigm for captioning, MLLMs have achieved strong performance through scaling and high-quality data. Recently, RL has emerged as a key route to driving MLLMs toward higher precision and broader coverage, however, existing reward designs for captioning fail to provide fine-grained and reliable signals for factual verification, limiting their effectiveness. To address this, we propose VCap, a Witness-Adjudicator reward that pairs the reference caption (a witness) with the visual signal (an adjudicator). By explicitly verifying factual consistency between the reference and policy-generated captions grounded in the visual signal, VCap delivers a reward signal with hypergeometric-distribution-level precision for caption quality verification. This design enables effective learning even from imperfect references, facilitating weak-to-strong generalization in RL training. In our experiments, an 8B model trained with VCap outperforms open- and closed-source SOTA models on multiple image and video captioning benchmarks. Human evaluation further confirms its strong alignment with factual correctness. Additionally, VCap improves MLLM perceptual capability, generalizes across tasks, and surpasses best-of-N distillation, challenging prior assumptions about RLVR.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

The paper introduces a witness-adjudicator reward for RL in visual captioning that claims hypergeometric precision from visual verification, and reports an 8B model beating SOTAs on benchmarks.

read the letter

The main point is a reward design for RL training of captioning models. It pairs a reference caption as witness with the visual input as adjudicator to check factual consistency between reference and generated captions. The authors frame this as producing a hypergeometric-distribution reward signal that supports weak-to-strong generalization even with imperfect references.

What stands out as new is the explicit witness-adjudicator construction tied to that distribution for captioning rewards. Earlier RL work in this area has used various verification signals, but this pairing and the hypergeometric claim appear distinct from the cited priors.

The paper reports solid empirical outcomes: an 8B model trained this way outperforms both open and closed-source models on image and video captioning benchmarks, with human evaluations backing better factual alignment. It also notes gains in perceptual capability, cross-task generalization, and improvement over best-of-N distillation.

The soft spots center on the reward math. The abstract states the hypergeometric precision without showing the derivation or how the distribution emerges from the witness-adjudicator setup, so it is difficult to confirm the precision level or check for fitting. The visual signal's reliability as adjudicator is presented as an empirical result rather than a proven premise, and any systematic biases in that verification would affect the whole pipeline. The comparisons to closed-source models also need clear details on prompting and evaluation protocols to hold up.

This work is aimed at researchers doing RL alignment for multimodal models, especially those focused on reward design for captioning and factual reliability. Readers working on weak-to-strong generalization in vision-language tasks would get the most from it.

The empirical claims and concrete method are strong enough to merit a serious referee, even if the math section requires close checking.

Referee Report

2 major / 0 minor

Summary. The paper proposes VCap, a witness-adjudicator reward for RL-based visual captioning in which a reference caption serves as witness and the visual signal as adjudicator to verify factual consistency between reference and policy-generated captions. The method is claimed to produce a reward with hypergeometric-distribution-level precision, enabling effective weak-to-strong generalization even from imperfect references. Experiments report that an 8B MLLM trained with VCap outperforms open- and closed-source SOTA models on multiple image and video captioning benchmarks, with human evaluation confirming factual alignment; additional claims include improved perceptual capability, cross-task generalization, and superiority over best-of-N distillation.

Significance. If the empirical results and the hypergeometric reward construction hold under scrutiny, the work would be significant for RLVR in multimodal models: it offers a concrete mechanism to obtain fine-grained factual rewards without perfect references, demonstrates smaller models surpassing larger SOTA via this signal, and challenges the assumption that distillation is preferable to RL for captioning. The human-evaluation alignment and generalization results would strengthen the case for visual-signal adjudication in captioning pipelines.

major comments (2)

[Abstract] Abstract and method description: the central claim that the witness-adjudicator pairing yields a reward with 'hypergeometric-distribution-level precision' is load-bearing for the weak-to-strong generalization argument, yet no derivation, probability model, or explicit mapping from the visual-adjudicator verification to the hypergeometric distribution is provided; without this, it is impossible to determine whether the distribution is derived or assumed and whether any parameters are fitted.
[Experiments] Experiments section (implied by benchmark claims): the outperformance of the 8B model over SOTA on image and video captioning benchmarks is the primary empirical support, but the abstract provides no details on the exact benchmarks, metrics, baselines, or statistical significance tests; this information is required to evaluate whether the gains are attributable to the VCap reward rather than other training choices.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We address each major comment below and commit to revisions that strengthen the clarity of the hypergeometric reward construction and the presentation of experimental details.

read point-by-point responses

Referee: [Abstract] Abstract and method description: the central claim that the witness-adjudicator pairing yields a reward with 'hypergeometric-distribution-level precision' is load-bearing for the weak-to-strong generalization argument, yet no derivation, probability model, or explicit mapping from the visual-adjudicator verification to the hypergeometric distribution is provided; without this, it is impossible to determine whether the distribution is derived or assumed and whether any parameters are fitted.

Authors: We agree that the abstract and current method description lack an explicit derivation and probability model. The manuscript presents the witness-adjudicator pairing and states the resulting precision level, but does not include the formal mapping. In the revised version we will insert a dedicated subsection deriving the reward from the hypergeometric distribution, specifying the underlying probability model, the combinatorial verification process, and any fitted parameters. This addition will make clear that the claimed precision follows directly from the distribution rather than being assumed. revision: yes
Referee: [Experiments] Experiments section (implied by benchmark claims): the outperformance of the 8B model over SOTA on image and video captioning benchmarks is the primary empirical support, but the abstract provides no details on the exact benchmarks, metrics, baselines, or statistical significance tests; this information is required to evaluate whether the gains are attributable to the VCap reward rather than other training choices.

Authors: The abstract is length-limited and therefore omits granular experimental details; these are provided in full in the Experiments section, which specifies the image benchmarks (COCO, NoCaps, Flickr30K), video benchmarks (MSVD, MSR-VTT), metrics (CIDEr, BLEU-4, METEOR, SPICE), all baselines, and statistical significance testing. To address the referee's concern we will expand the abstract with a concise enumeration of the primary benchmarks and metrics while retaining the overall length constraint. revision: partial

Circularity Check

0 steps flagged

No significant circularity in derivation chain

full rationale

The abstract and method description present VCap as an empirical reward design (witness-adjudicator pairing) whose hypergeometric precision is stated as an outcome of the construction rather than derived via equations that reduce to fitted inputs or self-citations. No load-bearing derivations, predictions, or uniqueness theorems are exhibited in the provided text. The central result is benchmark performance and human evaluation, which remain independent of any internal algebraic reduction. This is the most common honest finding for an empirical RL paper without visible first-principles claims.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only view yields no explicit free parameters, axioms, or invented entities; the hypergeometric framing and visual-adjudicator reliability are implicit assumptions whose status cannot be audited without the full text.

pith-pipeline@v0.9.1-grok · 5806 in / 1202 out tokens · 29177 ms · 2026-06-29T13:11:39.540279+00:00 · methodology

discussion (0)

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Kwai Keye-VL-2.0 Technical Report
cs.CV 2026-06 unverdicted novelty 4.0

Kwai Keye-VL-2.0-30B-A3B is a 30B MoE model with 3B active parameters using DSA adaptation and MOPD distillation that reports SOTA results on video understanding and agent benchmarks.

Reference graph

Works this paper leans on

71 extracted references · 44 canonical work pages · cited by 1 Pith paper · 10 internal anchors

[1]

Wenhao Chai, Enxin Song, Yilun Du, Chenlin Meng, Vashisht Madhavan, Omer Bar-Tal, Jenq-Neng Hwang, Saining Xie, and Christopher D. Manning. Auroracap: Efficient, performant video detailed captioning and a new benchmark. InThe Thirteenth International Conference on Learning Representations, ICLR 2025, Singapore, April 24-28, 2025. OpenReview.net, 2025. URL...

2025
[2]

Reiss, N

Soravit Changpinyo, Piyush Sharma, Nan Ding, and Radu Soricut. Conceptual 12m: Pushing web-scale image-text pre-training to recognize long-tail visual concepts. InIEEE Conference on Computer Vision and Pattern Recognition, CVPR 2021, virtual, June 19-25, 2021, pages 3558–3568. Computer Vision Foundation / IEEE, 2021. doi: 10.1109/CVPR46437.2021.00356. URL...

work page doi:10.1109/cvpr46437.2021.00356 2021
[3]

ALLaVA: Harnessing GPT4V-Synthesized Data for Lite Vision-Language Models

Guiming Hardy Chen, Shunian Chen, Ruifei Zhang, Junying Chen, Xiangbo Wu, Zhiyi Zhang, Zhihong Chen, Jianquan Li, Xiang Wan, and Benyou Wang. Allava: Harnessing gpt4v-synthesized data for lite vision-language models.ArXiv preprint, abs/2402.11684, 2024. URL https://arxiv.org/abs/2402. 11684

work page internal anchor Pith review Pith/arXiv arXiv 2024
[5]

URLhttps://arxiv.org/abs/2311.12793

work page internal anchor Pith review Pith/arXiv arXiv
[6]

Sharegpt4video: Improving video understanding and generation with better captions

Lin Chen, Xilin Wei, Jinsong Li, Xiaoyi Dong, Pan Zhang, Yuhang Zang, Zehui Chen, Haodong Duan, Lin Bin, Zhenyu Tang, Li Yuan, Yu Qiao, Dahua Lin, Feng Zhao, and Jiaqi Wang. Sharegpt4video: Improving video understanding and generation with better captions. In Amir Globersons, Lester Mackey, Danielle Belgrave, Angela Fan, Ulrich Paquet, Jakub M. Tomczak, a...

2024
[7]

URL https://proceedings.mlr

Tsai-Shien Chen, Aliaksandr Siarohin, Willi Menapace, Ekaterina Deyneka, Hsiang-wei Chao, Byung Eun Jeon, Yuwei Fang, Hsin-Ying Lee, Jian Ren, Ming-Hsuan Yang, and Sergey Tulyakov. Panda-70m: Captioning 70m videos with multiple cross-modality teachers. InIEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2024, Seattle, WA, USA, June 16-2...

work page doi:10.1109/cvpr52733.2024.01265 2024
[8]

Microsoft COCO Captions: Data Collection and Evaluation Server

Xinlei Chen, Hao Fang, Tsung-Yi Lin, Ramakrishna Vedantam, Saurabh Gupta, Piotr Dollár, and C. Lawrence Zitnick. Microsoft COCO captions: Data collection and evaluation server.ArXiv preprint, abs/1504.00325, 2015. URLhttps://arxiv.org/abs/1504.00325

work page internal anchor Pith review Pith/arXiv arXiv 2015
[10]

URLhttps://arxiv.org/abs/2506.09079

work page arXiv
[11]

DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning

DeepSeek-AI, Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Peiyi Wang, Qihao Zhu, Runxin Xu, Ruoyu Zhang, Shirong Ma, et al. Deepseek-R1: Incentivizing reasoning capability in LLMs via reinforcement learning.ArXiv preprint, abs/2501.12948, 2025. URL https://arxiv.org/abs/2501. 12948

work page internal anchor Pith review Pith/arXiv arXiv 2025
[12]

Molmo and PixMo: Open Weights and Open Data for State-of-the-Art Vision-Language Models

Matt Deitke, Christopher Clark, Sangho Lee, Rohun Tripathi, Yue Yang, Jae Sung Park, Mohammadreza Salehi, Niklas Muennighoff, Kyle Lo, Luca Soldaini, Jiasen Lu, Taira Anderson, Erin Bransom, Kiana Ehsani, Huong Ngo, YenSung Chen, Ajay Patel, Mark Yatskar, Chris Callison-Burch, Andrew Head, Rose Hendrix, Favyen Bastani, Eli VanderBilt, Nathan Lambert, Yvon...

work page internal anchor Pith review Pith/arXiv arXiv 2024
[13]

Improving CLIP train- ing with language rewrites

Lijie Fan, Dilip Krishnan, Phillip Isola, Dina Katabi, and Yonglong Tian. Improving CLIP train- ing with language rewrites. In Alice Oh, Tristan Naumann, Amir Globerson, Kate Saenko, Moritz Hardt, and Sergey Levine, editors,Advances in Neural Information Processing Systems 36: Annual Conference on Neural Information Processing Systems 2023, NeurIPS 2023, ...

2023
[14]

Edge-grpo: Entropy-driven grpo with guided error correction for advantage diversity.arXiv preprint arXiv:2507.21848, 2025a

Roopal Garg, Andrea Burns, Burcu Karagol Ayan, Yonatan Bitton, Ceslee Montgomery, Yasumasa Onoe, Andrew Bunner, Ranjay Krishna, Jason Michael Baldridge, and Radu Soricut. ImageInWords: Unlocking hyper-detailed image descriptions. In Yaser Al-Onaizan, Mohit Bansal, and Yun-Nung Chen, editors, Proceedings of the 2024 Conference on Empirical Methods in Natur...

work page doi:10.18653/v1/2024 2024
[15]

Clarification as supervision: Reinforcement learning for vision-language interfaces.ArXiv preprint, abs/2509.26594, 2025

John Gkountouras and Ivan Titov. Clarification as supervision: Reinforcement learning for vision-language interfaces.ArXiv preprint, abs/2509.26594, 2025. URLhttps://arxiv.org/abs/2509.26594

work page arXiv 2025
[16]

URL https://proceedings.mlr

Tianrui Guan, Fuxiao Liu, Xiyang Wu, Ruiqi Xian, Zongxia Li, Xiaoyu Liu, Xijun Wang, Lichang Chen, Furong Huang, Yaser Yacoob, Dinesh Manocha, and Tianyi Zhou. Hallusionbench: An advanced diagnostic suite for entangled language hallucination and visual illusion in large vision-language models. InIEEE/CVF Conference on Computer Vision and Pattern Recogniti...

work page doi:10.1109/cvpr52733.2024.01363 2024
[17]

Rubicap: Rubric- guided reinforcement learning for dense image captioning.ArXiv preprint, abs/2603.09160, 2026

Tzu-Heng Huang, Sirajul Salekin, Javier Movellan, Frederic Sala, and Manjot Bilkhu. Rubicap: Rubric- guided reinforcement learning for dense image captioning.ArXiv preprint, abs/2603.09160, 2026. URL https://arxiv.org/abs/2603.09160

work page arXiv 2026
[18]

Rehman, M

Liqiang Jing, Ruosen Li, Yunmo Chen, and Xinya Du. FaithScore: Fine-grained evaluations of hallucina- tions in large vision-language models. In Yaser Al-Onaizan, Mohit Bansal, and Yun-Nung Chen, editors, Findings of the Association for Computational Linguistics: EMNLP 2024, pages 5042–5063, Miami, Florida, USA, 2024. Association for Computational Linguist...

work page doi:10.18653/v1/2024.findings-emnlp 2024
[19]

URLhttps://aclanthology.org/2024.findings-emnlp.290/

2024
[20]

Miradata: A large-scale video dataset with long durations and structured cap- tions

Xuan Ju, Yiming Gao, Zhaoyang Zhang, Ziyang Yuan, Xintao Wang, Ailing Zeng, Yu Xiong, Qiang Xu, and Ying Shan. Miradata: A large-scale video dataset with long durations and structured cap- tions. In Amir Globersons, Lester Mackey, Danielle Belgrave, Angela Fan, Ulrich Paquet, Jakub M. Tomczak, and Cheng Zhang, editors,Advances in Neural Information Proces...

2024
[21]

Toward robust hyper-detailed image captioning: A multiagent approach and dual evaluation metrics for factuality and coverage

Saehyung Lee, Seunghyun Yoon, Trung Bui, Jing Shi, and Sungroh Yoon. Toward robust hyper-detailed image captioning: A multiagent approach and dual evaluation metrics for factuality and coverage. In Aarti Singh, Maryam Fazel, Daniel Hsu, Simon Lacoste-Julien, Felix Berkenkamp, Tegan Maharaj, Kiri Wagstaff, and Jerry Zhu, editors,Forty-second International ...

2025
[22]

Junnan Li, Dongxu Li, Caiming Xiong, and Steven C. H. Hoi. BLIP: bootstrapping language-image pre-training for unified vision-language understanding and generation. In Kamalika Chaudhuri, Stefanie Jegelka, Le Song, Csaba Szepesvári, Gang Niu, and Sivan Sabato, editors,International Conference on Machine Learning, ICML 2022, 17-23 July 2022, Baltimore, Mar...

2022
[23]

Junnan Li, Dongxu Li, Silvio Savarese, and Steven C. H. Hoi. BLIP-2: bootstrapping language- image pre-training with frozen image encoders and large language models. In Andreas Krause, Emma Brunskill, Kyunghyun Cho, Barbara Engelhardt, Sivan Sabato, and Jonathan Scarlett, editors,Inter- national Conference on Machine Learning, ICML 2023, 23-29 July 2023, ...

2023
[24]

Densefusion- 1m: Merging vision experts for comprehensive multimodal perception

Xiaotong Li, Fan Zhang, Haiwen Diao, Yueze Wang, Xinlong Wang, and Lingyu Duan. Densefusion- 1m: Merging vision experts for comprehensive multimodal perception. In Amir Globersons, Lester Mackey, Danielle Belgrave, Angela Fan, Ulrich Paquet, Jakub M. Tomczak, and Cheng Zhang, editors,Advances in Neural Information Processing Systems 38: Annual Conference ...

2024
[25]

VideoChat-R1: Enhancing Spatio-Temporal Perception via Reinforcement Fine-Tuning

Xinhao Li, Ziang Yan, Desen Meng, Lu Dong, Xiangyu Zeng, Yinan He, Yali Wang, Yu Qiao, Yi Wang, and Limin Wang. Videochat-R1: Enhancing spatio-temporal perception via reinforcement fine-tuning. ArXiv preprint, abs/2504.06958, 2025. URLhttps://arxiv.org/abs/2504.06958

work page internal anchor Pith review Pith/arXiv arXiv 2025
[26]

Evaluating object hallucination in large vision-language models

Yifan Li, Yifan Du, Kun Zhou, Jinpeng Wang, Xin Zhao, and Ji-Rong Wen. Evaluating object hallucination in large vision-language models. In Houda Bouamor, Juan Pino, and Kalika Bali, editors,Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 292–305, Singapore,

2023
[27]

doi: 10.18653/v1/2023.emnlp-main.20

Association for Computational Linguistics. doi: 10.18653/v1/2023.emnlp-main.20. URL https: //aclanthology.org/2023.emnlp-main.20/

work page doi:10.18653/v1/2023.emnlp-main.20 2023
[28]

CAPability: A comprehensive visual caption benchmark for evaluating both correctness and thoroughness.ArXiv preprint, abs/2502.14914, 2025

Zhihang Liu, Chen-Wei Xie, Bin Wen, Feiwu Yu, Jixuan Chen, Pandeng Li, Boqiang Zhang, Nianzu Yang, Yinglu Li, Zuan Gao, Yun Zheng, and Hongtao Xie. CAPability: A comprehensive visual caption benchmark for evaluating both correctness and thoroughness.ArXiv preprint, abs/2502.14914, 2025. URL https://arxiv.org/abs/2502.14914

work page arXiv 2025
[29]

Benchmarking large vision-language models via directed scene graph for comprehensive image captioning.ArXiv preprint, abs/2412.08614, 2024

Fan Lu, Wei Wu, Kecheng Zheng, Shuailei Ma, Biao Gong, Jiawei Liu, Wei Zhai, Yang Cao, Yujun Shen, and Zheng-Jun Zha. Benchmarking large vision-language models via directed scene graph for comprehensive image captioning.ArXiv preprint, abs/2412.08614, 2024. URL https://arxiv.org/ abs/2412.08614

work page arXiv 2024
[30]

Contextrl: Enhancing mllm’s knowledge discovery efficiency with context-augmented rl.ArXiv preprint, abs/2602.22623, 2026

Xingyu Lu, Jinpeng Wang, YiFan Zhang, Shijie Ma, Xiao Hu, Tianke Zhang, Haonan fan, Kaiyu Jiang, Changyi Liu, Kaiyu Tang, Bin Wen, Fan Yang, Tingting Gao, Han Li, and Chun Yuan. Contextrl: Enhancing mllm’s knowledge discovery efficiency with context-augmented rl.ArXiv preprint, abs/2602.22623, 2026. URLhttps://arxiv.org/abs/2602.22623

work page arXiv 2026
[31]

Videocap-R1: Enhancing MLLMs for video captioning via structured thinking.ArXiv preprint, abs/2506.01725, 2025

Desen Meng, Rui Huang, Zhilin Dai, Xinhao Li, Yifan Xu, Jun Zhang, Zhenpeng Huang, Meng Zhang, Lingshu Zhang, Yi Liu, and Limin Wang. Videocap-R1: Enhancing MLLMs for video captioning via structured thinking.ArXiv preprint, abs/2506.01725, 2025. URL https://arxiv.org/abs/2506. 01725

work page arXiv 2025
[32]

OpenVid-1M: A Large-Scale High-Quality Dataset for Text-to-video Generation

Kepan Nan, Rui Xie, Penghao Zhou, Tiehan Fan, Zhenheng Yang, Zhijie Chen, Xiang Li, Jian Yang, and Ying Tai. Openvid-1M: A large-scale high-quality dataset for text-to-video generation.ArXiv preprint, abs/2407.02371, 2024. URLhttps://arxiv.org/abs/2407.02371

work page internal anchor Pith review Pith/arXiv arXiv 2024
[33]

Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll L. Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, John Schulman, Jacob Hilton, Fraser Kelton, Luke Miller, Maddie Simens, Amanda Askell, Peter Welinder, Paul F. Christiano, Jan Leike, and Ryan Lowe. Training language models to follow instructions with human fee...

2022
[34]

Prism: A framework for decoupling and assessing the capabilities of vlms

Yuxuan Qiao, Haodong Duan, Xinyu Fang, Junming Yang, Lin Chen, Songyang Zhang, Jiaqi Wang, Dahua Lin, and Kai Chen. Prism: A framework for decoupling and assessing the capabilities of vlms. In Amir Globersons, Lester Mackey, Danielle Belgrave, Angela Fan, Ulrich Paquet, Jakub M. Tomczak, and Cheng Zhang, editors,Advances in Neural Information Processing S...

2024
[35]

ARGUS: Hallucination and omission evaluation in video-LLMs.ArXiv preprint, abs/2506.07371, 2025

Ruchit Rawal, Reza Shirkavand, Heng Huang, Gowthami Somepalli, and Tom Goldstein. ARGUS: Hallucination and omission evaluation in video-LLMs.ArXiv preprint, abs/2506.07371, 2025. URL https://arxiv.org/abs/2506.07371

work page arXiv 2025
[36]

LAION-5B: an open 12 large-scale dataset for training next generation image-text models

Christoph Schuhmann, Romain Beaumont, Richard Vencu, Cade Gordon, Ross Wightman, Mehdi Cherti, Theo Coombes, Aarush Katta, Clayton Mullis, Mitchell Wortsman, Patrick Schramowski, Srivatsa Kun- durthy, Katherine Crowson, Ludwig Schmidt, Robert Kaczmarczyk, and Jenia Jitsev. LAION-5B: an open 12 large-scale dataset for training next generation image-text mo...

2022
[37]

Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, Y .K. Li, Y . Wu, and Daya Guo. DeepSeekMath: Pushing the limits of mathematical reasoning in open language models.ArXiv preprint, abs/2402.03300, 2024. URL https://arxiv.org/abs/2402. 03300

work page internal anchor Pith review Pith/arXiv arXiv 2024
[38]

VLM-R1: A Stable and Generalizable R1-style Large Vision-Language Model

Haozhan Shen, Peng Liu, Jingcheng Li, Chunxin Fang, Yibo Ma, Jiajia Liao, Qiaoli Shen, Zilun Zhang, Kangjia Zhao, Qianqian Zhang, Ruochen Xu, and Tiancheng Zhao. Vlm-R1: A stable and generalizable R1-style large vision-language model.ArXiv preprint, abs/2504.07615, 2025. URL https://arxiv. org/abs/2504.07615

work page internal anchor Pith review Pith/arXiv arXiv 2025
[39]

From pixels to prose: A large dataset of dense image captions.ArXiv preprint, abs/2406.10328, 2024

Vasu Singla, Kaiyu Yue, Sukriti Paul, Reza Shirkavand, Mayuka Jayawardhana, Alireza Ganjdanesh, Heng Huang, Abhinav Bhatele, Gowthami Somepalli, and Tom Goldstein. From pixels to prose: A large dataset of dense image captions.ArXiv preprint, abs/2406.10328, 2024. URL https://arxiv.org/abs/2406. 10328

work page arXiv 2024
[40]

Enhancing descriptive captions with visual attributes for multimodal perception.ArXiv preprint, abs/2412.14233, 2024

Yanpeng Sun, Jing Hao, Ke Zhu, Jiang-Jiang Liu, Yuxiang Zhao, Xiaofan Li, Na Zhao, Zechao Li, and Jingdong Wang. Enhancing descriptive captions with visual attributes for multimodal perception.ArXiv preprint, abs/2412.14233, 2024. URLhttps://arxiv.org/abs/2412.14233

work page arXiv 2024
[41]

Aligning large multimodal models with factually augmented RLHF

Zhiqing Sun, Sheng Shen, Shengcao Cao, Haotian Liu, Chunyuan Li, Yikang Shen, Chuang Gan, Liangyan Gui, Yu-Xiong Wang, Yiming Yang, Kurt Keutzer, and Trevor Darrell. Aligning large multimodal models with factually augmented RLHF. In Lun-Wei Ku, Andre Martins, and Vivek Srikumar, editors, Findings of the Association for Computational Linguistics: ACL 2024,...

work page doi:10.18653/v1/2024.findings-acl.775 2024
[42]

arXiv:2506.15220 , year =

Changli Tang, Yixuan Li, Yudong Yang, Jimin Zhuang, Guangzhi Sun, Wei Li, Zejun Ma, and Chao Zhang. video-SALMONN 2: Caption-enhanced audio-visual large language models.ArXiv preprint, abs/2506.15220, 2025. URLhttps://arxiv.org/abs/2506.15220

work page arXiv 2025
[43]

Shamma, Gerald Friedland, Benjamin Elizalde, Karl Ni, Douglas Poland, Damian Borth, and Li-Jia Li

Bart Thomee, David A. Shamma, Gerald Friedland, Benjamin Elizalde, Karl Ni, Douglas Poland, Damian Borth, and Li-Jia Li. YFCC100M: The new data in multimedia research.Communications of the ACM, 59 (2):64–73, 2016

2016
[44]

URL https://proceedings.mlr

Jack Urbanek, Florian Bordes, Pietro Astolfi, Mary Williamson, Vasu Sharma, and Adriana Romero- Soriano. A picture is worth more than 77 text tokens: Evaluating clip-style models on dense captions. InIEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2024, Seattle, WA, USA, June 16-22, 2024, pages 26690–26699. IEEE, 2024. doi: 10.1109/CV...

work page doi:10.1109/cvpr52733.2024.02521 2024
[45]

Huang, Nan Xu, Sheng Zhang, Hoifung Poon, and Muhao Chen

Fei Wang, Wenxuan Zhou, James Y . Huang, Nan Xu, Sheng Zhang, Hoifung Poon, and Muhao Chen. mDPO: Conditional preference optimization for multimodal large language models. In Yaser Al-Onaizan, Mohit Bansal, and Yun-Nung Chen, editors,Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pages 8078–8088, Miami, Florida, US...

work page doi:10.18653/v1/2024.emnlp-main.460 2024
[46]

Tarsier: Recipes for training and evaluating large video description models.ArXiv preprint, abs/2407.00634, 2024

Jiawei Wang, Liping Yuan, Yuchen Zhang, and Haomiao Sun. Tarsier: Recipes for training and evaluating large video description models.ArXiv preprint, abs/2407.00634, 2024. URL https://arxiv.org/abs/ 2407.00634

work page arXiv 2024
[47]

AMBER: An LLM-free Multi-dimensional Benchmark for MLLMs Hallucination Evaluation

Junyang Wang, Yuhang Wang, Guohai Xu, Jing Zhang, Yukai Gu, Haitao Jia, Jiaqi Wang, Haiyang Xu, Ming Yan, Ji Zhang, and Jitao Sang. AMBER: An LLM-free multi-dimensional benchmark for MLLMs hallucination evaluation.ArXiv preprint, abs/2311.07397, 2023. URL https://arxiv.org/abs/2311. 07397

work page internal anchor Pith review Pith/arXiv arXiv 2023
[48]

Vdc-agent: When video detailed captioners evolve themselves via agentic self-reflection.ArXiv preprint, abs/2511.19436, 2025

Qiang Wang, Xinyuan Gao, SongLin Dong, Jizhou Han, Jiangyang Li, Yuhang He, and Yihong Gong. Vdc-agent: When video detailed captioners evolve themselves via agentic self-reflection.ArXiv preprint, abs/2511.19436, 2025. URLhttps://arxiv.org/abs/2511.19436. 13

work page arXiv 2025
[49]

Vicrit: A verifiable reinforcement learning proxy task for visual perception in VLMs.ArXiv preprint, abs/2506.10128, 2025

Xiyao Wang, Zhengyuan Yang, Chao Feng, Yongyuan Liang, Yuhang Zhou, Xiaoyu Liu, Ziyi Zang, Ming Li, Chung-Ching Lin, Kevin Lin, Linjie Li, Furong Huang, and Lijuan Wang. Vicrit: A verifiable reinforcement learning proxy task for visual perception in VLMs.ArXiv preprint, abs/2506.10128, 2025. URLhttps://arxiv.org/abs/2506.10128

work page arXiv 2025
[50]

Detecting and mitigating hallucination in large vision language models via fine-grained AI feedback

Wenyi Xiao, Ziwei Huang, Leilei Gan, Wanggui He, Haoyuan Li, Zhelun Yu, Fangxun Shu, Hao Jiang, and Linchao Zhu. Detecting and mitigating hallucination in large vision language models via fine-grained AI feedback. In Toby Walsh, Julie Shah, and Zico Kolter, editors,Thirty-Ninth AAAI Conference on Artificial Intelligence, Thirty-Seventh Conference on Innov...

work page doi:10.1609/aaai.v39i24.34744 2025
[51]

NeMo guardrails: A toolkit for controllable and safe LLM applications with pro- grammable rails

Yuxi Xie, Guanzhen Li, Xiao Xu, and Min-Yen Kan. V-DPO: Mitigating hallucination in large vision language models via vision-guided direct preference optimization. In Yaser Al-Onaizan, Mohit Bansal, and Yun-Nung Chen, editors,Findings of the Association for Computational Linguistics: EMNLP 2024, pages 13258–13273, Miami, Florida, USA, 2024. Association for...

work page doi:10.18653/v1/ 2024
[53]

Caprl: Stimulating dense image caption capabilities via reinforcement learning.ArXiv preprint, abs/2509.22647, 2025

Long Xing, Xiaoyi Dong, Yuhang Zang, Yuhang Cao, Jianze Liang, Qidong Huang, Jiaqi Wang, Feng Wu, and Dahua Lin. Caprl: Stimulating dense image caption capabilities via reinforcement learning.ArXiv preprint, abs/2509.22647, 2025. URLhttps://arxiv.org/abs/2509.22647

work page arXiv 2025
[54]

LLaV A-Critic: Learning to evaluate multimodal models.ArXiv preprint, abs/2410.02712, 2024

Tianyi Xiong, Xiyao Wang, Dong Guo, Qinghao Ye, Haoqi Fan, Quanquan Gu, Heng Huang, and Chunyuan Li. LLaV A-Critic: Learning to evaluate multimodal models.ArXiv preprint, abs/2410.02712, 2024. URL https://arxiv.org/abs/2410.02712

work page arXiv 2024
[55]

Vript: A video is worth thousands of words

Dongjie Yang, Suyuan Huang, Chengqiang Lu, Xiaodong Han, Haoxin Zhang, Yan Gao, Yao Hu, and Hai Zhao. Vript: A video is worth thousands of words. In Amir Globersons, Lester Mackey, Danielle Belgrave, Angela Fan, Ulrich Paquet, Jakub M. Tomczak, and Cheng Zhang, editors,Advances in Neural Information Processing Systems 38: Annual Conference on Neural Infor...

2024
[56]

Painting with words: Elevating detailed image captioning with benchmark and alignment learning

Qinghao Ye, Xianhan Zeng, Fu Li, Chunyuan Li, and Haoqi Fan. Painting with words: Elevating detailed image captioning with benchmark and alignment learning. InThe Thirteenth International Conference on Learning Representations, ICLR 2025, Singapore, April 24-28, 2025. OpenReview.net, 2025. URL https://openreview.net/forum?id=636M0nNbPs

2025
[57]

Peter Young, Alice Lai, Micah Hodosh, and Julia Hockenmaier. From image descriptions to visual denotations: New similarity metrics for semantic inference over event descriptions.Transactions of the Association for Computational Linguistics, 2:67–78, 2014. doi: 10.1162/tacl_a_00166. URL https: //aclanthology.org/Q14-1006/

work page doi:10.1162/tacl_a_00166 2014
[58]

URL https://proceedings.mlr

Qiying Yu, Quan Sun, Xiaosong Zhang, Yufeng Cui, Fan Zhang, Yue Cao, Xinlong Wang, and Jingjing Liu. Capsfusion: Rethinking image-text data at scale. InIEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2024, Seattle, WA, USA, June 16-22, 2024, pages 14022–14032. IEEE, 2024. doi: 10.1109/CVPR52733.2024.01330. URLhttps://doi.org/10.1109/C...

work page doi:10.1109/cvpr52733.2024.01330 2024
[59]

URL https://proceedings.mlr

Tianyu Yu, Yuan Yao, Haoye Zhang, Taiwen He, Yifeng Han, Ganqu Cui, Jinyi Hu, Zhiyuan Liu, Hai-Tao Zheng, and Maosong Sun. RLHF-V: towards trustworthy mllms via behavior alignment from fine-grained correctional human feedback. InIEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2024, Seattle, WA, USA, June 16-22, 2024, pages 13807–13816...

work page doi:10.1109/cvpr52733.2024.01310 2024
[60]

Showui: One vision-language- action model for GUI visual agent

Tianyu Yu, Haoye Zhang, Qiming Li, Qixin Xu, Yuan Yao, Da Chen, Xiaoman Lu, Ganqu Cui, Yunkai Dang, Taiwen He, Xiaocheng Feng, Jun Song, Bo Zheng, Zhiyuan Liu, Tat-Seng Chua, and Maosong Sun. RLAIF-V: open-source AI feedback leads to super GPT-4V trustworthiness. InIEEE/CVF Confer- ence on Computer Vision and Pattern Recognition, CVPR 2025, Nashville, TN,...

work page doi:10.1109/cvpr52734.2025.01861 2025
[61]

Tarsier2: Advancing large vision-language models from detailed video description to comprehensive video understanding.arXiv:2501.07888, 2025b

Liping Yuan, Jiawei Wang, Haomiao Sun, Yuchen Zhang, and Yuan Lin. Tarsier2: Advancing large vision-language models from detailed video description to comprehensive video understanding.ArXiv preprint, abs/2501.07888, 2025. URLhttps://arxiv.org/abs/2501.07888

work page arXiv 2025
[62]

Sc-captioner: Improving image captioning with self-correction by reinforcement learning.ArXiv preprint, abs/2508.06125, 2025

Lin Zhang, Xianfang Zeng, Kangcong Li, Gang Yu, and Tao Chen. Sc-captioner: Improving image captioning with self-correction by reinforcement learning.ArXiv preprint, abs/2508.06125, 2025. URL https://arxiv.org/abs/2508.06125

work page arXiv 2025
[63]

− modality

Chunlin Zhong, Qiuxia Hou, Zhangjun Zhou, Shuang Hao, Haonan Lu, Yanhao Zhang, He Tang, and Xiang Bai. Owlcap: Harmonizing motion-detail for video captioning via HMD-270K and caption set equivalence reward.ArXiv preprint, abs/2508.18634, 2025. URLhttps://arxiv.org/abs/2508.18634. 15 A Appendix Appendix Contents A.1 Experimental Setup Details . . . . . . ....

work page arXiv 2025
[64]

all elements have been described

Analyze the Generated Caption on three dimensions: • Correctness: does it accurately represent the image, free of objects not present, inaccuracies, or contradictions? Fewer mistakes is better. • Completeness: does it cover all objects in detail with no omissions of details or objects? Fewer omissions is better. • Text Quality: is it logically fluent, coh...
[65]

Analysis

Provide an integer score in{0,1, . . . ,10}for each dimension. Input. Reference Caption:ref_answer Generated Caption:gen_solution Output format (strict).Return exactly one JSON object: {"Analysis":⟨your analysis⟩, "Correctness": score1, "Completeness": score2, "Text Quality": score3} Figure 4: Reward-model instruction for image captioning when a reference...
[66]

Read the Generated Description; inspect frames in timestamp order; compare against the Refer- ence Description and decide by frames on disagreements
[67]

List confirmed hallucinations, inaccuracies, omissions, and timestamp mismatches per dimension
[68]

Analysis

Provide an integer score in{0,1, . . . ,10}for each dimension. Input. Reference Description:REF_DESC Generated Description:GEN_DESC Output format (strict).Return exactly one JSON object: {"Analysis":⟨your analysis⟩, "Reasonability": score1, "Correctness": score2, "Completeness": score3} Figure 6: Reward-model instruction for the global pass over a full vi...
[69]

the image actually shows A, but the caption says B,

inconsistent (the image clearly contradicts the assertion), or3. undecidable (the referenced object is absent, or visible but not resolvable to the required level of detail due to occlusion, low resolution, or ambiguous angle). For an inconsistent proposition with the form “the image actually shows A, but the caption says B,” Step 1 only adjudicates the “...
[70]

Step 2: proposition vs

undecidable is reserved for genuinely unresolvable cases: whenever the relevant object is visible and clear, the annotator must commit to either consistent or inconsistent. Step 2: proposition vs. caption (conditional).Step 2 is executed only when Step 1 returns
[71]

consistent ; otherwise it is left blank. The annotator first locates the relevant object in the image, then locates the corresponding span in the caption (most captions follow a fore- ground/midground/background or left/center/right organization), and compares the proposition, 27 Table 8: Quick reference for the two-step proposition labels used in the hum...
[72]

image showsA

consistent 1. not holding Image really contains the asserted content; caption is too vague to decide1. consistent 3. ambiguous Image does not contain the asserted content2. inconsistent— Object is occluded / blurred / unresolvable in the image3. undecidable— Proposition: “image showsA”; image is indeedA; caption saysB1. consistent 2. holding Proposition: ...
[73]

image showsA

consistent 1. not holding Proposition: “image showsA”; image actually showsC(̸=A,̸=B)2. inconsistent— the caption span, and the image jointly. The verdict is one of: 1. not holding (the proposition does not in fact hold against the caption: a missing proposition whose content is already covered by the caption via a synonym or hypernym, or an inconsistent ...
[74]

subjective

holding , because models frequently distribute related content across non-adjacent paragraphs. If a caption self-contradicts, the existence of the erroneous wording is sufficient to label2. holding on the inconsistent side. Step 3 per-image rankings.After all propositions for the five captions of an image are anno- tated, the annotator produces three inde...

[1] [1]

Wenhao Chai, Enxin Song, Yilun Du, Chenlin Meng, Vashisht Madhavan, Omer Bar-Tal, Jenq-Neng Hwang, Saining Xie, and Christopher D. Manning. Auroracap: Efficient, performant video detailed captioning and a new benchmark. InThe Thirteenth International Conference on Learning Representations, ICLR 2025, Singapore, April 24-28, 2025. OpenReview.net, 2025. URL...

2025

[2] [2]

Reiss, N

Soravit Changpinyo, Piyush Sharma, Nan Ding, and Radu Soricut. Conceptual 12m: Pushing web-scale image-text pre-training to recognize long-tail visual concepts. InIEEE Conference on Computer Vision and Pattern Recognition, CVPR 2021, virtual, June 19-25, 2021, pages 3558–3568. Computer Vision Foundation / IEEE, 2021. doi: 10.1109/CVPR46437.2021.00356. URL...

work page doi:10.1109/cvpr46437.2021.00356 2021

[3] [3]

ALLaVA: Harnessing GPT4V-Synthesized Data for Lite Vision-Language Models

Guiming Hardy Chen, Shunian Chen, Ruifei Zhang, Junying Chen, Xiangbo Wu, Zhiyi Zhang, Zhihong Chen, Jianquan Li, Xiang Wan, and Benyou Wang. Allava: Harnessing gpt4v-synthesized data for lite vision-language models.ArXiv preprint, abs/2402.11684, 2024. URL https://arxiv.org/abs/2402. 11684

work page internal anchor Pith review Pith/arXiv arXiv 2024

[4] [5]

URLhttps://arxiv.org/abs/2311.12793

work page internal anchor Pith review Pith/arXiv arXiv

[5] [6]

Sharegpt4video: Improving video understanding and generation with better captions

Lin Chen, Xilin Wei, Jinsong Li, Xiaoyi Dong, Pan Zhang, Yuhang Zang, Zehui Chen, Haodong Duan, Lin Bin, Zhenyu Tang, Li Yuan, Yu Qiao, Dahua Lin, Feng Zhao, and Jiaqi Wang. Sharegpt4video: Improving video understanding and generation with better captions. In Amir Globersons, Lester Mackey, Danielle Belgrave, Angela Fan, Ulrich Paquet, Jakub M. Tomczak, a...

2024

[6] [7]

URL https://proceedings.mlr

Tsai-Shien Chen, Aliaksandr Siarohin, Willi Menapace, Ekaterina Deyneka, Hsiang-wei Chao, Byung Eun Jeon, Yuwei Fang, Hsin-Ying Lee, Jian Ren, Ming-Hsuan Yang, and Sergey Tulyakov. Panda-70m: Captioning 70m videos with multiple cross-modality teachers. InIEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2024, Seattle, WA, USA, June 16-2...

work page doi:10.1109/cvpr52733.2024.01265 2024

[7] [8]

Microsoft COCO Captions: Data Collection and Evaluation Server

Xinlei Chen, Hao Fang, Tsung-Yi Lin, Ramakrishna Vedantam, Saurabh Gupta, Piotr Dollár, and C. Lawrence Zitnick. Microsoft COCO captions: Data collection and evaluation server.ArXiv preprint, abs/1504.00325, 2015. URLhttps://arxiv.org/abs/1504.00325

work page internal anchor Pith review Pith/arXiv arXiv 2015

[8] [10]

URLhttps://arxiv.org/abs/2506.09079

work page arXiv

[9] [11]

DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning

DeepSeek-AI, Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Peiyi Wang, Qihao Zhu, Runxin Xu, Ruoyu Zhang, Shirong Ma, et al. Deepseek-R1: Incentivizing reasoning capability in LLMs via reinforcement learning.ArXiv preprint, abs/2501.12948, 2025. URL https://arxiv.org/abs/2501. 12948

work page internal anchor Pith review Pith/arXiv arXiv 2025

[10] [12]

Molmo and PixMo: Open Weights and Open Data for State-of-the-Art Vision-Language Models

Matt Deitke, Christopher Clark, Sangho Lee, Rohun Tripathi, Yue Yang, Jae Sung Park, Mohammadreza Salehi, Niklas Muennighoff, Kyle Lo, Luca Soldaini, Jiasen Lu, Taira Anderson, Erin Bransom, Kiana Ehsani, Huong Ngo, YenSung Chen, Ajay Patel, Mark Yatskar, Chris Callison-Burch, Andrew Head, Rose Hendrix, Favyen Bastani, Eli VanderBilt, Nathan Lambert, Yvon...

work page internal anchor Pith review Pith/arXiv arXiv 2024

[11] [13]

Improving CLIP train- ing with language rewrites

Lijie Fan, Dilip Krishnan, Phillip Isola, Dina Katabi, and Yonglong Tian. Improving CLIP train- ing with language rewrites. In Alice Oh, Tristan Naumann, Amir Globerson, Kate Saenko, Moritz Hardt, and Sergey Levine, editors,Advances in Neural Information Processing Systems 36: Annual Conference on Neural Information Processing Systems 2023, NeurIPS 2023, ...

2023

[12] [14]

Edge-grpo: Entropy-driven grpo with guided error correction for advantage diversity.arXiv preprint arXiv:2507.21848, 2025a

Roopal Garg, Andrea Burns, Burcu Karagol Ayan, Yonatan Bitton, Ceslee Montgomery, Yasumasa Onoe, Andrew Bunner, Ranjay Krishna, Jason Michael Baldridge, and Radu Soricut. ImageInWords: Unlocking hyper-detailed image descriptions. In Yaser Al-Onaizan, Mohit Bansal, and Yun-Nung Chen, editors, Proceedings of the 2024 Conference on Empirical Methods in Natur...

work page doi:10.18653/v1/2024 2024

[13] [15]

Clarification as supervision: Reinforcement learning for vision-language interfaces.ArXiv preprint, abs/2509.26594, 2025

John Gkountouras and Ivan Titov. Clarification as supervision: Reinforcement learning for vision-language interfaces.ArXiv preprint, abs/2509.26594, 2025. URLhttps://arxiv.org/abs/2509.26594

work page arXiv 2025

[14] [16]

URL https://proceedings.mlr

Tianrui Guan, Fuxiao Liu, Xiyang Wu, Ruiqi Xian, Zongxia Li, Xiaoyu Liu, Xijun Wang, Lichang Chen, Furong Huang, Yaser Yacoob, Dinesh Manocha, and Tianyi Zhou. Hallusionbench: An advanced diagnostic suite for entangled language hallucination and visual illusion in large vision-language models. InIEEE/CVF Conference on Computer Vision and Pattern Recogniti...

work page doi:10.1109/cvpr52733.2024.01363 2024

[15] [17]

Rubicap: Rubric- guided reinforcement learning for dense image captioning.ArXiv preprint, abs/2603.09160, 2026

Tzu-Heng Huang, Sirajul Salekin, Javier Movellan, Frederic Sala, and Manjot Bilkhu. Rubicap: Rubric- guided reinforcement learning for dense image captioning.ArXiv preprint, abs/2603.09160, 2026. URL https://arxiv.org/abs/2603.09160

work page arXiv 2026

[16] [18]

Rehman, M

Liqiang Jing, Ruosen Li, Yunmo Chen, and Xinya Du. FaithScore: Fine-grained evaluations of hallucina- tions in large vision-language models. In Yaser Al-Onaizan, Mohit Bansal, and Yun-Nung Chen, editors, Findings of the Association for Computational Linguistics: EMNLP 2024, pages 5042–5063, Miami, Florida, USA, 2024. Association for Computational Linguist...

work page doi:10.18653/v1/2024.findings-emnlp 2024

[17] [19]

URLhttps://aclanthology.org/2024.findings-emnlp.290/

2024

[18] [20]

Miradata: A large-scale video dataset with long durations and structured cap- tions

Xuan Ju, Yiming Gao, Zhaoyang Zhang, Ziyang Yuan, Xintao Wang, Ailing Zeng, Yu Xiong, Qiang Xu, and Ying Shan. Miradata: A large-scale video dataset with long durations and structured cap- tions. In Amir Globersons, Lester Mackey, Danielle Belgrave, Angela Fan, Ulrich Paquet, Jakub M. Tomczak, and Cheng Zhang, editors,Advances in Neural Information Proces...

2024

[19] [21]

Toward robust hyper-detailed image captioning: A multiagent approach and dual evaluation metrics for factuality and coverage

Saehyung Lee, Seunghyun Yoon, Trung Bui, Jing Shi, and Sungroh Yoon. Toward robust hyper-detailed image captioning: A multiagent approach and dual evaluation metrics for factuality and coverage. In Aarti Singh, Maryam Fazel, Daniel Hsu, Simon Lacoste-Julien, Felix Berkenkamp, Tegan Maharaj, Kiri Wagstaff, and Jerry Zhu, editors,Forty-second International ...

2025

[20] [22]

Junnan Li, Dongxu Li, Caiming Xiong, and Steven C. H. Hoi. BLIP: bootstrapping language-image pre-training for unified vision-language understanding and generation. In Kamalika Chaudhuri, Stefanie Jegelka, Le Song, Csaba Szepesvári, Gang Niu, and Sivan Sabato, editors,International Conference on Machine Learning, ICML 2022, 17-23 July 2022, Baltimore, Mar...

2022

[21] [23]

Junnan Li, Dongxu Li, Silvio Savarese, and Steven C. H. Hoi. BLIP-2: bootstrapping language- image pre-training with frozen image encoders and large language models. In Andreas Krause, Emma Brunskill, Kyunghyun Cho, Barbara Engelhardt, Sivan Sabato, and Jonathan Scarlett, editors,Inter- national Conference on Machine Learning, ICML 2023, 23-29 July 2023, ...

2023

[22] [24]

Densefusion- 1m: Merging vision experts for comprehensive multimodal perception

Xiaotong Li, Fan Zhang, Haiwen Diao, Yueze Wang, Xinlong Wang, and Lingyu Duan. Densefusion- 1m: Merging vision experts for comprehensive multimodal perception. In Amir Globersons, Lester Mackey, Danielle Belgrave, Angela Fan, Ulrich Paquet, Jakub M. Tomczak, and Cheng Zhang, editors,Advances in Neural Information Processing Systems 38: Annual Conference ...

2024

[23] [25]

VideoChat-R1: Enhancing Spatio-Temporal Perception via Reinforcement Fine-Tuning

Xinhao Li, Ziang Yan, Desen Meng, Lu Dong, Xiangyu Zeng, Yinan He, Yali Wang, Yu Qiao, Yi Wang, and Limin Wang. Videochat-R1: Enhancing spatio-temporal perception via reinforcement fine-tuning. ArXiv preprint, abs/2504.06958, 2025. URLhttps://arxiv.org/abs/2504.06958

work page internal anchor Pith review Pith/arXiv arXiv 2025

[24] [26]

Evaluating object hallucination in large vision-language models

Yifan Li, Yifan Du, Kun Zhou, Jinpeng Wang, Xin Zhao, and Ji-Rong Wen. Evaluating object hallucination in large vision-language models. In Houda Bouamor, Juan Pino, and Kalika Bali, editors,Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 292–305, Singapore,

2023

[25] [27]

doi: 10.18653/v1/2023.emnlp-main.20

Association for Computational Linguistics. doi: 10.18653/v1/2023.emnlp-main.20. URL https: //aclanthology.org/2023.emnlp-main.20/

work page doi:10.18653/v1/2023.emnlp-main.20 2023

[26] [28]

CAPability: A comprehensive visual caption benchmark for evaluating both correctness and thoroughness.ArXiv preprint, abs/2502.14914, 2025

Zhihang Liu, Chen-Wei Xie, Bin Wen, Feiwu Yu, Jixuan Chen, Pandeng Li, Boqiang Zhang, Nianzu Yang, Yinglu Li, Zuan Gao, Yun Zheng, and Hongtao Xie. CAPability: A comprehensive visual caption benchmark for evaluating both correctness and thoroughness.ArXiv preprint, abs/2502.14914, 2025. URL https://arxiv.org/abs/2502.14914

work page arXiv 2025

[27] [29]

Benchmarking large vision-language models via directed scene graph for comprehensive image captioning.ArXiv preprint, abs/2412.08614, 2024

Fan Lu, Wei Wu, Kecheng Zheng, Shuailei Ma, Biao Gong, Jiawei Liu, Wei Zhai, Yang Cao, Yujun Shen, and Zheng-Jun Zha. Benchmarking large vision-language models via directed scene graph for comprehensive image captioning.ArXiv preprint, abs/2412.08614, 2024. URL https://arxiv.org/ abs/2412.08614

work page arXiv 2024

[28] [30]

Contextrl: Enhancing mllm’s knowledge discovery efficiency with context-augmented rl.ArXiv preprint, abs/2602.22623, 2026

Xingyu Lu, Jinpeng Wang, YiFan Zhang, Shijie Ma, Xiao Hu, Tianke Zhang, Haonan fan, Kaiyu Jiang, Changyi Liu, Kaiyu Tang, Bin Wen, Fan Yang, Tingting Gao, Han Li, and Chun Yuan. Contextrl: Enhancing mllm’s knowledge discovery efficiency with context-augmented rl.ArXiv preprint, abs/2602.22623, 2026. URLhttps://arxiv.org/abs/2602.22623

work page arXiv 2026

[29] [31]

Videocap-R1: Enhancing MLLMs for video captioning via structured thinking.ArXiv preprint, abs/2506.01725, 2025

Desen Meng, Rui Huang, Zhilin Dai, Xinhao Li, Yifan Xu, Jun Zhang, Zhenpeng Huang, Meng Zhang, Lingshu Zhang, Yi Liu, and Limin Wang. Videocap-R1: Enhancing MLLMs for video captioning via structured thinking.ArXiv preprint, abs/2506.01725, 2025. URL https://arxiv.org/abs/2506. 01725

work page arXiv 2025

[30] [32]

OpenVid-1M: A Large-Scale High-Quality Dataset for Text-to-video Generation

Kepan Nan, Rui Xie, Penghao Zhou, Tiehan Fan, Zhenheng Yang, Zhijie Chen, Xiang Li, Jian Yang, and Ying Tai. Openvid-1M: A large-scale high-quality dataset for text-to-video generation.ArXiv preprint, abs/2407.02371, 2024. URLhttps://arxiv.org/abs/2407.02371

work page internal anchor Pith review Pith/arXiv arXiv 2024

[31] [33]

Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll L. Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, John Schulman, Jacob Hilton, Fraser Kelton, Luke Miller, Maddie Simens, Amanda Askell, Peter Welinder, Paul F. Christiano, Jan Leike, and Ryan Lowe. Training language models to follow instructions with human fee...

2022

[32] [34]

Prism: A framework for decoupling and assessing the capabilities of vlms

Yuxuan Qiao, Haodong Duan, Xinyu Fang, Junming Yang, Lin Chen, Songyang Zhang, Jiaqi Wang, Dahua Lin, and Kai Chen. Prism: A framework for decoupling and assessing the capabilities of vlms. In Amir Globersons, Lester Mackey, Danielle Belgrave, Angela Fan, Ulrich Paquet, Jakub M. Tomczak, and Cheng Zhang, editors,Advances in Neural Information Processing S...

2024

[33] [35]

ARGUS: Hallucination and omission evaluation in video-LLMs.ArXiv preprint, abs/2506.07371, 2025

Ruchit Rawal, Reza Shirkavand, Heng Huang, Gowthami Somepalli, and Tom Goldstein. ARGUS: Hallucination and omission evaluation in video-LLMs.ArXiv preprint, abs/2506.07371, 2025. URL https://arxiv.org/abs/2506.07371

work page arXiv 2025

[34] [36]

LAION-5B: an open 12 large-scale dataset for training next generation image-text models

Christoph Schuhmann, Romain Beaumont, Richard Vencu, Cade Gordon, Ross Wightman, Mehdi Cherti, Theo Coombes, Aarush Katta, Clayton Mullis, Mitchell Wortsman, Patrick Schramowski, Srivatsa Kun- durthy, Katherine Crowson, Ludwig Schmidt, Robert Kaczmarczyk, and Jenia Jitsev. LAION-5B: an open 12 large-scale dataset for training next generation image-text mo...

2022

[35] [37]

Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, Y .K. Li, Y . Wu, and Daya Guo. DeepSeekMath: Pushing the limits of mathematical reasoning in open language models.ArXiv preprint, abs/2402.03300, 2024. URL https://arxiv.org/abs/2402. 03300

work page internal anchor Pith review Pith/arXiv arXiv 2024

[36] [38]

VLM-R1: A Stable and Generalizable R1-style Large Vision-Language Model

Haozhan Shen, Peng Liu, Jingcheng Li, Chunxin Fang, Yibo Ma, Jiajia Liao, Qiaoli Shen, Zilun Zhang, Kangjia Zhao, Qianqian Zhang, Ruochen Xu, and Tiancheng Zhao. Vlm-R1: A stable and generalizable R1-style large vision-language model.ArXiv preprint, abs/2504.07615, 2025. URL https://arxiv. org/abs/2504.07615

work page internal anchor Pith review Pith/arXiv arXiv 2025

[37] [39]

From pixels to prose: A large dataset of dense image captions.ArXiv preprint, abs/2406.10328, 2024

Vasu Singla, Kaiyu Yue, Sukriti Paul, Reza Shirkavand, Mayuka Jayawardhana, Alireza Ganjdanesh, Heng Huang, Abhinav Bhatele, Gowthami Somepalli, and Tom Goldstein. From pixels to prose: A large dataset of dense image captions.ArXiv preprint, abs/2406.10328, 2024. URL https://arxiv.org/abs/2406. 10328

work page arXiv 2024

[38] [40]

Enhancing descriptive captions with visual attributes for multimodal perception.ArXiv preprint, abs/2412.14233, 2024

Yanpeng Sun, Jing Hao, Ke Zhu, Jiang-Jiang Liu, Yuxiang Zhao, Xiaofan Li, Na Zhao, Zechao Li, and Jingdong Wang. Enhancing descriptive captions with visual attributes for multimodal perception.ArXiv preprint, abs/2412.14233, 2024. URLhttps://arxiv.org/abs/2412.14233

work page arXiv 2024

[39] [41]

Aligning large multimodal models with factually augmented RLHF

Zhiqing Sun, Sheng Shen, Shengcao Cao, Haotian Liu, Chunyuan Li, Yikang Shen, Chuang Gan, Liangyan Gui, Yu-Xiong Wang, Yiming Yang, Kurt Keutzer, and Trevor Darrell. Aligning large multimodal models with factually augmented RLHF. In Lun-Wei Ku, Andre Martins, and Vivek Srikumar, editors, Findings of the Association for Computational Linguistics: ACL 2024,...

work page doi:10.18653/v1/2024.findings-acl.775 2024

[40] [42]

arXiv:2506.15220 , year =

Changli Tang, Yixuan Li, Yudong Yang, Jimin Zhuang, Guangzhi Sun, Wei Li, Zejun Ma, and Chao Zhang. video-SALMONN 2: Caption-enhanced audio-visual large language models.ArXiv preprint, abs/2506.15220, 2025. URLhttps://arxiv.org/abs/2506.15220

work page arXiv 2025

[41] [43]

Shamma, Gerald Friedland, Benjamin Elizalde, Karl Ni, Douglas Poland, Damian Borth, and Li-Jia Li

Bart Thomee, David A. Shamma, Gerald Friedland, Benjamin Elizalde, Karl Ni, Douglas Poland, Damian Borth, and Li-Jia Li. YFCC100M: The new data in multimedia research.Communications of the ACM, 59 (2):64–73, 2016

2016

[42] [44]

URL https://proceedings.mlr

Jack Urbanek, Florian Bordes, Pietro Astolfi, Mary Williamson, Vasu Sharma, and Adriana Romero- Soriano. A picture is worth more than 77 text tokens: Evaluating clip-style models on dense captions. InIEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2024, Seattle, WA, USA, June 16-22, 2024, pages 26690–26699. IEEE, 2024. doi: 10.1109/CV...

work page doi:10.1109/cvpr52733.2024.02521 2024

[43] [45]

Huang, Nan Xu, Sheng Zhang, Hoifung Poon, and Muhao Chen

Fei Wang, Wenxuan Zhou, James Y . Huang, Nan Xu, Sheng Zhang, Hoifung Poon, and Muhao Chen. mDPO: Conditional preference optimization for multimodal large language models. In Yaser Al-Onaizan, Mohit Bansal, and Yun-Nung Chen, editors,Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pages 8078–8088, Miami, Florida, US...

work page doi:10.18653/v1/2024.emnlp-main.460 2024

[44] [46]

Tarsier: Recipes for training and evaluating large video description models.ArXiv preprint, abs/2407.00634, 2024

Jiawei Wang, Liping Yuan, Yuchen Zhang, and Haomiao Sun. Tarsier: Recipes for training and evaluating large video description models.ArXiv preprint, abs/2407.00634, 2024. URL https://arxiv.org/abs/ 2407.00634

work page arXiv 2024

[45] [47]

AMBER: An LLM-free Multi-dimensional Benchmark for MLLMs Hallucination Evaluation

Junyang Wang, Yuhang Wang, Guohai Xu, Jing Zhang, Yukai Gu, Haitao Jia, Jiaqi Wang, Haiyang Xu, Ming Yan, Ji Zhang, and Jitao Sang. AMBER: An LLM-free multi-dimensional benchmark for MLLMs hallucination evaluation.ArXiv preprint, abs/2311.07397, 2023. URL https://arxiv.org/abs/2311. 07397

work page internal anchor Pith review Pith/arXiv arXiv 2023

[46] [48]

Vdc-agent: When video detailed captioners evolve themselves via agentic self-reflection.ArXiv preprint, abs/2511.19436, 2025

Qiang Wang, Xinyuan Gao, SongLin Dong, Jizhou Han, Jiangyang Li, Yuhang He, and Yihong Gong. Vdc-agent: When video detailed captioners evolve themselves via agentic self-reflection.ArXiv preprint, abs/2511.19436, 2025. URLhttps://arxiv.org/abs/2511.19436. 13

work page arXiv 2025

[47] [49]

Vicrit: A verifiable reinforcement learning proxy task for visual perception in VLMs.ArXiv preprint, abs/2506.10128, 2025

Xiyao Wang, Zhengyuan Yang, Chao Feng, Yongyuan Liang, Yuhang Zhou, Xiaoyu Liu, Ziyi Zang, Ming Li, Chung-Ching Lin, Kevin Lin, Linjie Li, Furong Huang, and Lijuan Wang. Vicrit: A verifiable reinforcement learning proxy task for visual perception in VLMs.ArXiv preprint, abs/2506.10128, 2025. URLhttps://arxiv.org/abs/2506.10128

work page arXiv 2025

[48] [50]

Detecting and mitigating hallucination in large vision language models via fine-grained AI feedback

Wenyi Xiao, Ziwei Huang, Leilei Gan, Wanggui He, Haoyuan Li, Zhelun Yu, Fangxun Shu, Hao Jiang, and Linchao Zhu. Detecting and mitigating hallucination in large vision language models via fine-grained AI feedback. In Toby Walsh, Julie Shah, and Zico Kolter, editors,Thirty-Ninth AAAI Conference on Artificial Intelligence, Thirty-Seventh Conference on Innov...

work page doi:10.1609/aaai.v39i24.34744 2025

[49] [51]

NeMo guardrails: A toolkit for controllable and safe LLM applications with pro- grammable rails

Yuxi Xie, Guanzhen Li, Xiao Xu, and Min-Yen Kan. V-DPO: Mitigating hallucination in large vision language models via vision-guided direct preference optimization. In Yaser Al-Onaizan, Mohit Bansal, and Yun-Nung Chen, editors,Findings of the Association for Computational Linguistics: EMNLP 2024, pages 13258–13273, Miami, Florida, USA, 2024. Association for...

work page doi:10.18653/v1/ 2024

[50] [53]

Caprl: Stimulating dense image caption capabilities via reinforcement learning.ArXiv preprint, abs/2509.22647, 2025

Long Xing, Xiaoyi Dong, Yuhang Zang, Yuhang Cao, Jianze Liang, Qidong Huang, Jiaqi Wang, Feng Wu, and Dahua Lin. Caprl: Stimulating dense image caption capabilities via reinforcement learning.ArXiv preprint, abs/2509.22647, 2025. URLhttps://arxiv.org/abs/2509.22647

work page arXiv 2025

[51] [54]

LLaV A-Critic: Learning to evaluate multimodal models.ArXiv preprint, abs/2410.02712, 2024

Tianyi Xiong, Xiyao Wang, Dong Guo, Qinghao Ye, Haoqi Fan, Quanquan Gu, Heng Huang, and Chunyuan Li. LLaV A-Critic: Learning to evaluate multimodal models.ArXiv preprint, abs/2410.02712, 2024. URL https://arxiv.org/abs/2410.02712

work page arXiv 2024

[52] [55]

Vript: A video is worth thousands of words

Dongjie Yang, Suyuan Huang, Chengqiang Lu, Xiaodong Han, Haoxin Zhang, Yan Gao, Yao Hu, and Hai Zhao. Vript: A video is worth thousands of words. In Amir Globersons, Lester Mackey, Danielle Belgrave, Angela Fan, Ulrich Paquet, Jakub M. Tomczak, and Cheng Zhang, editors,Advances in Neural Information Processing Systems 38: Annual Conference on Neural Infor...

2024

[53] [56]

Painting with words: Elevating detailed image captioning with benchmark and alignment learning

Qinghao Ye, Xianhan Zeng, Fu Li, Chunyuan Li, and Haoqi Fan. Painting with words: Elevating detailed image captioning with benchmark and alignment learning. InThe Thirteenth International Conference on Learning Representations, ICLR 2025, Singapore, April 24-28, 2025. OpenReview.net, 2025. URL https://openreview.net/forum?id=636M0nNbPs

2025

[54] [57]

Peter Young, Alice Lai, Micah Hodosh, and Julia Hockenmaier. From image descriptions to visual denotations: New similarity metrics for semantic inference over event descriptions.Transactions of the Association for Computational Linguistics, 2:67–78, 2014. doi: 10.1162/tacl_a_00166. URL https: //aclanthology.org/Q14-1006/

work page doi:10.1162/tacl_a_00166 2014

[55] [58]

URL https://proceedings.mlr

Qiying Yu, Quan Sun, Xiaosong Zhang, Yufeng Cui, Fan Zhang, Yue Cao, Xinlong Wang, and Jingjing Liu. Capsfusion: Rethinking image-text data at scale. InIEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2024, Seattle, WA, USA, June 16-22, 2024, pages 14022–14032. IEEE, 2024. doi: 10.1109/CVPR52733.2024.01330. URLhttps://doi.org/10.1109/C...

work page doi:10.1109/cvpr52733.2024.01330 2024

[56] [59]

URL https://proceedings.mlr

Tianyu Yu, Yuan Yao, Haoye Zhang, Taiwen He, Yifeng Han, Ganqu Cui, Jinyi Hu, Zhiyuan Liu, Hai-Tao Zheng, and Maosong Sun. RLHF-V: towards trustworthy mllms via behavior alignment from fine-grained correctional human feedback. InIEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2024, Seattle, WA, USA, June 16-22, 2024, pages 13807–13816...

work page doi:10.1109/cvpr52733.2024.01310 2024

[57] [60]

Showui: One vision-language- action model for GUI visual agent

Tianyu Yu, Haoye Zhang, Qiming Li, Qixin Xu, Yuan Yao, Da Chen, Xiaoman Lu, Ganqu Cui, Yunkai Dang, Taiwen He, Xiaocheng Feng, Jun Song, Bo Zheng, Zhiyuan Liu, Tat-Seng Chua, and Maosong Sun. RLAIF-V: open-source AI feedback leads to super GPT-4V trustworthiness. InIEEE/CVF Confer- ence on Computer Vision and Pattern Recognition, CVPR 2025, Nashville, TN,...

work page doi:10.1109/cvpr52734.2025.01861 2025

[58] [61]

Tarsier2: Advancing large vision-language models from detailed video description to comprehensive video understanding.arXiv:2501.07888, 2025b

Liping Yuan, Jiawei Wang, Haomiao Sun, Yuchen Zhang, and Yuan Lin. Tarsier2: Advancing large vision-language models from detailed video description to comprehensive video understanding.ArXiv preprint, abs/2501.07888, 2025. URLhttps://arxiv.org/abs/2501.07888

work page arXiv 2025

[59] [62]

Sc-captioner: Improving image captioning with self-correction by reinforcement learning.ArXiv preprint, abs/2508.06125, 2025

Lin Zhang, Xianfang Zeng, Kangcong Li, Gang Yu, and Tao Chen. Sc-captioner: Improving image captioning with self-correction by reinforcement learning.ArXiv preprint, abs/2508.06125, 2025. URL https://arxiv.org/abs/2508.06125

work page arXiv 2025

[60] [63]

− modality

Chunlin Zhong, Qiuxia Hou, Zhangjun Zhou, Shuang Hao, Haonan Lu, Yanhao Zhang, He Tang, and Xiang Bai. Owlcap: Harmonizing motion-detail for video captioning via HMD-270K and caption set equivalence reward.ArXiv preprint, abs/2508.18634, 2025. URLhttps://arxiv.org/abs/2508.18634. 15 A Appendix Appendix Contents A.1 Experimental Setup Details . . . . . . ....

work page arXiv 2025

[61] [64]

all elements have been described

Analyze the Generated Caption on three dimensions: • Correctness: does it accurately represent the image, free of objects not present, inaccuracies, or contradictions? Fewer mistakes is better. • Completeness: does it cover all objects in detail with no omissions of details or objects? Fewer omissions is better. • Text Quality: is it logically fluent, coh...

[62] [65]

Analysis

Provide an integer score in{0,1, . . . ,10}for each dimension. Input. Reference Caption:ref_answer Generated Caption:gen_solution Output format (strict).Return exactly one JSON object: {"Analysis":⟨your analysis⟩, "Correctness": score1, "Completeness": score2, "Text Quality": score3} Figure 4: Reward-model instruction for image captioning when a reference...

[63] [66]

Read the Generated Description; inspect frames in timestamp order; compare against the Refer- ence Description and decide by frames on disagreements

[64] [67]

List confirmed hallucinations, inaccuracies, omissions, and timestamp mismatches per dimension

[65] [68]

Analysis

Provide an integer score in{0,1, . . . ,10}for each dimension. Input. Reference Description:REF_DESC Generated Description:GEN_DESC Output format (strict).Return exactly one JSON object: {"Analysis":⟨your analysis⟩, "Reasonability": score1, "Correctness": score2, "Completeness": score3} Figure 6: Reward-model instruction for the global pass over a full vi...

[66] [69]

the image actually shows A, but the caption says B,

inconsistent (the image clearly contradicts the assertion), or3. undecidable (the referenced object is absent, or visible but not resolvable to the required level of detail due to occlusion, low resolution, or ambiguous angle). For an inconsistent proposition with the form “the image actually shows A, but the caption says B,” Step 1 only adjudicates the “...

[67] [70]

Step 2: proposition vs

undecidable is reserved for genuinely unresolvable cases: whenever the relevant object is visible and clear, the annotator must commit to either consistent or inconsistent. Step 2: proposition vs. caption (conditional).Step 2 is executed only when Step 1 returns

[68] [71]

consistent ; otherwise it is left blank. The annotator first locates the relevant object in the image, then locates the corresponding span in the caption (most captions follow a fore- ground/midground/background or left/center/right organization), and compares the proposition, 27 Table 8: Quick reference for the two-step proposition labels used in the hum...

[69] [72]

image showsA

consistent 1. not holding Image really contains the asserted content; caption is too vague to decide1. consistent 3. ambiguous Image does not contain the asserted content2. inconsistent— Object is occluded / blurred / unresolvable in the image3. undecidable— Proposition: “image showsA”; image is indeedA; caption saysB1. consistent 2. holding Proposition: ...

[70] [73]

image showsA

consistent 1. not holding Proposition: “image showsA”; image actually showsC(̸=A,̸=B)2. inconsistent— the caption span, and the image jointly. The verdict is one of: 1. not holding (the proposition does not in fact hold against the caption: a missing proposition whose content is already covered by the caption via a synonym or hypernym, or an inconsistent ...

[71] [74]

subjective

holding , because models frequently distribute related content across non-adjacent paragraphs. If a caption self-contradicts, the existence of the erroneous wording is sufficient to label2. holding on the inconsistent side. Step 3 per-image rankings.After all propositions for the five captions of an image are anno- tated, the annotator produces three inde...