arxiv: 2606.09393 · v1 · pith:4QYG4AAMnew · submitted 2026-06-08 · 💻 cs.CV

CapRL++: Unified Reinforcement Learning with Verifiable Rewards for Dense Image and Video Captioning

Pith reviewed 2026-06-27 17:19 UTC · model grok-4.3

classification 💻 cs.CV
keywords reinforcement learning · dense captioning · vision-language models · verifiable rewards · image captioning · video captioning · multimodal pretraining · question answering proxy

The pith

CapRL++ uses a text-only LLM's accuracy on questions about a caption as the reward signal to train vision-language models for better dense image and video descriptions.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces CapRL++, a reinforcement learning framework that scores generated captions by how accurately a separate vision-free language model can answer multiple-choice questions using only the caption text. This replaces supervised fine-tuning on fixed references and lets training scale to large unlabeled image and video datasets. The resulting captions improve downstream performance on spatial and temporal understanding tasks and allow smaller models to reach dense-captioning levels previously seen only in much larger systems. If the utility-based reward works, caption quality becomes measurable without human references and pretraining data can be generated automatically at scale.

Core claim

CapRL++ is a decoupled two-stage RL pipeline in which an LVLM first produces a caption and a vision-free LLM then answers multiple-choice questions about the visual scene using only that caption; the accuracy of those answers supplies the verifiable reward. Training with this signal yields captions that support stronger performance on more than twenty image and video benchmarks, strengthen caption-based pretraining, and let compact models match the dense-captioning results of Qwen2.5-VL-72B and Qwen3-VL-235B-A22B inside the Prism evaluation framework.

What carries the argument

The CapRL++ two-stage pipeline that converts caption utility, measured as vision-free LLM accuracy on MCQs, into a scalar reward for reinforcement learning.
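
The reward described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the paper's implementation: the `MCQ` dataclass, `caption_reward`, and `toy_answer_fn` are hypothetical names, and the toy answerer is a keyword matcher standing in for the vision-free LLM.

```python
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass(frozen=True)
class MCQ:
    question: str
    options: Sequence[str]  # candidate answers
    answer_index: int       # index of the correct option


# Hypothetical signature: (caption, question, options) -> chosen option index.
AnswerFn = Callable[[str, str, Sequence[str]], int]


def caption_reward(caption: str, mcqs: Sequence[MCQ], answer_fn: AnswerFn) -> float:
    """Fraction of MCQs answered correctly from the caption text alone."""
    if not mcqs:
        return 0.0
    correct = sum(
        answer_fn(caption, q.question, q.options) == q.answer_index for q in mcqs
    )
    return correct / len(mcqs)


def toy_answer_fn(caption: str, question: str, options: Sequence[str]) -> int:
    """Stand-in for the vision-free LLM: pick the first option the caption mentions."""
    lowered = caption.lower()
    for i, option in enumerate(options):
        if option.lower() in lowered:
            return i
    return 0  # fall back to the first option when nothing matches


# Illustrative questions and captions, made up for this sketch.
mcqs = [
    MCQ("What colour is the car?", ["blue", "red", "green"], 1),
    MCQ("Where is the dog?", ["on the sofa", "under the table", "in the garden"], 2),
]
detailed = "A red car is parked outside while a dog plays in the garden."
sparse = "A car and a dog."

reward_detailed = caption_reward(detailed, mcqs, toy_answer_fn)
reward_sparse = caption_reward(sparse, mcqs, toy_answer_fn)
```

In this toy setting the detailed caption scores higher than the sparse one because it contains the facts the questions ask about. A real pipeline would replace `toy_answer_fn` with a call to a text-only LLM and feed the scalar into an RL update.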

If this is right

  • Compact models trained with CapRL++ reach dense-captioning performance comparable to models with 72B and 235B parameters.
  • Pretraining on image and video datasets annotated by CapRL++ produces substantial gains on spatial and temporal understanding benchmarks.
  • Caption quality is redefined as the ability of the text to enable accurate question answering without visual input.
  • The method removes dependence on expensive human reference annotations for open-ended captioning.
  • RLVR can be applied directly to multimodal captioning without reference captions.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • If the MCQ proxy remains stable across domains, the same reward signal could generate training data for other open-ended visual generation tasks.
  • The approach might reduce the need for ever-larger model sizes by improving data efficiency instead.
  • One could test whether captions produced this way transfer to entirely new visual-question-answering datasets never seen during training.
  • Extending the pipeline to video with temporal MCQs could further strengthen long-range understanding claims.

Load-bearing premise

The accuracy of answers from a separate vision-free LLM on multiple-choice questions is a reliable and unbiased proxy for the quality and utility of the generated dense caption.

What would settle it

Train a model with CapRL++ and a control model with standard SFT on the same data, then measure whether the CapRL++ captions produce measurably higher accuracy when a vision-free LLM answers held-out questions or when the captions are used for downstream spatial-temporal tasks; equal or lower performance would falsify the claim.
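
One way to make "measurably higher" concrete is a paired bootstrap over per-question outcomes. The sketch below is an assumption about how such a comparison could be run, not a procedure from the paper; `paired_bootstrap_diff` is a hypothetical helper and the 0/1 outcome lists are placeholders, not reported results.

```python
import random
from typing import Sequence, Tuple


def paired_bootstrap_diff(
    correct_a: Sequence[int],
    correct_b: Sequence[int],
    n_resamples: int = 2000,
    seed: int = 0,
) -> Tuple[float, float, float]:
    """Mean accuracy difference (A - B) with a 95% percentile bootstrap interval.

    correct_a[i] and correct_b[i] are 0/1 outcomes on the same held-out question i.
    """
    if len(correct_a) != len(correct_b) or not correct_a:
        raise ValueError("need two non-empty, equally long outcome lists")
    if n_resamples < 1:
        raise ValueError("n_resamples must be at least 1")
    n = len(correct_a)
    diffs = [a - b for a, b in zip(correct_a, correct_b)]
    observed = sum(diffs) / n
    rng = random.Random(seed)  # fixed seed keeps the sketch reproducible
    resampled = sorted(
        sum(diffs[rng.randrange(n)] for _ in range(n)) / n
        for _ in range(n_resamples)
    )
    low = resampled[int(0.025 * n_resamples)]
    high = resampled[int(0.975 * n_resamples) - 1]
    return observed, low, high


# Placeholder outcomes (1 = held-out question answered correctly from the caption).
caprl_correct = [1, 1, 1, 0, 1, 1, 0, 1, 1, 1]
sft_correct = [1, 0, 1, 0, 1, 0, 0, 1, 1, 0]
observed, low, high = paired_bootstrap_diff(caprl_correct, sft_correct)
```

An interval that lies clearly above zero on held-out questions would support the claim. An interval that includes zero or lies below it corresponds to the "equal or lower performance" outcome described above.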

Original abstract

Image and video captioning are fundamental tasks that bridge the visual and linguistic domains, playing a critical role in pre-training Large Vision-Language Models (LVLMs). Current state-of-the-art captioning models are typically trained with Supervised Fine-Tuning (SFT), a paradigm that relies on expensive, non-scalable annotations and often causes models to memorize specific ground-truth answers, limiting their generality and ability to generate diverse, creative descriptions. To overcome these limitations, we propose applying Reinforcement Learning with Verifiable Rewards (RLVR) to the open-ended task of multimodal captioning. We introduce Captioning Reinforcement Learning++ (CapRL++), a novel reference-free training framework that redefines caption quality through its utility: a high-quality caption should enable a non-visual language model to accurately answer questions about the corresponding visual content. CapRL++ employs a decoupled two-stage pipeline where an LVLM generates a caption, and the objective reward is derived from the accuracy of a separate, vision-free LLM answering Multiple-Choice Questions based solely on that caption. Evaluations on more than 20 image and video benchmarks show that CapRL++ improves dense caption quality and strengthens caption-based pretraining across tasks such as spatial and temporal understanding. Pretraining on scalable image and video caption datasets annotated by CapRL++ yields substantial downstream gains. Furthermore, within the Prism Framework for caption quality evaluation, compact models trained with CapRL++ achieve dense captioning performance comparable to substantially larger models such as Qwen2.5-VL-72B and Qwen3-VL-235B-A22B. These results validate that CapRL++ effectively trains models to produce generalizable, high-fidelity descriptions, establishing a robust foundation beyond the limitations of traditional SFT.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper introduces CapRL++, a reference-free RL framework for dense image and video captioning. An LVLM generates a caption and receives a reward equal to the accuracy achieved by a separate vision-free LLM when answering MCQs derived from that caption alone. The authors report gains on more than 20 benchmarks, improved caption-based pretraining, and that compact models trained with CapRL++ reach dense-captioning performance comparable to Qwen2.5-VL-72B and Qwen3-VL-235B-A22B.

Significance. If the MCQ-based proxy is shown to be a faithful measure of caption utility, the method supplies a scalable, annotation-free alternative to SFT and could materially improve LVLM pretraining. The reported parity between compact CapRL++ models and models two orders of magnitude larger would be a notable empirical result.

major comments (3)
  1. [Abstract, §4] Abstract and §4 (Experiments): the central claim that CapRL++ yields captions whose utility matches much larger models rests entirely on the MCQ accuracy proxy, yet no correlation is reported between this proxy and either reference-based metrics (CIDEr, SPICE) or human judgments of caption quality. Without such anchoring data the mapping from reward to claimed caption fidelity remains unverified.
  2. [§3.2] §3.2 (Reward formulation): the two-stage pipeline decouples caption generation from the reward LLM; no ablation is presented that tests whether accuracy gains arise from richer visual descriptions or from the caption exploiting the reward LLM’s textual priors on the chosen MCQs. This directly affects the claim that the reward is “verifiable” and unbiased.
  3. [Table 2, §5.1] Table 2 and §5.1 (Downstream pretraining): the reported gains on spatial/temporal understanding tasks are attributed to CapRL++ captions, but the manuscript does not control for the choice of MCQ generator or the specific vision-free LLM, leaving open the possibility that results are tied to a particular external model rather than the captioning objective itself.
minor comments (2)
  1. [§3.1] Notation for the reward function r(c) is introduced without an explicit equation number; adding Eq. (X) would improve readability.
  2. [Figure 3] Figure 3 caption does not state the number of MCQs per image/video or the source of the question pool.
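
The anchoring analysis requested in major comment 1 amounts to correlating per-caption MCQ accuracy with a second quality signal. The following is a dependency-free sketch of Pearson and Spearman correlation; the function names are illustrative, and the score lists are placeholders rather than measurements from the manuscript.

```python
from math import sqrt
from typing import List, Sequence


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / sqrt(var_x * var_y)


def average_ranks(values: Sequence[float]) -> List[float]:
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman rank correlation: Pearson correlation of the average ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Placeholder per-caption scores: MCQ accuracy vs. a hypothetical human rating.
mcq_accuracy = [0.2, 0.5, 0.6, 0.9]
human_rating = [1.0, 3.0, 3.5, 5.0]
rho = spearman(mcq_accuracy, human_rating)
```

A rank correlation is a natural first check here because MCQ accuracy and human ratings sit on different scales; the same functions apply unchanged to CIDEr or SPICE scores.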

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We address each major comment below with clarifications on our design choices and indicate revisions that will be incorporated to strengthen the validation of the MCQ proxy and related claims.

Point-by-point responses
  1. Referee: [Abstract, §4] Abstract and §4 (Experiments): the central claim that CapRL++ yields captions whose utility matches much larger models rests entirely on the MCQ accuracy proxy, yet no correlation is reported between this proxy and either reference-based metrics (CIDEr, SPICE) or human judgments of caption quality. Without such anchoring data the mapping from reward to claimed caption fidelity remains unverified.

    Authors: We acknowledge that the current manuscript does not include explicit correlation coefficients between MCQ accuracy and reference-based metrics or human judgments. The primary validation instead comes from consistent improvements across more than 20 downstream benchmarks and caption-based pretraining gains, which demonstrate the practical utility of the proxy. To directly address the anchoring concern, we will add a new analysis subsection in §4 that computes and reports Pearson/Spearman correlations with CIDEr and SPICE on held-out data, plus agreement rates with human ratings on a 500-caption subset. This will be included in the revised manuscript.
    revision: yes

  2. Referee: [§3.2] §3.2 (Reward formulation): the two-stage pipeline decouples caption generation from the reward LLM; no ablation is presented that tests whether accuracy gains arise from richer visual descriptions or from the caption exploiting the reward LLM’s textual priors on the chosen MCQs. This directly affects the claim that the reward is “verifiable” and unbiased.

    Authors: The two-stage decoupling is a deliberate design choice to produce a reference-free, verifiable reward that depends only on caption utility for a vision-free LLM. Dynamic MCQ generation from each caption further reduces the chance of static prior exploitation. While the manuscript does not contain an explicit ablation isolating textual priors, the broad gains on spatial/temporal tasks across benchmarks provide supporting evidence that visual content is being captured. We will add a targeted ablation in the revision that re-uses identical MCQ sets across varied captions to quantify the contribution of description richness versus prior matching.
    revision: yes

  3. Referee: [Table 2, §5.1] Table 2 and §5.1 (Downstream pretraining): the reported gains on spatial/temporal understanding tasks are attributed to CapRL++ captions, but the manuscript does not control for the choice of MCQ generator or the specific vision-free LLM, leaving open the possibility that results are tied to a particular external model rather than the captioning objective itself.

    Authors: The reported experiments employ a reproducible MCQ generation procedure and a fixed, publicly documented vision-free LLM to ensure the framework is not tied to proprietary components. The improvements appear consistently across diverse tasks, suggesting the objective itself drives the gains. To strengthen this, the revision will include additional controls using an alternative vision-free LLM (e.g., a Llama-3 variant) and a second MCQ generator, with results reported alongside the original Table 2 to demonstrate robustness independent of the specific external model.
    revision: yes
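
A simple control for the textual-prior concern in major comment 2 is a blind baseline: score the same MCQs with an empty caption and subtract. This is a different and cheaper check than the ablation the simulated rebuttal proposes, and it is only a sketch. `caption_gain` and `toy_prior_answer_fn` are hypothetical names, and the toy answerer fakes a textual prior by preferring the longest option when the caption gives no evidence.

```python
from typing import Callable, Sequence, Tuple

# (question text, options, index of the correct option)
Question = Tuple[str, Sequence[str], int]
# Hypothetical signature: (caption, question, options) -> chosen option index.
AnswerFn = Callable[[str, str, Sequence[str]], int]


def accuracy(caption: str, questions: Sequence[Question], answer_fn: AnswerFn) -> float:
    """Fraction of questions answered correctly given only the caption text."""
    if not questions:
        return 0.0
    hits = sum(answer_fn(caption, q, opts) == gold for q, opts, gold in questions)
    return hits / len(questions)


def caption_gain(
    caption: str, questions: Sequence[Question], answer_fn: AnswerFn
) -> Tuple[float, float, float]:
    """Return (accuracy with caption, blind accuracy with empty caption, difference)."""
    with_caption = accuracy(caption, questions, answer_fn)
    blind = accuracy("", questions, answer_fn)
    return with_caption, blind, with_caption - blind


def toy_prior_answer_fn(caption: str, question: str, options: Sequence[str]) -> int:
    """Stand-in answerer: use caption evidence if present, else a length-based prior."""
    lowered = caption.lower()
    for i, option in enumerate(options):
        if option.lower() in lowered:
            return i
    # Simulated textual prior: guess the longest option.
    return max(range(len(options)), key=lambda i: len(options[i]))


# Illustrative questions: the first is guessable from the prior, the second is not.
questions = [
    ("What is the animal doing?", ["sleeping", "running across the field", "eating"], 1),
    ("What colour is the kite?", ["yellow", "red", "blue"], 2),
]
caption = "A dog is running across the field under a blue kite."
with_caption, blind, gain = caption_gain(caption, questions, toy_prior_answer_fn)
```

Here half of the caption's apparent accuracy is already available without any caption, so only the remaining gain can be credited to the description. Reporting the blind accuracy alongside the reward would show how much of the signal the MCQ pool leaks through priors.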

Circularity Check

0 steps flagged

No significant circularity in CapRL++ derivation chain

Full rationale

The paper defines a reward signal via an independent external component (vision-free LLM accuracy on MCQs generated from the LVLM caption) and applies RL to maximize it. This target is distinct from the downstream evaluation benchmarks (20+ image/video tasks) used to claim improvements and comparable performance to larger models. No equations or steps reduce the claimed results to the inputs by construction, no self-citations are load-bearing for uniqueness or ansatzes, and no fitted parameters are relabeled as predictions. The framework is self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The framework depends on the domain assumption that LLM QA accuracy serves as a faithful proxy for caption quality; no free parameters or invented entities are identifiable from the abstract alone.

axioms (1)
  • domain assumption Accuracy of a vision-free LLM on MCQs derived from the caption is a valid and generalizable measure of caption quality.
    This premise defines the reward function and is invoked to justify the two-stage pipeline.

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Beyond the Current Observation: Evaluating Multimodal Large Language Models in Controllable Non-Markov Games

    cs.CV 2026-06 unverdicted novelty 7.0

    RNG-Bench evaluates MLLMs on hidden-observation reconstruction in non-Markov games, finds forgetting as the dominant error source, and shows fine-tuning on optimal rollouts improves performance with transfer to other ...

Reference graph

Works this paper leans on

102 extracted references · 35 canonical work pages · cited by 1 Pith paper · 20 internal anchors

  1. [1]

    SemDeDup: Data-efficient learning at web-scale through semantic deduplication

    Amro Abbas, Kushal Tirumala, Dániel Simig, Surya Ganguli, and Ari S Morcos. Semdedup: Data-efficient learning at web-scale through semantic deduplication.arXiv preprint arXiv:2303.09540, 2023. 3.5

  2. [2]

    A review of deep learning for video captioning.IEEE Transactions on Pattern Analysis and Machine Intelligence, 2024

    Moloud Abdar, Meenakshi Kollati, Swaraja Kuraparthi, Farhad Pourpanah, Daniel McDuff, Mohammad Ghavamzadeh, Shuicheng Yan, Abduallah Mohamed, Abbas Khosravi, Erik Cambria, et al. A review of deep learning for video captioning.IEEE Transactions on Pattern Analysis and Machine Intelligence, 2024. 2

  3. [3]

    YouTube-8M: A Large-Scale Video Classification Benchmark

    Sami Abu-El-Haija, Nisarg Kothari, Joonseok Lee, Paul Natsev, George Toderici, Balakrishnan Varadara- jan, and Sudheendra Vijayanarasimhan. Youtube-8m: A large-scale video classification benchmark. arXiv preprint arXiv:1609.08675, 2016. 2

  4. [4]

    Frozen in time: A joint video and image encoder for end-to-end retrieval

    Max Bain, Arsha Nagrani, Gül Varol, and Andrew Zisserman. Frozen in time: A joint video and image encoder for end-to-end retrieval. InProceedings of the IEEE/CVF international conference on computer vision, pages 1728–1738, 2021. 2

  5. [5]

    Clair: Evaluating image captions with large language models

    David M Chan, Suzanne Petryk, Joseph E Gonzalez, Trevor Darrell, and John Canny. Clair: Evaluating image captions with large language models. InProceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 13638–13646, 2023. 4.2.2

  6. [6]

    Conceptual 12m: Pushing web-scale image-text pre-training to recognize long-tail visual concepts

    Soravit Changpinyo, Piyush Sharma, Nan Ding, and Radu Soricut. Conceptual 12m: Pushing web-scale image-text pre-training to recognize long-tail visual concepts. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 3558–3568, 2021. 2

  7. [7]

    ALLaVA: Harnessing GPT4V-Synthesized Data for Lite Vision-Language Models

    Guiming Hardy Chen, Shunian Chen, Ruifei Zhang, Junying Chen, Xiangbo Wu, Zhiyi Zhang, Zhihong Chen, Jianquan Li, Xiang Wan, and Benyou Wang. Allava: Harnessing gpt4v-synthesized data for lite vision-language models.arXiv preprint arXiv:2402.11684, 2024. 2

  8. [8]

    Sharegpt4v: Improving large multi-modal models with better captions

    Lin Chen, Jinsong Li, Xiaoyi Dong, Pan Zhang, Conghui He, Jiaqi Wang, Feng Zhao, and Dahua Lin. Sharegpt4v: Improving large multi-modal models with better captions. InEuropean Conference on Computer Vision, pages 370–387. Springer, 2024. 1, 2

  9. [9]

    Are we on the right way for evaluating large vision-language models?Advances in Neural Information Processing Systems, 37:27056–27087, 2024

    Lin Chen, Jinsong Li, Xiaoyi Dong, Pan Zhang, Yuhang Zang, Zehui Chen, Haodong Duan, Jiaqi Wang, Yu Qiao, Dahua Lin, et al. Are we on the right way for evaluating large vision-language models?Advances in Neural Information Processing Systems, 37:27056–27087, 2024. 4.1

  10. [10]

    Sharegpt4video: Improving video understanding and generation with better captions.Advances in Neural Information Processing Systems, 37:19472–19495, 2024

    Lin Chen, Xilin Wei, Jinsong Li, Xiaoyi Dong, Pan Zhang, Yuhang Zang, Zehui Chen, Haodong Duan, Bin Lin, Zhenyu Tang, et al. Sharegpt4video: Improving video understanding and generation with better captions.Advances in Neural Information Processing Systems, 37:19472–19495, 2024. 2, 4.3.1

  11. [11]

    Lin Chen and Long Xing. Open-llava-next: An open-source implementation of llava-next series for facilitating the large multi-modal model community.GitHub-xiaoachen98/Open-LLaVA-NeXT: Anopen- sourceimplementationfortrainingLLaVA-NeXT, 2024. 4.2.1, 4.3.1

  12. [12]

    Avocado: An audiovisual video captioner driven by temporal orchestration

    Xinlong Chen, Yue Ding, Weihong Lin, Jingyun Hua, Linli Yao, Yang Shi, Bozhou Li, Yuanxing Zhang, Qiang Liu, Pengfei Wan, et al. Avocado: An audiovisual video captioner driven by temporal orchestration. arXiv preprint arXiv:2510.10395, 2025. 3.2

  13. [13]

    SFT Memorizes, RL Generalizes: A Comparative Study of Foundation Model Post-training

    Tianzhe Chu, Yuexiang Zhai, Jihan Yang, Shengbang Tong, Saining Xie, Dale Schuurmans, Quoc V Le, Sergey Levine, and Yi Ma. SFT Memorizes, RL Generalizes: A comparative study of foundation model post-training.arXiv preprint arXiv:2501.17161, 2025. 2

  14. [14]

    Molmo2: Open Weights and Data for Vision-Language Models with Video Understanding and Grounding

    Christopher Clark, Jieyu Zhang, Zixian Ma, Jae Sung Park, Mohammadreza Salehi, Rohun Tripathi, Sangho Lee, Zhongzheng Ren, Chris Dongjoo Kim, Yinuo Yang, et al. Molmo2: Open weights and data for vision-language models with video understanding and grounding.arXiv preprint arXiv:2601.10611,

  15. [15]

    4.3.1 20 CapRL++: Unified Reinforcement Learning with Verifiable Rewards for Dense Image and Video Captioning

  16. [16]

    Long-term recurrent convolutional networks for visual recognition and description

    Jeffrey Donahue, Lisa Anne Hendricks, Sergio Guadarrama, Marcus Rohrbach, Subhashini Venugopalan, Kate Saenko, and Trevor Darrell. Long-term recurrent convolutional networks for visual recognition and description. InProceedings of the IEEE conference on computer vision and pattern recognition, pages 2625–2634, 2015. 1

  17. [17]

    Improving clip training with language rewrites.Advances in Neural Information Processing Systems, 36:35544–35575, 2023

    Lijie Fan, Dilip Krishnan, Phillip Isola, Dina Katabi, and Yonglong Tian. Improving clip training with language rewrites.Advances in Neural Information Processing Systems, 36:35544–35575, 2023. 2

  18. [18]

    Video-mme: The first-ever comprehensive evaluation benchmark of multi-modal llms in video analysis

    Chaoyou Fu, Yuhan Dai, Yongdong Luo, Lei Li, Shuhuai Ren, Renrui Zhang, Zihan Wang, Chenyu Zhou, Yunhang Shen, Mengdan Zhang, et al. Video-mme: The first-ever comprehensive evaluation benchmark of multi-modal llms in video analysis. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 24108–24118, 2025. 4.1

  19. [19]

    Tall: Temporal activity localization via language query

    Jiyang Gao, Chen Sun, Zhenheng Yang, and Ram Nevatia. Tall: Temporal activity localization via language query. InProceedings of the IEEE international conference on computer vision, pages 5267–5275,

  20. [20]

    Rubrics as Rewards: Reinforcement Learning Beyond Verifiable Domains

    Anisha Gunjal, Anthony Wang, Elaine Lau, Vaskar Nath, Bing Liu, and Sean Hendryx. Rubrics as rewards: Reinforcement learning beyond verifiable domains.arXiv preprint arXiv:2507.17746, 2025. 1, 3.1

  21. [21]

    DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning

    Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Ruoyu Zhang, Runxin Xu, Qihao Zhu, Shirong Ma, Peiyi Wang, Xiao Bi, et al. Deepseek-R1: Incentivizing reasoning capability in llms via reinforcement learning.arXiv preprint arXiv:2501.12948, 2025. 1

  22. [22]

    Learning to reason for long-form story generation.arXiv preprint arXiv:2503.22828, 2025

    Alexander Gurung and Mirella Lapata. Learning to reason for long-form story generation.arXiv preprint arXiv:2503.22828, 2025. 1

  23. [23]

    Motionbench: Benchmarking and improving fine-grained video motion understand- ing for vision language models

    Wenyi Hong, Yean Cheng, Zhuoyi Yang, Weihan Wang, Lefan Wang, Xiaotao Gu, Shiyu Huang, Yuxiao Dong, and Jie Tang. Motionbench: Benchmarking and improving fine-grained video motion understand- ing for vision language models. InProceedings of the Computer Vision and Pattern Recognition Conference, pages 8450–8460, 2025. 4.1

  24. [24]

    Gqa: A new dataset for real-world visual reasoning and compositional question answering

    Drew A Hudson and Christopher D Manning. Gqa: A new dataset for real-world visual reasoning and compositional question answering. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 6700–6709, 2019. 4.1

  25. [25]

    GPT-4o System Card

    Aaron Hurst, Adam Lerer, Adam P Goucher, Adam Perelman, Aditya Ramesh, Aidan Clark, AJ Ostrow, Akila Welihinda, Alan Hayes, Alec Radford, et al. Gpt-4o system card.arXiv preprint arXiv:2410.21276,

  26. [26]

    Deep visual-semantic alignments for generating image descriptions

    Andrej Karpathy and Li Fei-Fei. Deep visual-semantic alignments for generating image descriptions. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 3128–3137, 2015. 1

  27. [27]

    A diagram is worth a dozen images

    Aniruddha Kembhavi, Mike Salvato, Eric Kolve, Minjoon Seo, Hannaneh Hajishirzi, and Ali Farhadi. A diagram is worth a dozen images. InEuropean conference on computer vision, pages 235–251. Springer,

  28. [28]

    Dense-captioning events in videos

    Ranjay Krishna, Kenji Hata, Frederic Ren, Li Fei-Fei, and Juan Carlos Niebles. Dense-captioning events in videos. InProceedings of the IEEE international conference on computer vision, pages 706–715, 2017. 4.1

  29. [29]

    Flux.https://github.com/black-forest-labs/flux, 2024

    Black Forest Labs. Flux.https://github.com/black-forest-labs/flux, 2024. 4.4

  30. [30]

    FLUX.2: Frontier Visual Intelligence.https://bfl.ai/blog/flux-2, 2025

    Black Forest Labs. FLUX.2: Frontier Visual Intelligence.https://bfl.ai/blog/flux-2, 2025. 4.4

  31. [31]

    Tulu 3: Pushing Frontiers in Open Language Model Post-Training

    Nathan Lambert, Jacob Morrison, Valentina Pyatkin, Shengyi Huang, Hamish Ivison, Faeze Brahman, Lester James V Miranda, Alisa Liu, Nouha Dziri, Shane Lyu, et al. Tulu 3: Pushing frontiers in open language model post-training.arXiv preprint arXiv:2411.15124, 2024. 1, 2

  32. [32]

    Toward robust hyper-detailed image captioning: A multiagent approach and dual evaluation metrics for factuality and coverage.arXiv preprint arXiv:2412.15484, 2024

    Saehyung Lee, Seunghyun Yoon, Trung Bui, Jing Shi, and Sungroh Yoon. Toward robust hyper-detailed image captioning: A multiagent approach and dual evaluation metrics for factuality and coverage.arXiv preprint arXiv:2412.15484, 2024. 4.2.2 21 CapRL++: Unified Reinforcement Learning with Verifiable Rewards for Dense Image and Video Captioning

  33. [33]

    Detecting moments and highlights in videos via natural language queries.Advances in Neural Information Processing Systems, 34:11846–11858, 2021

    Jie Lei, Tamara L Berg, and Mohit Bansal. Detecting moments and highlights in videos via natural language queries.Advances in Neural Information Processing Systems, 34:11846–11858, 2021. 4.1

  34. [34]

    Otter: A multi-modal model with in-context instruction tuning.IEEE Transactions on Pattern Analysis and Machine Intelligence, 2025

    Bo Li, Yuanhan Zhang, Liangyu Chen, Jinghao Wang, Fanyi Pu, Joshua Adrian Cahyono, Jingkang Yang, Chunyuan Li, and Ziwei Liu. Otter: A multi-modal model with in-context instruction tuning.IEEE Transactions on Pattern Analysis and Machine Intelligence, 2025. 1

  35. [35]

    Seed-bench-2-plus: Bench- marking multimodal large language models with text-rich visual comprehension.arXiv preprint arXiv:2404.16790, 2024

    Bohao Li, Yuying Ge, Yi Chen, Yixiao Ge, Ruimao Zhang, and Ying Shan. Seed-bench-2-plus: Bench- marking multimodal large language models with text-rich visual comprehension.arXiv preprint arXiv:2404.16790, 2024. 4.1

  36. [36]

    SEED-Bench: Benchmarking Multimodal LLMs with Generative Comprehension

    Bohao Li, Rui Wang, Guangzhi Wang, Yuying Ge, Yixiao Ge, and Ying Shan. Seed-bench: Benchmarking multimodal llms with generative comprehension.arXiv preprint arXiv:2307.16125, 2023. 4.1

  37. [37]

    Blip: Bootstrapping language-image pre-training for unified vision-language understanding and generation

    Junnan Li, Dongxu Li, Caiming Xiong, and Steven Hoi. Blip: Bootstrapping language-image pre-training for unified vision-language understanding and generation. InInternational Conference on Machine Learning, pages 12888–12900. PMLR, 2022. 2, 4.2.1

  38. [38]

    Mvbench: A comprehensive multi-modal video understanding benchmark

    Kunchang Li, Yali Wang, Yinan He, Yizhuo Li, Yi Wang, Yi Liu, Zun Wang, Jilan Xu, Guo Chen, Ping Luo, et al. Mvbench: A comprehensive multi-modal video understanding benchmark. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 22195–22206, 2024. 4.1

  39. [39]

    Densefusion-1m: Merging vision experts for comprehensive multimodal perception.arXiv preprint arXiv:2407.08303,

    Xiaotong Li, Fan Zhang, Haiwen Diao, Yueze Wang, Xinlong Wang, and Ling-Yu Duan. Densefusion-1m: Merging vision experts for comprehensive multimodal perception.arXiv preprint arXiv:2407.08303,

  40. [40]

    Uni-moe: Scaling unified multimodal llms with mixture of experts.IEEE Transactions on Pattern Analysis and Machine Intelligence, 2025

    Yunxin Li, Shenyuan Jiang, Baotian Hu, Longyue Wang, Wanqi Zhong, Wenhan Luo, Lin Ma, and Min Zhang. Uni-moe: Scaling unified multimodal llms with mixture of experts.IEEE Transactions on Pattern Analysis and Machine Intelligence, 2025. 1

  41. [41]

    Rouge: A package for automatic evaluation of summaries

    Chin-Yew Lin. Rouge: A package for automatic evaluation of summaries. InText summarization branches out, pages 74–81, 2004. 1

  42. [42]

    Visual instruction tuning.Advances in neural information processing systems, 36:34892–34916, 2023

    Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning.Advances in neural information processing systems, 36:34892–34916, 2023. 1

  43. [43]

    Mmbench: Is your multi-modal model an all-around player? In European conference on computer vision, pages 216–233

    Yuan Liu, Haodong Duan, Yuanhan Zhang, Bo Li, Songyang Zhang, Wangbo Zhao, Yike Yuan, Jiaqi Wang, Conghui He, Ziwei Liu, et al. Mmbench: Is your multi-modal model an all-around player? In European conference on computer vision, pages 216–233. Springer, 2024. 4.1

  44. [44]

    Tempcompass: Do video llms really understand videos? InFindings of the Association for Computational Linguistics: ACL 2024, pages 8731–8772, 2024

    Yuanxin Liu, Shicheng Li, Yi Liu, Yuxiang Wang, Shuhuai Ren, Lei Li, Sishuo Chen, Xu Sun, and Lu Hou. Tempcompass: Do video llms really understand videos? InFindings of the Association for Computational Linguistics: ACL 2024, pages 8731–8772, 2024. 4.1

  45. [45]

    Zijun Liu, Peiyi Wang, Runxin Xu, Shirong Ma, Chong Ruan, Peng Li, Yang Liu, and Yu Wu

    ZijunLiu, PeiyiWang, RunxinXu, ShirongMa, ChongRuan, PengLi, YangLiu, andYuWu. Inference-time scaling for generalist reward modeling.arXiv preprint arXiv:2504.02495, 2025. 1, 3.1

  46. [46]

    Visual-RFT: Visual Reinforcement Fine-Tuning

    Ziyu Liu, Zeyi Sun, Yuhang Zang, Xiaoyi Dong, Yuhang Cao, Haodong Duan, Dahua Lin, and Jiaqi Wang. Visual-rft: Visual reinforcement fine-tuning.arXiv preprint arXiv:2503.01785, 2025. 3.1

  47. [47]

    Mathvista: Evaluating mathematical reasoning of foundation models in visual contexts

    Pan Lu, Hritik Bansal, Tony Xia, Jiacheng Liu, Chunyuan Li, Hannaneh Hajishirzi, Hao Cheng, Kai-Wei Chang, Michel Galley, and Jianfeng Gao. Mathvista: Evaluating mathematical reasoning of foundation models in visual contexts. InInternational Conference on Learning Representations, volume 2024, pages 23439–23554, 2024. 4.1

  48. [48]

    Writing-zero: Bridge the gap between non-verifiable problems and verifiable rewards.arXiv preprint arXiv:2506.00103, 2025

    Xun Lu. Writing-zero: Bridge the gap between non-verifiable problems and verifiable rewards.arXiv preprint arXiv:2506.00103, 2025. 1, 3.1

  49. [49]

    GUI-R1 : A Generalist R1-Style Vision-Language Action Model For GUI Agents

    Run Luo, Lu Wang, Wanwei He, and Xiaobo Xia. Gui-r1: A generalist r1-style vision-language action model for gui agents.arXiv preprint arXiv:2504.10458, 2025. 3.1 22 CapRL++: Unified Reinforcement Learning with Verifiable Rewards for Dense Image and Video Captioning

  50. [50]

    Robust visual question answering: Datasets, methods, and future challenges.IEEE Transactions on Pattern Analysis and Machine Intelligence, 46(8):5575–5594, 2024

    Jie Ma, Pinghui Wang, Dechen Kong, Zewei Wang, Jun Liu, Hongbin Pei, and Junzhou Zhao. Robust visual question answering: Datasets, methods, and future challenges.IEEE Transactions on Pattern Analysis and Machine Intelligence, 46(8):5575–5594, 2024. 3.3

  51. [51]

    Video-chatgpt: Towards detailed video understanding via large vision and language models

    Muhammad Maaz, Hanoona Rasheed, Salman Khan, and Fahad Khan. Video-chatgpt: Towards detailed video understanding via large vision and language models. InProceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 12585–12602, 2024. 2

  52. [52]

    Chartqa: A benchmark for question answering about charts with visual and logical reasoning

    Ahmed Masry, Xuan Long Do, Jia Qing Tan, Shafiq Joty, and Enamul Hoque. Chartqa: A benchmark for question answering about charts with visual and logical reasoning. InFindings of the association for computational linguistics: ACL 2022, pages 2263–2279, 2022. 4.1

  53. [53]

    Chartqapro: A more diverse and challenging benchmark for chart question answering

    Ahmed Masry, Mohammed Saidul Islam, Mahir Ahmed, Aayush Bajaj, Firoz Kabir, Aaryaman Kartha, Md Tahmid Rahman Laskar, Mizanur Rahman, Shadikur Rahman, Mehrad Shahmohammadi, et al. Chartqapro: A more diverse and challenging benchmark for chart question answering. InFindings of the Association for Computational Linguistics: ACL 2025, pages 19123–19151, 2025. 4.1

  54. [54]

    Infographicvqa

    Minesh Mathew, Viraj Bagal, Rubèn Tito, Dimosthenis Karatzas, Ernest Valveny, and CV Jawahar. Infographicvqa. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pages 1697–1706, 2022. 4.1

  55. [55]

    Docvqa: A dataset for vqa on document images

    Minesh Mathew, Dimosthenis Karatzas, and CV Jawahar. Docvqa: A dataset for vqa on document images. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pages 2200–2209,

  56. [56]

    Howto100m: Learning a text-video embedding by watching hundred million narrated video clips

    Antoine Miech, Dimitri Zhukov, Jean-Baptiste Alayrac, Makarand Tapaswi, Ivan Laptev, and Josef Sivic. Howto100m: Learning a text-video embedding by watching hundred million narrated video clips. In Proceedings of the IEEE/CVF international conference on computer vision, pages 2630–2640, 2019. 2

  57. [57]

    GPT-4V(ision) System Card

    OpenAI. GPT-4V(ision) System Card. https://cdn.openai.com/papers/GPTV_System_Card.pdf, 2023. Accessed: 2026-05-08. 2

  58. [58]

    Bleu: a method for automatic evaluation of machine translation

    Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. Bleu: a method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, pages 311–318, 2002. 1

  59. [59]

    Aloha: A new measure for hallucination in captioning models

    Suzanne Petryk, David Chan, Anish Kachinthaya, Haodi Zou, John Canny, Joseph Gonzalez, and Trevor Darrell. Aloha: A new measure for hallucination in captioning models. In Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 2: Short Papers), pages 342–357, 202...

  60. [60]

    Runqi Qiao, Qiuna Tan, Guanting Dong, Minhui Wu, Chong Sun, Xiaoshuai Song, Jiapeng Wang, Zhuoma Gongque, Shanglin Lei, Yifan Zhang, et al. We-math: Does your large multimodal model achieve human-like mathematical reasoning? In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 200...

  61. [61]

    Prism: A framework for decoupling and assessing the capabilities of vlms

    Yuxuan Qiao, Haodong Duan, Xinyu Fang, Junming Yang, Lin Chen, Songyang Zhang, Jiaqi Wang, Dahua Lin, and Kai Chen. Prism: A framework for decoupling and assessing the capabilities of vlms. Advances in Neural Information Processing Systems, 37:111863–111898, 2024. 4.1, 4.2.2

  62. [62]

    Learning transferable visual models from natural language supervision

    Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. In International Conference on Machine Learning, pages 8748–8763. PMLR,

  63. [63]

    Fusecap: Leveraging large language models for enriched fused image captions

    Noam Rotstein, David Bensaid, Shaked Brody, Roy Ganz, and Ron Kimmel. Fusecap: Leveraging large language models for enriched fused image captions. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pages 5689–5700, 2024. 1

  64. [64]

    Laion-5b: An open large-scale dataset for training next generation image-text models. Advances in Neural Information Processing Systems, 35:25278–25294, 2022

    Christoph Schuhmann, Romain Beaumont, Richard Vencu, Cade Gordon, Ross Wightman, Mehdi Cherti, Theo Coombes, Aarush Katta, Clayton Mullis, Mitchell Wortsman, et al. Laion-5b: An open large-scale dataset for training next generation image-text models. Advances in Neural Information Processing Systems, 35:25278–25294, 2022. 2

  65. [65]

    Tomato: Assessing visual temporal reasoning capabilities in multimodal foundation models

    Ziyao Shangguan, Chuhan Li, Yuxuan Ding, Yanan Zheng, Yilun Zhao, Tesca Fitzgerald, and Arman Cohan. Tomato: Assessing visual temporal reasoning capabilities in multimodal foundation models. In International Conference on Learning Representations, volume 2025, pages 7593–7734, 2025. 4.1

  66. [66]

    DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models

    Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, YK Li, Yang Wu, et al. DeepSeekMath: Pushing the limits of mathematical reasoning in open language models. arXiv preprint arXiv:2402.03300, 2024. 2, 3.1

  67. [67]

    From show to tell: A survey on deep learning-based image captioning. IEEE Transactions on Pattern Analysis and Machine Intelligence, 45(1):539–559, 2022

    Matteo Stefanini, Marcella Cornia, Lorenzo Baraldi, Silvia Cascianelli, Giuseppe Fiameni, and Rita Cucchiara. From show to tell: A survey on deep learning-based image captioning. IEEE Transactions on Pattern Analysis and Machine Intelligence, 45(1):539–559, 2022. 1

  68. [68]

    Crossing the reward bridge: Expanding rl with verifiable rewards across diverse domains. arXiv preprint arXiv:2503.23829, 2025

    Yi Su, Dian Yu, Linfeng Song, Juntao Li, Haitao Mi, Zhaopeng Tu, Min Zhang, and Dong Yu. Crossing the reward bridge: Expanding rl with verifiable rewards across diverse domains. arXiv preprint arXiv:2503.23829, 2025. 1, 3.1

  69. [69]

    Descriptive caption enhancement with visual specialists for multimodal perception

    Yanpeng Sun, Jing Hao, Ke Zhu, Jiang-Jiang Liu, Yuxiang Zhao, Xiaofan Li, Gang Zhang, Zechao Li, and Jingdong Wang. Descriptive caption enhancement with visual specialists for multimodal perception. arXiv preprint arXiv:2412.14233, 2024. 2

  70. [70]

    Kimi Team, Angang Du, Bofei Gao, Bowei Xing, Changjiu Jiang, Cheng Chen, Cheng Li, Chenjun Xiao, Chenzhuang Du, Chonghua Liao, et al. Kimi k1.5: Scaling reinforcement learning with llms. arXiv preprint arXiv:2501.12599, 2025. 2

  71. [71]

    Yfcc100m: The new data in multimedia research. Communications of the ACM, 59(2):64–73, 2016

    Bart Thomee, David A Shamma, Gerald Friedland, Benjamin Elizalde, Karl Ni, Douglas Poland, Damian Borth, and Li-Jia Li. Yfcc100m: The new data in multimedia research. Communications of the ACM, 59(2):64–73, 2016. 2

  72. [72]

    Fastvlm: Efficient vision encoding for vision language models

    Pavan Kumar Anasosalu Vasu, Fartash Faghri, Chun-Liang Li, Cem Koc, Nate True, Albert Antony, Gokula Santhanam, James Gabriel, Peter Grasch, Oncel Tuzel, et al. Fastvlm: Efficient vision encoding for vision language models. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 19769–19780, 2025. 1

  73. [73]

    Sequence to sequence-video to text

    Subhashini Venugopalan, Marcus Rohrbach, Jeffrey Donahue, Raymond Mooney, Trevor Darrell, and Kate Saenko. Sequence to sequence-video to text. In Proceedings of the IEEE International Conference on Computer Vision, pages 4534–4542, 2015. 1

  74. [74]

    Show and tell: A neural image caption generator

    Oriol Vinyals, Alexander Toshev, Samy Bengio, and Dumitru Erhan. Show and tell: A neural image caption generator. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3156–3164, 2015

  75. [75]

    Show and tell: Lessons learned from the 2015 mscoco image captioning challenge. IEEE Transactions on Pattern Analysis and Machine Intelligence, 39(4):652–663, 2016

    Oriol Vinyals, Alexander Toshev, Samy Bengio, and Dumitru Erhan. Show and tell: Lessons learned from the 2015 mscoco image captioning challenge. IEEE Transactions on Pattern Analysis and Machine Intelligence, 39(4):652–663, 2016. 1

  76. [76]

    Tarsier: Recipes for training and evaluating large video description models

    Jiawei Wang, Liping Yuan, Yuchen Zhang, and Haomiao Sun. Tarsier: Recipes for training and evaluating large video description models. arXiv preprint arXiv:2407.00634, 2024. 4.1, 4.3.2

  77. [77]

    Measuring multimodal mathematical reasoning with math-vision dataset. Advances in Neural Information Processing Systems, 37:95095–95169, 2024

    Ke Wang, Junting Pan, Weikang Shi, Zimu Lu, Houxing Ren, Aojun Zhou, Mingjie Zhan, and Hongsheng Li. Measuring multimodal mathematical reasoning with math-vision dataset. Advances in Neural Information Processing Systems, 37:95095–95169, 2024. 4.1

  78. [78]

    On diversity in image captioning: Metrics and methods

    Qingzhong Wang, Jia Wan, and Antoni B Chan. On diversity in image captioning: Metrics and methods. IEEE Transactions on Pattern Analysis and Machine Intelligence, 44(2):1035–1049, 2020. 1

  79. [79]

    Video dataflywheel: Resolving the impossible data trinity in video-language understanding. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2025

    Xiao Wang, Jianlong Wu, Zijia Lin, Fuzheng Zhang, Di Zhang, and Liqiang Nie. Video dataflywheel: Resolving the impossible data trinity in video-language understanding. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2025. 2

  80. [80]

    InternVideo: General Video Foundation Models via Generative and Discriminative Learning

    Yi Wang, Kunchang Li, Yizhuo Li, Yinan He, Bingkun Huang, Zhiyu Zhao, Hongjie Zhang, Jilan Xu, Yi Liu, Zun Wang, et al. Internvideo: General video foundation models via generative and discriminative learning. arXiv preprint arXiv:2212.03191, 2022. 2

Showing first 80 references.