CapRL++: Unified Reinforcement Learning with Verifiable Rewards for Dense Image and Video Captioning
Pith reviewed 2026-06-27 17:19 UTC · model grok-4.3
The pith
CapRL++ uses a text-only LLM's accuracy on questions about a caption as the reward signal to train vision-language models for better dense image and video descriptions.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
CapRL++ is a decoupled two-stage RL pipeline in which an LVLM first produces a caption and a vision-free LLM then answers multiple-choice questions about the visual scene using only that caption; the accuracy of those answers supplies the verifiable reward. Training with this signal yields captions that support stronger performance on more than twenty image and video benchmarks, strengthen caption-based pretraining, and let compact models match the dense-captioning results of Qwen2.5-VL-72B and Qwen3-VL-235B-A22B inside the Prism evaluation framework.
What carries the argument
The CapRL++ two-stage pipeline that converts caption utility, measured as vision-free LLM accuracy on MCQs, into a scalar reward for reinforcement learning.
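The reward described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the `MCQ` container, the `Answerer` signature, and the keyword-matching `keyword_answerer` (a toy stand-in for the vision-free LLM) are all assumptions made here.

```python
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass(frozen=True)
class MCQ:
    """Hypothetical container for one multiple-choice question."""
    question: str
    options: Sequence[str]  # answer choices shown to the answerer
    answer: int             # index of the correct option


# Hypothetical interface: (caption, question, options) -> chosen option index.
Answerer = Callable[[str, str, Sequence[str]], int]


def caption_reward(caption: str, mcqs: Sequence[MCQ], answer_fn: Answerer) -> float:
    """Fraction of MCQs answered correctly when only the caption is visible."""
    if not mcqs:
        return 0.0
    correct = sum(answer_fn(caption, q.question, q.options) == q.answer for q in mcqs)
    return correct / len(mcqs)


def keyword_answerer(caption: str, question: str, options: Sequence[str]) -> int:
    """Toy stand-in for the vision-free LLM: pick the first option named in the caption."""
    text = caption.lower()
    for i, opt in enumerate(options):
        if opt.lower() in text:
            return i
    return 0  # fall back to the first option when nothing matches


mcqs = [
    MCQ("What animal is on the sofa?", ["dog", "cat", "bird"], 1),
    MCQ("What color is the sofa?", ["green", "blue", "red"], 2),
]
rich = caption_reward("A grey cat sleeps on a red sofa.", mcqs, keyword_answerer)
sparse = caption_reward("An animal on furniture.", mcqs, keyword_answerer)
```

In this toy setup the detailed caption earns a higher reward than the vague one, which is the behavior the scalar reward is meant to encourage during RL.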
If this is right
- Compact models trained with CapRL++ reach dense-captioning performance comparable to models with 72B and 235B parameters.
- Pretraining on image and video datasets annotated by CapRL++ produces substantial gains on spatial and temporal understanding benchmarks.
- Caption quality is redefined as the ability of the text to enable accurate question answering without visual input.
- The method removes dependence on expensive human reference annotations for open-ended captioning.
- RLVR can be applied directly to multimodal captioning without reference captions.
Where Pith is reading between the lines
- If the MCQ proxy remains stable across domains, the same reward signal could generate training data for other open-ended visual generation tasks.
- The approach might reduce the need for ever-larger model sizes by improving data efficiency instead.
- One could test whether captions produced this way transfer to entirely new visual-question-answering datasets never seen during training.
- For video, MCQs that specifically target temporal order could further strengthen the long-range understanding claims.
Load-bearing premise
The accuracy of answers from a separate vision-free LLM on multiple-choice questions is a reliable and unbiased proxy for the quality and utility of the generated dense caption.
What would settle it
Train a model with CapRL++ and a control model with standard SFT on the same data, then measure whether the CapRL++ captions produce measurably higher accuracy when a vision-free LLM answers held-out questions or when the captions are used for downstream spatial-temporal tasks; equal or lower performance would falsify the claim.
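One way to score the comparison proposed above is a paired test on per-question correctness. The sketch below uses an exact McNemar-style test on discordant pairs; the choice of test and the example data are assumptions made here for illustration and are not taken from the paper.

```python
from math import comb
from typing import Sequence, Tuple


def mcnemar_exact(correct_a: Sequence[bool], correct_b: Sequence[bool]) -> Tuple[float, float]:
    """Return (accuracy gap a - b, two-sided exact p-value on discordant pairs)."""
    if len(correct_a) != len(correct_b) or not correct_a:
        raise ValueError("need two non-empty, equal-length result lists")
    a_only = sum(a and not b for a, b in zip(correct_a, correct_b))
    b_only = sum(b and not a for a, b in zip(correct_a, correct_b))
    gap = (sum(correct_a) - sum(correct_b)) / len(correct_a)
    n = a_only + b_only
    if n == 0:
        return gap, 1.0  # the two systems never disagree
    k = min(a_only, b_only)
    # Binomial(n, 0.5) tail probability of a split at least this lopsided.
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return gap, min(1.0, 2 * tail)


# Hypothetical per-question outcomes on the same held-out MCQs.
caprl_correct = [True] * 8 + [False] * 2
sft_correct = [True] * 3 + [False] * 5 + [True, False]
gap, p_value = mcnemar_exact(caprl_correct, sft_correct)
```

A positive gap with a small p-value would support the claim; a gap near zero or below it would count against it.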
Original abstract
Image and video captioning are fundamental tasks that bridge the visual and linguistic domains, playing a critical role in pre-training Large Vision-Language Models (LVLMs). Current state-of-the-art captioning models are typically trained with Supervised Fine-Tuning (SFT), a paradigm that relies on expensive, non-scalable annotations and often causes models to memorize specific ground-truth answers, limiting their generality and ability to generate diverse, creative descriptions. To overcome these limitations, we propose applying Reinforcement Learning with Verifiable Rewards (RLVR) to the open-ended task of multimodal captioning. We introduce Captioning Reinforcement Learning++ (CapRL++), a novel reference-free training framework that redefines caption quality through its utility: a high-quality caption should enable a non-visual language model to accurately answer questions about the corresponding visual content. CapRL++ employs a decoupled two-stage pipeline where an LVLM generates a caption, and the objective reward is derived from the accuracy of a separate, vision-free LLM answering Multiple-Choice Questions based solely on that caption. Evaluations on more than 20 image and video benchmarks show that CapRL++ improves dense caption quality and strengthens caption-based pretraining across tasks such as spatial and temporal understanding. Pretraining on scalable image and video caption datasets annotated by CapRL++ yields substantial downstream gains. Furthermore, within the Prism Framework for caption quality evaluation, compact models trained with CapRL++ achieve dense captioning performance comparable to substantially larger models such as Qwen2.5-VL-72B and Qwen3-VL-235B-A22B. These results validate that CapRL++ effectively trains models to produce generalizable, high-fidelity descriptions, establishing a robust foundation beyond the limitations of traditional SFT.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces CapRL++, a reference-free RL framework for dense image and video captioning. An LVLM generates a caption and receives a reward equal to the accuracy of a separate vision-free LLM that answers MCQs about the visual content from that caption alone. The authors report gains on more than 20 benchmarks and improved caption-based pretraining, and they claim that compact models trained with CapRL++ reach dense-captioning performance comparable to Qwen2.5-VL-72B and Qwen3-VL-235B-A22B.
Significance. If the MCQ-based proxy is shown to be a faithful measure of caption utility, the method supplies a scalable, annotation-free alternative to SFT and could materially improve LVLM pretraining. The reported parity between compact CapRL++ models and substantially larger models such as Qwen2.5-VL-72B and Qwen3-VL-235B-A22B would be a notable empirical result.
major comments (3)
- [Abstract, §4] Abstract and §4 (Experiments): the central claim that CapRL++ yields captions whose utility matches much larger models rests entirely on the MCQ accuracy proxy, yet no correlation is reported between this proxy and either reference-based metrics (CIDEr, SPICE) or human judgments of caption quality. Without such anchoring data the mapping from reward to claimed caption fidelity remains unverified.
- [§3.2] §3.2 (Reward formulation): the two-stage pipeline decouples caption generation from the reward LLM; no ablation is presented that tests whether accuracy gains arise from richer visual descriptions or from the caption exploiting the reward LLM’s textual priors on the chosen MCQs. This directly affects the claim that the reward is “verifiable” and unbiased.
- [Table 2, §5.1] Table 2 and §5.1 (Downstream pretraining): the reported gains on spatial/temporal understanding tasks are attributed to CapRL++ captions, but the manuscript does not control for the choice of MCQ generator or the specific vision-free LLM, leaving open the possibility that results are tied to a particular external model rather than the captioning objective itself.
minor comments (2)
- [§3.1] Notation for the reward function r(c) is introduced without an explicit equation number; adding Eq. (X) would improve readability.
- [Figure 3] Figure 3 caption does not state the number of MCQs per image/video or the source of the question pool.
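For reference, one way the reward could be written as a numbered equation, consistent with the description in the abstract. The symbols $Q$, $q_i$, $a_i$, and $f_{\mathrm{LLM}}$ are assumptions introduced here for illustration and may differ from the paper's notation:

```latex
r(c) \;=\; \frac{1}{|Q|} \sum_{(q_i,\, a_i) \in Q}
  \mathbb{1}\!\left[ f_{\mathrm{LLM}}(q_i, c) = a_i \right]
```

Here $c$ is the generated caption, $Q$ is the set of MCQs for the image or video, $a_i$ is the correct option for question $q_i$, and $f_{\mathrm{LLM}}(q_i, c)$ is the option chosen by the vision-free LLM when it sees only the question and the caption.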
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed feedback. We address each major comment below with clarifications on our design choices and indicate revisions that will be incorporated to strengthen the validation of the MCQ proxy and related claims.
-
Referee: [Abstract, §4] Abstract and §4 (Experiments): the central claim that CapRL++ yields captions whose utility matches much larger models rests entirely on the MCQ accuracy proxy, yet no correlation is reported between this proxy and either reference-based metrics (CIDEr, SPICE) or human judgments of caption quality. Without such anchoring data the mapping from reward to claimed caption fidelity remains unverified.
Authors: We acknowledge that the current manuscript does not include explicit correlation coefficients between MCQ accuracy and reference-based metrics or human judgments. The primary validation instead comes from consistent improvements across more than 20 downstream benchmarks and caption-based pretraining gains, which demonstrate the practical utility of the proxy. To directly address the anchoring concern, we will add a new analysis subsection in §4 that computes and reports Pearson/Spearman correlations with CIDEr and SPICE on held-out data, plus agreement rates with human ratings on a 500-caption subset. This will be included in the revised manuscript.
revision: yes
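The correlation analysis promised above could look roughly like the following. This is a dependency-free sketch: the per-caption scores are made-up placeholders, and a real analysis would more likely use an established statistics library.

```python
from math import sqrt
from typing import List, Sequence


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation coefficient of two equal-length samples."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two samples of equal length >= 2")
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / sqrt(vx * vy)


def average_ranks(values: Sequence[float]) -> List[float]:
    """1-based ranks, with tied values sharing the mean of their positions."""
    order = sorted(range(len(values)), key=values.__getitem__)
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def spearman(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman rank correlation: Pearson correlation of the average ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical per-caption scores: MCQ accuracy vs. a 1-5 human rating.
mcq_accuracy = [0.2, 0.4, 0.6, 0.8, 1.0]
human_rating = [1, 2, 2, 4, 5]
rho = spearman(mcq_accuracy, human_rating)
r = pearson(mcq_accuracy, human_rating)
```

Reporting both coefficients is useful because Spearman only requires a monotone relationship, while Pearson also rewards linearity.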
-
Referee: [§3.2] §3.2 (Reward formulation): the two-stage pipeline decouples caption generation from the reward LLM; no ablation is presented that tests whether accuracy gains arise from richer visual descriptions or from the caption exploiting the reward LLM’s textual priors on the chosen MCQs. This directly affects the claim that the reward is “verifiable” and unbiased.
Authors: The two-stage decoupling is a deliberate design choice to produce a reference-free, verifiable reward that depends only on caption utility for a vision-free LLM. Dynamic MCQ generation from each caption further reduces the chance of static prior exploitation. While the manuscript does not contain an explicit ablation isolating textual priors, the broad gains on spatial/temporal tasks across benchmarks provide supporting evidence that visual content is being captured. We will add a targeted ablation in the revision that re-uses identical MCQ sets across varied captions to quantify the contribution of description richness versus prior matching.
revision: yes
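A simple related check, which is not the ablation the authors describe, is a caption-blind baseline: answer the same MCQs with an empty caption and subtract that accuracy from the accuracy obtained with the caption. The sketch below assumes a hypothetical answerer interface and uses a toy answerer with a hard-coded prior ("skies are blue") in place of a real LLM.

```python
from typing import Callable, Sequence, Tuple

# Hypothetical MCQ format: (question, options, index of the correct option).
Question = Tuple[str, Sequence[str], int]
Answerer = Callable[[str, str, Sequence[str]], int]


def accuracy(caption: str, mcqs: Sequence[Question], answer_fn: Answerer) -> float:
    """Share of MCQs answered correctly given the caption text."""
    hits = sum(answer_fn(caption, q, opts) == ans for q, opts, ans in mcqs)
    return hits / len(mcqs)


def prior_adjusted_gain(caption: str, mcqs: Sequence[Question], answer_fn: Answerer) -> float:
    """Accuracy with the caption minus accuracy with an empty caption."""
    return accuracy(caption, mcqs, answer_fn) - accuracy("", mcqs, answer_fn)


def toy_answerer(caption: str, question: str, options: Sequence[str]) -> int:
    """Toy stand-in for an LLM: use the caption if possible, else guess from a prior."""
    text = caption.lower()
    for i, opt in enumerate(options):
        if opt in text:
            return i
    # Prior-only guess: this toy model assumes the answer is "blue" when offered.
    return list(options).index("blue") if "blue" in options else 0


mcqs = [
    ("What color is the sky?", ("blue", "orange", "grey"), 0),
    ("What color is the car?", ("white", "black", "yellow"), 2),
]
caption = "A yellow car under a blue sky."
with_caption = accuracy(caption, mcqs, toy_answerer)
blind = accuracy("", mcqs, toy_answerer)
gain = prior_adjusted_gain(caption, mcqs, toy_answerer)
```

Questions that the blind baseline already answers correctly contribute nothing to the gain, so a large gap between raw accuracy and the gain would signal that the reward is partly measuring textual priors rather than caption content.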
-
Referee: [Table 2, §5.1] Table 2 and §5.1 (Downstream pretraining): the reported gains on spatial/temporal understanding tasks are attributed to CapRL++ captions, but the manuscript does not control for the choice of MCQ generator or the specific vision-free LLM, leaving open the possibility that results are tied to a particular external model rather than the captioning objective itself.
Authors: The reported experiments employ a reproducible MCQ generation procedure and a fixed, publicly documented vision-free LLM to ensure the framework is not tied to proprietary components. The improvements appear consistently across diverse tasks, suggesting the objective itself drives the gains. To strengthen this, the revision will include additional controls using an alternative vision-free LLM (e.g., a Llama-3 variant) and a second MCQ generator, with results reported alongside the original Table 2 to demonstrate robustness independent of the specific external model.
revision: yes
Circularity Check
No significant circularity in CapRL++ derivation chain
The paper defines a reward signal via an independent external component (the accuracy of a vision-free LLM on MCQs, answered from the LVLM caption alone) and applies RL to maximize it. This target is distinct from the downstream evaluation benchmarks (20+ image/video tasks) used to claim improvements and comparable performance to larger models. No equations or steps reduce the claimed results to the inputs by construction, no self-citations are load-bearing for uniqueness or ansatzes, and no fitted parameters are relabeled as predictions. The framework is self-contained against external benchmarks.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: Accuracy of a vision-free LLM on MCQs answered from the caption alone is a valid and generalizable measure of caption quality.
Forward citations
Cited by 1 Pith paper
-
Beyond the Current Observation: Evaluating Multimodal Large Language Models in Controllable Non-Markov Games
RNG-Bench evaluates MLLMs on hidden-observation reconstruction in non-Markov games, finds forgetting as the dominant error source, and shows fine-tuning on optimal rollouts improves performance with transfer to other ...
Reference graph
Works this paper leans on
-
[1]
SemDeDup: Data-efficient learning at web-scale through semantic deduplication
Amro Abbas, Kushal Tirumala, Dániel Simig, Surya Ganguli, and Ari S Morcos. Semdedup: Data-efficient learning at web-scale through semantic deduplication.arXiv preprint arXiv:2303.09540, 2023. 3.5
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[2]
A review of deep learning for video captioning.IEEE Transactions on Pattern Analysis and Machine Intelligence, 2024
Moloud Abdar, Meenakshi Kollati, Swaraja Kuraparthi, Farhad Pourpanah, Daniel McDuff, Mohammad Ghavamzadeh, Shuicheng Yan, Abduallah Mohamed, Abbas Khosravi, Erik Cambria, et al. A review of deep learning for video captioning.IEEE Transactions on Pattern Analysis and Machine Intelligence, 2024. 2
2024
-
[3]
YouTube-8M: A Large-Scale Video Classification Benchmark
Sami Abu-El-Haija, Nisarg Kothari, Joonseok Lee, Paul Natsev, George Toderici, Balakrishnan Varadara- jan, and Sudheendra Vijayanarasimhan. Youtube-8m: A large-scale video classification benchmark. arXiv preprint arXiv:1609.08675, 2016. 2
work page internal anchor Pith review Pith/arXiv arXiv 2016
-
[4]
Frozen in time: A joint video and image encoder for end-to-end retrieval
Max Bain, Arsha Nagrani, Gül Varol, and Andrew Zisserman. Frozen in time: A joint video and image encoder for end-to-end retrieval. InProceedings of the IEEE/CVF international conference on computer vision, pages 1728–1738, 2021. 2
2021
-
[5]
Clair: Evaluating image captions with large language models
David M Chan, Suzanne Petryk, Joseph E Gonzalez, Trevor Darrell, and John Canny. Clair: Evaluating image captions with large language models. InProceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 13638–13646, 2023. 4.2.2
2023
-
[6]
Conceptual 12m: Pushing web-scale image-text pre-training to recognize long-tail visual concepts
Soravit Changpinyo, Piyush Sharma, Nan Ding, and Radu Soricut. Conceptual 12m: Pushing web-scale image-text pre-training to recognize long-tail visual concepts. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 3558–3568, 2021. 2
2021
-
[7]
ALLaVA: Harnessing GPT4V-Synthesized Data for Lite Vision-Language Models
Guiming Hardy Chen, Shunian Chen, Ruifei Zhang, Junying Chen, Xiangbo Wu, Zhiyi Zhang, Zhihong Chen, Jianquan Li, Xiang Wan, and Benyou Wang. Allava: Harnessing gpt4v-synthesized data for lite vision-language models.arXiv preprint arXiv:2402.11684, 2024. 2
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[8]
Sharegpt4v: Improving large multi-modal models with better captions
Lin Chen, Jinsong Li, Xiaoyi Dong, Pan Zhang, Conghui He, Jiaqi Wang, Feng Zhao, and Dahua Lin. Sharegpt4v: Improving large multi-modal models with better captions. InEuropean Conference on Computer Vision, pages 370–387. Springer, 2024. 1, 2
2024
-
[9]
Are we on the right way for evaluating large vision-language models?Advances in Neural Information Processing Systems, 37:27056–27087, 2024
Lin Chen, Jinsong Li, Xiaoyi Dong, Pan Zhang, Yuhang Zang, Zehui Chen, Haodong Duan, Jiaqi Wang, Yu Qiao, Dahua Lin, et al. Are we on the right way for evaluating large vision-language models?Advances in Neural Information Processing Systems, 37:27056–27087, 2024. 4.1
2024
-
[10]
Sharegpt4video: Improving video understanding and generation with better captions.Advances in Neural Information Processing Systems, 37:19472–19495, 2024
Lin Chen, Xilin Wei, Jinsong Li, Xiaoyi Dong, Pan Zhang, Yuhang Zang, Zehui Chen, Haodong Duan, Bin Lin, Zhenyu Tang, et al. Sharegpt4video: Improving video understanding and generation with better captions.Advances in Neural Information Processing Systems, 37:19472–19495, 2024. 2, 4.3.1
2024
-
[11]
Lin Chen and Long Xing. Open-llava-next: An open-source implementation of llava-next series for facilitating the large multi-modal model community.GitHub-xiaoachen98/Open-LLaVA-NeXT: Anopen- sourceimplementationfortrainingLLaVA-NeXT, 2024. 4.2.1, 4.3.1
2024
-
[12]
Avocado: An audiovisual video captioner driven by temporal orchestration
Xinlong Chen, Yue Ding, Weihong Lin, Jingyun Hua, Linli Yao, Yang Shi, Bozhou Li, Yuanxing Zhang, Qiang Liu, Pengfei Wan, et al. Avocado: An audiovisual video captioner driven by temporal orchestration. arXiv preprint arXiv:2510.10395, 2025. 3.2
-
[13]
SFT Memorizes, RL Generalizes: A Comparative Study of Foundation Model Post-training
Tianzhe Chu, Yuexiang Zhai, Jihan Yang, Shengbang Tong, Saining Xie, Dale Schuurmans, Quoc V Le, Sergey Levine, and Yi Ma. SFT Memorizes, RL Generalizes: A comparative study of foundation model post-training.arXiv preprint arXiv:2501.17161, 2025. 2
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[14]
Molmo2: Open Weights and Data for Vision-Language Models with Video Understanding and Grounding
Christopher Clark, Jieyu Zhang, Zixian Ma, Jae Sung Park, Mohammadreza Salehi, Rohun Tripathi, Sangho Lee, Zhongzheng Ren, Chris Dongjoo Kim, Yinuo Yang, et al. Molmo2: Open weights and data for vision-language models with video understanding and grounding.arXiv preprint arXiv:2601.10611,
work page internal anchor Pith review Pith/arXiv arXiv
-
[15]
4.3.1 20 CapRL++: Unified Reinforcement Learning with Verifiable Rewards for Dense Image and Video Captioning
-
[16]
Long-term recurrent convolutional networks for visual recognition and description
Jeffrey Donahue, Lisa Anne Hendricks, Sergio Guadarrama, Marcus Rohrbach, Subhashini Venugopalan, Kate Saenko, and Trevor Darrell. Long-term recurrent convolutional networks for visual recognition and description. InProceedings of the IEEE conference on computer vision and pattern recognition, pages 2625–2634, 2015. 1
2015
-
[17]
Improving clip training with language rewrites.Advances in Neural Information Processing Systems, 36:35544–35575, 2023
Lijie Fan, Dilip Krishnan, Phillip Isola, Dina Katabi, and Yonglong Tian. Improving clip training with language rewrites.Advances in Neural Information Processing Systems, 36:35544–35575, 2023. 2
2023
-
[18]
Video-mme: The first-ever comprehensive evaluation benchmark of multi-modal llms in video analysis
Chaoyou Fu, Yuhan Dai, Yongdong Luo, Lei Li, Shuhuai Ren, Renrui Zhang, Zihan Wang, Chenyu Zhou, Yunhang Shen, Mengdan Zhang, et al. Video-mme: The first-ever comprehensive evaluation benchmark of multi-modal llms in video analysis. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 24108–24118, 2025. 4.1
2025
-
[19]
Tall: Temporal activity localization via language query
Jiyang Gao, Chen Sun, Zhenheng Yang, and Ram Nevatia. Tall: Temporal activity localization via language query. InProceedings of the IEEE international conference on computer vision, pages 5267–5275,
-
[20]
Rubrics as Rewards: Reinforcement Learning Beyond Verifiable Domains
Anisha Gunjal, Anthony Wang, Elaine Lau, Vaskar Nath, Bing Liu, and Sean Hendryx. Rubrics as rewards: Reinforcement learning beyond verifiable domains.arXiv preprint arXiv:2507.17746, 2025. 1, 3.1
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[21]
DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning
Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Ruoyu Zhang, Runxin Xu, Qihao Zhu, Shirong Ma, Peiyi Wang, Xiao Bi, et al. Deepseek-R1: Incentivizing reasoning capability in llms via reinforcement learning.arXiv preprint arXiv:2501.12948, 2025. 1
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[22]
Learning to reason for long-form story generation.arXiv preprint arXiv:2503.22828, 2025
Alexander Gurung and Mirella Lapata. Learning to reason for long-form story generation.arXiv preprint arXiv:2503.22828, 2025. 1
-
[23]
Motionbench: Benchmarking and improving fine-grained video motion understand- ing for vision language models
Wenyi Hong, Yean Cheng, Zhuoyi Yang, Weihan Wang, Lefan Wang, Xiaotao Gu, Shiyu Huang, Yuxiao Dong, and Jie Tang. Motionbench: Benchmarking and improving fine-grained video motion understand- ing for vision language models. InProceedings of the Computer Vision and Pattern Recognition Conference, pages 8450–8460, 2025. 4.1
2025
-
[24]
Gqa: A new dataset for real-world visual reasoning and compositional question answering
Drew A Hudson and Christopher D Manning. Gqa: A new dataset for real-world visual reasoning and compositional question answering. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 6700–6709, 2019. 4.1
2019
-
[25]
Aaron Hurst, Adam Lerer, Adam P Goucher, Adam Perelman, Aditya Ramesh, Aidan Clark, AJ Ostrow, Akila Welihinda, Alan Hayes, Alec Radford, et al. Gpt-4o system card.arXiv preprint arXiv:2410.21276,
work page internal anchor Pith review Pith/arXiv arXiv
-
[26]
Deep visual-semantic alignments for generating image descriptions
Andrej Karpathy and Li Fei-Fei. Deep visual-semantic alignments for generating image descriptions. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 3128–3137, 2015. 1
2015
-
[27]
A diagram is worth a dozen images
Aniruddha Kembhavi, Mike Salvato, Eric Kolve, Minjoon Seo, Hannaneh Hajishirzi, and Ali Farhadi. A diagram is worth a dozen images. InEuropean conference on computer vision, pages 235–251. Springer,
-
[28]
Dense-captioning events in videos
Ranjay Krishna, Kenji Hata, Frederic Ren, Li Fei-Fei, and Juan Carlos Niebles. Dense-captioning events in videos. InProceedings of the IEEE international conference on computer vision, pages 706–715, 2017. 4.1
2017
-
[29]
Flux.https://github.com/black-forest-labs/flux, 2024
Black Forest Labs. Flux.https://github.com/black-forest-labs/flux, 2024. 4.4
2024
-
[30]
FLUX.2: Frontier Visual Intelligence.https://bfl.ai/blog/flux-2, 2025
Black Forest Labs. FLUX.2: Frontier Visual Intelligence.https://bfl.ai/blog/flux-2, 2025. 4.4
2025
-
[31]
Tulu 3: Pushing Frontiers in Open Language Model Post-Training
Nathan Lambert, Jacob Morrison, Valentina Pyatkin, Shengyi Huang, Hamish Ivison, Faeze Brahman, Lester James V Miranda, Alisa Liu, Nouha Dziri, Shane Lyu, et al. Tulu 3: Pushing frontiers in open language model post-training.arXiv preprint arXiv:2411.15124, 2024. 1, 2
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[32]
Saehyung Lee, Seunghyun Yoon, Trung Bui, Jing Shi, and Sungroh Yoon. Toward robust hyper-detailed image captioning: A multiagent approach and dual evaluation metrics for factuality and coverage.arXiv preprint arXiv:2412.15484, 2024. 4.2.2 21 CapRL++: Unified Reinforcement Learning with Verifiable Rewards for Dense Image and Video Captioning
-
[33]
Detecting moments and highlights in videos via natural language queries.Advances in Neural Information Processing Systems, 34:11846–11858, 2021
Jie Lei, Tamara L Berg, and Mohit Bansal. Detecting moments and highlights in videos via natural language queries.Advances in Neural Information Processing Systems, 34:11846–11858, 2021. 4.1
2021
-
[34]
Otter: A multi-modal model with in-context instruction tuning.IEEE Transactions on Pattern Analysis and Machine Intelligence, 2025
Bo Li, Yuanhan Zhang, Liangyu Chen, Jinghao Wang, Fanyi Pu, Joshua Adrian Cahyono, Jingkang Yang, Chunyuan Li, and Ziwei Liu. Otter: A multi-modal model with in-context instruction tuning.IEEE Transactions on Pattern Analysis and Machine Intelligence, 2025. 1
2025
-
[35]
Bohao Li, Yuying Ge, Yi Chen, Yixiao Ge, Ruimao Zhang, and Ying Shan. Seed-bench-2-plus: Bench- marking multimodal large language models with text-rich visual comprehension.arXiv preprint arXiv:2404.16790, 2024. 4.1
-
[36]
SEED-Bench: Benchmarking Multimodal LLMs with Generative Comprehension
Bohao Li, Rui Wang, Guangzhi Wang, Yuying Ge, Yixiao Ge, and Ying Shan. Seed-bench: Benchmarking multimodal llms with generative comprehension.arXiv preprint arXiv:2307.16125, 2023. 4.1
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[37]
Blip: Bootstrapping language-image pre-training for unified vision-language understanding and generation
Junnan Li, Dongxu Li, Caiming Xiong, and Steven Hoi. Blip: Bootstrapping language-image pre-training for unified vision-language understanding and generation. InInternational Conference on Machine Learning, pages 12888–12900. PMLR, 2022. 2, 4.2.1
2022
-
[38]
Mvbench: A comprehensive multi-modal video understanding benchmark
Kunchang Li, Yali Wang, Yinan He, Yizhuo Li, Yi Wang, Yi Liu, Zun Wang, Jilan Xu, Guo Chen, Ping Luo, et al. Mvbench: A comprehensive multi-modal video understanding benchmark. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 22195–22206, 2024. 4.1
2024
-
[39]
Xiaotong Li, Fan Zhang, Haiwen Diao, Yueze Wang, Xinlong Wang, and Ling-Yu Duan. Densefusion-1m: Merging vision experts for comprehensive multimodal perception.arXiv preprint arXiv:2407.08303,
-
[40]
Uni-moe: Scaling unified multimodal llms with mixture of experts.IEEE Transactions on Pattern Analysis and Machine Intelligence, 2025
Yunxin Li, Shenyuan Jiang, Baotian Hu, Longyue Wang, Wanqi Zhong, Wenhan Luo, Lin Ma, and Min Zhang. Uni-moe: Scaling unified multimodal llms with mixture of experts.IEEE Transactions on Pattern Analysis and Machine Intelligence, 2025. 1
2025
-
[41]
Rouge: A package for automatic evaluation of summaries
Chin-Yew Lin. Rouge: A package for automatic evaluation of summaries. InText summarization branches out, pages 74–81, 2004. 1
2004
-
[42]
Visual instruction tuning.Advances in neural information processing systems, 36:34892–34916, 2023
Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning.Advances in neural information processing systems, 36:34892–34916, 2023. 1
2023
-
[43]
Mmbench: Is your multi-modal model an all-around player? In European conference on computer vision, pages 216–233
Yuan Liu, Haodong Duan, Yuanhan Zhang, Bo Li, Songyang Zhang, Wangbo Zhao, Yike Yuan, Jiaqi Wang, Conghui He, Ziwei Liu, et al. Mmbench: Is your multi-modal model an all-around player? In European conference on computer vision, pages 216–233. Springer, 2024. 4.1
2024
-
[44]
Tempcompass: Do video llms really understand videos? InFindings of the Association for Computational Linguistics: ACL 2024, pages 8731–8772, 2024
Yuanxin Liu, Shicheng Li, Yi Liu, Yuxiang Wang, Shuhuai Ren, Lei Li, Sishuo Chen, Xu Sun, and Lu Hou. Tempcompass: Do video llms really understand videos? InFindings of the Association for Computational Linguistics: ACL 2024, pages 8731–8772, 2024. 4.1
2024
-
[45]
Zijun Liu, Peiyi Wang, Runxin Xu, Shirong Ma, Chong Ruan, Peng Li, Yang Liu, and Yu Wu
ZijunLiu, PeiyiWang, RunxinXu, ShirongMa, ChongRuan, PengLi, YangLiu, andYuWu. Inference-time scaling for generalist reward modeling.arXiv preprint arXiv:2504.02495, 2025. 1, 3.1
-
[46]
Visual-RFT: Visual Reinforcement Fine-Tuning
Ziyu Liu, Zeyi Sun, Yuhang Zang, Xiaoyi Dong, Yuhang Cao, Haodong Duan, Dahua Lin, and Jiaqi Wang. Visual-rft: Visual reinforcement fine-tuning.arXiv preprint arXiv:2503.01785, 2025. 3.1
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[47]
Mathvista: Evaluating mathematical reasoning of foundation models in visual contexts
Pan Lu, Hritik Bansal, Tony Xia, Jiacheng Liu, Chunyuan Li, Hannaneh Hajishirzi, Hao Cheng, Kai-Wei Chang, Michel Galley, and Jianfeng Gao. Mathvista: Evaluating mathematical reasoning of foundation models in visual contexts. InInternational Conference on Learning Representations, volume 2024, pages 23439–23554, 2024. 4.1
2024
-
[48]
Xun Lu. Writing-zero: Bridge the gap between non-verifiable problems and verifiable rewards.arXiv preprint arXiv:2506.00103, 2025. 1, 3.1
-
[49]
GUI-R1 : A Generalist R1-Style Vision-Language Action Model For GUI Agents
Run Luo, Lu Wang, Wanwei He, and Xiaobo Xia. Gui-r1: A generalist r1-style vision-language action model for gui agents.arXiv preprint arXiv:2504.10458, 2025. 3.1 22 CapRL++: Unified Reinforcement Learning with Verifiable Rewards for Dense Image and Video Captioning
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[50]
Robust visual question answering: Datasets, methods, and future challenges.IEEE Transactions on Pattern Analysis and Machine Intelligence, 46(8):5575–5594, 2024
Jie Ma, Pinghui Wang, Dechen Kong, Zewei Wang, Jun Liu, Hongbin Pei, and Junzhou Zhao. Robust visual question answering: Datasets, methods, and future challenges.IEEE Transactions on Pattern Analysis and Machine Intelligence, 46(8):5575–5594, 2024. 3.3
2024
-
[51]
Video-chatgpt: Towards detailed video understanding via large vision and language models
Muhammad Maaz, Hanoona Rasheed, Salman Khan, and Fahad Khan. Video-chatgpt: Towards detailed video understanding via large vision and language models. InProceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 12585–12602, 2024. 2
2024
-
[52]
Chartqa: A benchmark for question answering about charts with visual and logical reasoning
Ahmed Masry, Xuan Long Do, Jia Qing Tan, Shafiq Joty, and Enamul Hoque. Chartqa: A benchmark for question answering about charts with visual and logical reasoning. InFindings of the association for computational linguistics: ACL 2022, pages 2263–2279, 2022. 4.1
2022
-
[53]
Chartqapro: A more diverse and challenging benchmark for chart question answering
Ahmed Masry, Mohammed Saidul Islam, Mahir Ahmed, Aayush Bajaj, Firoz Kabir, Aaryaman Kartha, Md Tahmid Rahman Laskar, Mizanur Rahman, Shadikur Rahman, Mehrad Shahmohammadi, et al. Chartqapro: A more diverse and challenging benchmark for chart question answering. InFindings of the Association for Computational Linguistics: ACL 2025, pages 19123–19151, 2025. 4.1
2025
-
[54]
Infographicvqa
Minesh Mathew, Viraj Bagal, Rubèn Tito, Dimosthenis Karatzas, Ernest Valveny, and CV Jawahar. Infographicvqa. InProceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pages 1697–1706, 2022. 4.1
2022
-
[55]
Docvqa: A dataset for vqa on document images
Minesh Mathew, Dimosthenis Karatzas, and CV Jawahar. Docvqa: A dataset for vqa on document images. InProceedings of the IEEE/CVF winter conference on applications of computer vision, pages 2200–2209,
-
[56]
Howto100m: Learning a text-video embedding by watching hundred million narrated video clips
Antoine Miech, Dimitri Zhukov, Jean-Baptiste Alayrac, Makarand Tapaswi, Ivan Laptev, and Josef Sivic. Howto100m: Learning a text-video embedding by watching hundred million narrated video clips. In Proceedings of the IEEE/CVF international conference on computer vision, pages 2630–2640, 2019. 2
2019
-
[57]
GPT-4V(ision) System Card
OpenAI. GPT-4V(ision) System Card. https://cdn.openai.com/papers/GPTV_System_Card. pdf, 2023. Accessed: 2026-05-08. 2
2023
-
[58]
Bleu: a method for automatic evaluation of machine translation
Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. Bleu: a method for automatic evaluation of machine translation. InProceedings of the 40th annual meeting of the Association for Computational Linguistics, pages 311–318, 2002. 1
2002
-
[59]
Aloha: A new measure for hallucination in captioning models
Suzanne Petryk, David Chan, Anish Kachinthaya, Haodi Zou, John Canny, Joseph Gonzalez, and Trevor Darrell. Aloha: A new measure for hallucination in captioning models. InProceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 2: Short Papers), pages 342–357, 202...
2024
-
[60]
Runqi Qiao, Qiuna Tan, Guanting Dong, MinhuiWu MinhuiWu, Chong Sun, Xiaoshuai Song, Jiapeng Wang, Zhuoma Gongque, Shanglin Lei, Yifan Zhang, et al. We-math: Does your large multimodal model achieve human-like mathematical reasoning? InProceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 200...
2025
-
[61]
Prism: A framework for decoupling and assessing the capabilities of vlms
Yuxuan Qiao, Haodong Duan, Xinyu Fang, Junming Yang, Lin Chen, Songyang Zhang, Jiaqi Wang, Dahua Lin, and Kai Chen. Prism: A framework for decoupling and assessing the capabilities of vlms. Advances in Neural Information Processing Systems, 37:111863–111898, 2024. 4.1, 4.2.2
2024
-
[62]
Learning transferable visual models from natural language supervision
Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. In International conference on machine learning, pages 8748–8763. PMLR,
-
[63]
Fusecap: Leveraging large language models for enriched fused image captions
Noam Rotstein, David Bensaid, Shaked Brody, Roy Ganz, and Ron Kimmel. Fusecap: Leveraging large language models for enriched fused image captions. In Proceedings of the IEEE/CVF winter conference on applications of computer vision, pages 5689–5700, 2024. 1
2024
-
[64]
Laion-5b: An open large-scale dataset for training next generation image-text models
Christoph Schuhmann, Romain Beaumont, Richard Vencu, Cade Gordon, Ross Wightman, Mehdi Cherti, Theo Coombes, Aarush Katta, Clayton Mullis, Mitchell Wortsman, et al. Laion-5b: An open large-scale dataset for training next generation image-text models. Advances in neural information processing systems, 35:25278–25294, 2022. 2
2022
-
[65]
Tomato: Assessing visual temporal reasoning capabilities in multimodal foundation models
Ziyao Shangguan, Chuhan Li, Yuxuan Ding, Yanan Zheng, Yilun Zhao, Tesca Fitzgerald, and Arman Cohan. Tomato: Assessing visual temporal reasoning capabilities in multimodal foundation models. In International Conference on Learning Representations, volume 2025, pages 7593–7734, 2025. 4.1
2025
-
[66]
DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models
Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, YK Li, Yang Wu, et al. DeepSeekMath: Pushing the limits of mathematical reasoning in open language models. arXiv preprint arXiv:2402.03300, 2024. 2, 3.1
2024
-
[67]
From show to tell: A survey on deep learning-based image captioning
Matteo Stefanini, Marcella Cornia, Lorenzo Baraldi, Silvia Cascianelli, Giuseppe Fiameni, and Rita Cucchiara. From show to tell: A survey on deep learning-based image captioning. IEEE transactions on pattern analysis and machine intelligence, 45(1):539–559, 2022. 1
2022
-
[68]
Yi Su, Dian Yu, Linfeng Song, Juntao Li, Haitao Mi, Zhaopeng Tu, Min Zhang, and Dong Yu. Crossing the reward bridge: Expanding rl with verifiable rewards across diverse domains. arXiv preprint arXiv:2503.23829, 2025. 1, 3.1
-
[69]
Descriptive caption enhancement with visual specialists for multimodal perception
Yanpeng Sun, Jing Hao, Ke Zhu, Jiang-Jiang Liu, Yuxiang Zhao, Xiaofan Li, Gang Zhang, Zechao Li, and Jingdong Wang. Descriptive caption enhancement with visual specialists for multimodal perception. arXiv preprint arXiv:2412.14233, 2024. 2
-
[70]
Kimi Team, Angang Du, Bofei Gao, Bowei Xing, Changjiu Jiang, Cheng Chen, Cheng Li, Chenjun Xiao, Chenzhuang Du, Chonghua Liao, et al. Kimi k1.5: Scaling reinforcement learning with llms. arXiv preprint arXiv:2501.12599, 2025. 2
2025
-
[71]
Yfcc100m: The new data in multimedia research
Bart Thomee, David A Shamma, Gerald Friedland, Benjamin Elizalde, Karl Ni, Douglas Poland, Damian Borth, and Li-Jia Li. Yfcc100m: The new data in multimedia research. Communications of the ACM, 59(2):64–73, 2016. 2
2016
-
[72]
Fastvlm: Efficient vision encoding for vision language models
Pavan Kumar Anasosalu Vasu, Fartash Faghri, Chun-Liang Li, Cem Koc, Nate True, Albert Antony, Gokula Santhanam, James Gabriel, Peter Grasch, Oncel Tuzel, et al. Fastvlm: Efficient vision encoding for vision language models. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 19769–19780, 2025. 1
2025
-
[73]
Sequence to sequence-video to text
Subhashini Venugopalan, Marcus Rohrbach, Jeffrey Donahue, Raymond Mooney, Trevor Darrell, and Kate Saenko. Sequence to sequence-video to text. In Proceedings of the IEEE international conference on computer vision, pages 4534–4542, 2015. 1
2015
-
[74]
Show and tell: A neural image caption generator
Oriol Vinyals, Alexander Toshev, Samy Bengio, and Dumitru Erhan. Show and tell: A neural image caption generator. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 3156–3164, 2015
2015
-
[75]
Show and tell: Lessons learned from the 2015 mscoco image captioning challenge
Oriol Vinyals, Alexander Toshev, Samy Bengio, and Dumitru Erhan. Show and tell: Lessons learned from the 2015 mscoco image captioning challenge. IEEE transactions on pattern analysis and machine intelligence, 39(4):652–663, 2016. 1
2016
-
[76]
Tarsier: Recipes for training and evaluating large video description models
Jiawei Wang, Liping Yuan, Yuchen Zhang, and Haomiao Sun. Tarsier: Recipes for training and evaluating large video description models. arXiv preprint arXiv:2407.00634, 2024. 4.1, 4.3.2
-
[77]
Measuring multimodal mathematical reasoning with math-vision dataset
Ke Wang, Junting Pan, Weikang Shi, Zimu Lu, Houxing Ren, Aojun Zhou, Mingjie Zhan, and Hongsheng Li. Measuring multimodal mathematical reasoning with math-vision dataset. Advances in Neural Information Processing Systems, 37:95095–95169, 2024. 4.1
2024
-
[78]
On diversity in image captioning: Metrics and methods
Qingzhong Wang, Jia Wan, and Antoni B Chan. On diversity in image captioning: Metrics and methods. IEEE Transactions on Pattern Analysis and Machine Intelligence, 44(2):1035–1049, 2020. 1
2020
-
[79]
Video dataflywheel: Resolving the impossible data trinity in video-language understanding
Xiao Wang, Jianlong Wu, Zijia Lin, Fuzheng Zhang, Di Zhang, and Liqiang Nie. Video dataflywheel: Resolving the impossible data trinity in video-language understanding. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2025. 2
2025
-
[80]
InternVideo: General Video Foundation Models via Generative and Discriminative Learning
Yi Wang, Kunchang Li, Yizhuo Li, Yinan He, Bingkun Huang, Zhiyu Zhao, Hongjie Zhang, Jilan Xu, Yi Liu, Zun Wang, et al. InternVideo: General video foundation models via generative and discriminative learning. arXiv preprint arXiv:2212.03191, 2022. 2
2022