DeltaRubric: Generative Multimodal Reward Modeling via Joint Planning and Verification
Recognition: 2 Lean theorem links
Pith reviewed 2026-05-12 04:38 UTC · model grok-4.3
The pith
Decomposing multimodal preference evaluation into instance-specific checklist generation and verification produces more reliable reward models.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
DeltaRubric reformulates multimodal preference evaluation as a plan-and-execute process within a single MLLM. The model first acts as a Disagreement Planner to generate a neutral, instance-specific verification checklist. It then transitions to a Checklist Verifier that executes these checks against the image and question to produce the final judgment. This is formulated as a multi-role reinforcement learning problem that jointly optimizes the planning and verification capabilities, resulting in improved accuracy on benchmarks like VL-RewardBench.
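As a concrete illustration of this two-step loop, the following minimal inference-time sketch shows one way a single model could be prompted in both roles. The chat wrapper, message format, prompt wording, and function names are hypothetical stand-ins, not the paper's released prompts or code.

# Minimal sketch of the plan-and-execute judgment loop (hypothetical prompts and API).
from typing import Dict, List

def chat(messages: List[Dict]) -> str:
    """Placeholder for a single MLLM call that accepts image + text messages."""
    raise NotImplementedError("wire this to your MLLM serving stack")

def plan_checklist(image_url: str, question: str, resp_a: str, resp_b: str) -> str:
    """Role 1, Disagreement Planner: draft a neutral, instance-specific checklist."""
    prompt = (
        "Without deciding a winner, list the concrete visual and factual checks "
        "that would reveal where these two answers differ.\n"
        f"Question: {question}\nAnswer A: {resp_a}\nAnswer B: {resp_b}"
    )
    return chat([{"role": "user", "content": [
        {"type": "image_url", "image_url": {"url": image_url}},
        {"type": "text", "text": prompt}]}])

def verify_and_judge(image_url: str, question: str, resp_a: str, resp_b: str,
                     checklist: str) -> str:
    """Role 2, Checklist Verifier: execute the checks against the image and pick A or B."""
    prompt = (
        "Execute each checklist item against the image and question, skip items that "
        "are irrelevant or contradicted by the image, then answer 'A' or 'B'.\n"
        f"Checklist:\n{checklist}\nQuestion: {question}\n"
        f"Answer A: {resp_a}\nAnswer B: {resp_b}"
    )
    return chat([{"role": "user", "content": [
        {"type": "image_url", "image_url": {"url": image_url}},
        {"type": "text", "text": prompt}]}])

def delta_rubric_judgment(image_url: str, question: str, resp_a: str, resp_b: str) -> str:
    checklist = plan_checklist(image_url, question, resp_a, resp_b)
    return verify_and_judge(image_url, question, resp_a, resp_b, checklist)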
What carries the argument
The two-role mechanism of Disagreement Planner followed by Checklist Verifier, jointly trained via multi-role reinforcement learning to produce grounded multimodal judgments.
Load-bearing premise
Self-generated checklists remain neutral and capture the key visual and factual differences without introducing new biases or requiring corrections.
What would settle it
Testing DeltaRubric on a benchmark with deliberately subtle visual discrepancies that checklists might miss would show whether the accuracy gains hold or performance reverts to baseline levels.
Original abstract
Aligning Multimodal Large Language Models (MLLMs) requires reliable reward models, yet existing single-step evaluators can suffer from lazy judging, exploiting language priors over fine-grained visual verification. While rubric-based evaluation mitigates these biases in text-only settings, extending it to multimodal tasks is bottlenecked by the complexity of visual reasoning. The critical differences between responses often depend on instance-specific visual details. Robust evaluation requires dynamically synthesizing rubrics that isolate spatial and factual discrepancies. To address this, we introduce $\textbf{DeltaRubric}$, an approach that reformulates multimodal preference evaluation as a plan-and-execute process within a single MLLM. DeltaRubric operates in two steps: acting first as a $\textit{Disagreement Planner}$, the model generates a neutral, instance-specific verification checklist. Transitioning into a $\textit{Checklist Verifier}$, it executes these self-generated checks against the image and question to produce the final grounded judgment. We formulate DeltaRubric as a multi-role reinforcement learning problem, jointly optimizing planning and verification capabilities. Validated on Qwen3-VL 4B and 8B Instruct models, DeltaRubric achieves solid empirical gains. For instance, on VL-RewardBench, it improves base model overall accuracy by $\textbf{+22.6}$ (4B) and $\textbf{+18.8}$ (8B) points, largely outperforming standard no-rubric baselines. The results demonstrate that decomposing evaluation into structured, verifiable steps leads to more reliable and generalizable multimodal reward modeling.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. DeltaRubric reformulates multimodal preference evaluation as a plan-and-execute process within a single MLLM. The model first acts as a Disagreement Planner to generate a neutral, instance-specific verification checklist and then as a Checklist Verifier to execute those checks against the image and question, producing a grounded judgment. The approach is cast as a multi-role RL problem that jointly optimizes planning and verification capabilities. Experiments on Qwen3-VL 4B and 8B Instruct models report accuracy gains of +22.6 and +18.8 points on VL-RewardBench relative to base models, outperforming standard no-rubric baselines.
Significance. If the reported gains are robustly attributable to the structured decomposition rather than RL shaping or benchmark artifacts, the work would advance generative multimodal reward modeling by mitigating lazy judging and language priors through dynamic, instance-specific rubrics. The joint optimization of planning and verification roles is a technically interesting contribution that could improve reliability in MLLM alignment pipelines.
major comments (2)
- [Experimental Results] The central empirical claim (accuracy gains of +22.6 / +18.8 on VL-RewardBench) is load-bearing for the paper's contribution, yet the manuscript provides no details on baseline implementations, statistical significance, controls for data leakage, or whether no-rubric baselines received equivalent RL training. Without these, it cannot be established that the plan-verify decomposition itself drives the improvement over baselines.
- [Method (Disagreement Planner and multi-role RL)] The Disagreement Planner is asserted to produce neutral, instance-specific checklists that isolate visual and factual discrepancies without introducing new biases. Because planning and verification are jointly optimized via multi-role RL on final preference judgments, no external constraint or independent validation ensures neutrality or completeness; checklists could be gamed to the reward signal. Ablation studies or human evaluations of checklist quality are required to support the claim that the decomposition, rather than RL optimization, produces the gains.
minor comments (1)
- [Method] The multi-role RL objective would benefit from an explicit equation or pseudocode clarifying how the planner and verifier roles are jointly optimized and how role-specific rewards are balanced.
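For illustration, one plausible shape for such a joint objective (an assumption for exposition, not the paper's stated formulation) uses a single policy $\pi_\theta$ that emits the checklist $c$ in the planner role and the judgment $\hat{y}$ in the verifier role, with both completions sharing the outcome reward against the gold preference $y^{*}$; the format bonus $r_{\mathrm{fmt}}$, its weight $\lambda$, and the KL coefficient $\beta$ are illustrative placeholders:

$$J(\theta) = \mathbb{E}_{(x,\, y^{*}) \sim \mathcal{D}}\; \mathbb{E}_{c \sim \pi_\theta(\cdot \mid x,\, \mathrm{plan})}\; \mathbb{E}_{\hat{y} \sim \pi_\theta(\cdot \mid x,\, c,\, \mathrm{verify})} \Bigl[ \mathbf{1}[\hat{y} = y^{*}] + \lambda\, r_{\mathrm{fmt}}(c, \hat{y}) \Bigr] \;-\; \beta\, \mathrm{KL}\bigl(\pi_\theta \,\|\, \pi_{\mathrm{ref}}\bigr)$$

A group-relative advantage estimator (GRPO-style) over sampled checklist-judgment pairs would be one standard way to optimize such an objective in practice.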
Simulated Author's Rebuttal
We thank the referee for their detailed and insightful comments on our manuscript. We address each of the major concerns below and outline the revisions we will make to strengthen the paper.
Point-by-point responses
Referee: [Experimental Results] The central empirical claim (accuracy gains of +22.6 / +18.8 on VL-RewardBench) is load-bearing for the paper's contribution, yet the manuscript provides no details on baseline implementations, statistical significance, controls for data leakage, or whether no-rubric baselines received equivalent RL training. Without these, it cannot be established that the plan-verify decomposition itself drives the improvement over baselines.
Authors: We agree that additional experimental details are necessary to substantiate our claims. In the revised manuscript, we will expand the experimental section to include: detailed specifications of all baseline implementations; results of statistical significance tests (such as bootstrap resampling or McNemar's test) with reported p-values; explicit verification that the VL-RewardBench splits used do not overlap with training data, to rule out leakage; and confirmation that the no-rubric baselines were trained with equivalent RL procedures, differing only in the absence of the planning role. These enhancements will allow readers to better attribute the observed gains to the structured decomposition. Revision: yes. (A minimal paired-bootstrap sketch of such a significance test appears after these responses.)
Referee: [Method (Disagreement Planner and multi-role RL)] The Disagreement Planner is asserted to produce neutral, instance-specific checklists that isolate visual and factual discrepancies without introducing new biases. Because planning and verification are jointly optimized via multi-role RL on final preference judgments, no external constraint or independent validation ensures neutrality or completeness; checklists could be gamed to the reward signal. Ablation studies or human evaluations of checklist quality are required to support the claim that the decomposition, rather than RL optimization, produces the gains.
Authors: We recognize the importance of validating the quality and neutrality of the generated checklists independently of the final reward signal. While the joint optimization encourages checklists that improve judgment accuracy, we will incorporate ablation experiments that isolate the planning component (e.g., comparing to a verifier-only RL setup) and conduct human evaluations on a subset of checklists to assess their neutrality, completeness, and lack of bias. These additions will provide direct evidence supporting the value of the plan-verify decomposition beyond RL effects alone. Revision: yes.
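One concrete way to carry out the promised significance testing is a paired bootstrap over per-item correctness. The sketch below is illustrative only, with hypothetical inputs; it is not code from the paper.

# Paired bootstrap for the accuracy gap between two judges (e.g., DeltaRubric vs. a
# no-rubric baseline) scored on the same benchmark items.
import numpy as np

def paired_bootstrap_gap(correct_a: np.ndarray, correct_b: np.ndarray,
                         n_boot: int = 10_000, seed: int = 0):
    """Return the observed accuracy gap (A minus B) and a two-sided bootstrap p-value."""
    rng = np.random.default_rng(seed)
    n = len(correct_a)
    assert len(correct_b) == n, "both judges must be scored on the same items"
    observed = correct_a.mean() - correct_b.mean()
    gaps = np.empty(n_boot)
    for i in range(n_boot):
        idx = rng.integers(0, n, size=n)  # resample items with replacement
        gaps[i] = correct_a[idx].mean() - correct_b[idx].mean()
    # Crude two-sided p-value: how often the resampled gap lands on the other side of zero.
    p_value = min(1.0, 2 * min((gaps <= 0).mean(), (gaps >= 0).mean()))
    return observed, p_value

# Toy usage (replace with real 0/1 correctness vectors per benchmark item):
# gap, p = paired_bootstrap_gap(np.array([1, 1, 0, 1, 1]), np.array([1, 0, 0, 1, 0]))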
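For the proposed human evaluation of checklist neutrality, a simple first check is inter-annotator agreement. The Cohen's kappa sketch below uses hypothetical labels and is offered only as an illustration of how such ratings might be scored.

# Cohen's kappa between two annotators labeling each generated checklist,
# e.g., as "neutral" vs. "biased" (labels and setup are hypothetical).
from collections import Counter

def cohens_kappa(labels_a, labels_b) -> float:
    assert labels_a and len(labels_a) == len(labels_b), "need paired, non-empty labels"
    n = len(labels_a)
    p_observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    p_expected = sum((counts_a[k] / n) * (counts_b[k] / n)
                     for k in set(counts_a) | set(counts_b))
    # Note: undefined (division by zero) if both annotators use a single label throughout.
    return (p_observed - p_expected) / (1 - p_expected)

# kappa = cohens_kappa(["neutral", "biased", "neutral"], ["neutral", "neutral", "neutral"])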
Circularity Check
No circularity: empirical RL method with independent benchmark validation
Full rationale
The paper describes DeltaRubric as a two-step plan-and-execute process inside a single MLLM, formulated as multi-role RL and evaluated empirically on VL-RewardBench and other benchmarks. No equations, predictions, or first-principles derivations are presented that reduce to fitted inputs or self-definitions by construction. The central claims rest on reported accuracy gains (+22.6 / +18.8 points) over no-rubric baselines rather than on any self-referential theorem or ansatz. Self-citations, if present, are not load-bearing for the method's validity, and the RL optimization is described as standard joint training without circular reduction to the target judgments.
Axiom & Free-Parameter Ledger
free parameters (1)
- RL role weights and hyperparameters
axioms (1)
- domain assumption: self-generated checklists can isolate spatial and factual discrepancies without bias
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel (unclear)
Unclear: relation between the paper passage and the cited Recognition theorem.
Paper passage: "DeltaRubric operates in two steps: acting first as a Disagreement Planner, the model generates a neutral, instance-specific verification checklist. Transitioning into a Checklist Verifier, it executes these self-generated checks"
- IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction (unclear)
Unclear: relation between the paper passage and the cited Recognition theorem.
Paper passage: "We formulate DeltaRubric as a multi-role reinforcement learning problem, jointly optimizing planning and verification capabilities"
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.