pith. machine review for the scientific record.

arxiv: 2605.09269 · v1 · submitted 2026-05-10 · 💻 cs.CL · cs.CV

Recognition: 2 theorem links · Lean Theorem

DeltaRubric: Generative Multimodal Reward Modeling via Joint Planning and Verification

Authors on Pith: no claims yet

Pith reviewed 2026-05-12 04:38 UTC · model grok-4.3

classification 💻 cs.CL cs.CV
keywords multimodal reward modeling · generative evaluation · checklist verification · reinforcement learning · vision language models · preference alignment · rubric generation

The pith

Decomposing multimodal preference evaluation into instance-specific checklist generation and verification produces more reliable reward models.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper aims to show that single-step evaluators for multimodal models often rely on language priors instead of actual visual checks, leading to biased rewards. It proposes DeltaRubric, where an MLLM first plans a neutral checklist of potential disagreements for a given image and response pair, then verifies each point against the image to decide the preference. This structured approach is trained with multi-role reinforcement learning that optimizes both planning and verification together. If successful, it would mean better alignment signals for multimodal large language models without needing external rubrics or multiple models.

Core claim

DeltaRubric reformulates multimodal preference evaluation as a plan-and-execute process within a single MLLM. The model first acts as a Disagreement Planner to generate a neutral, instance-specific verification checklist. It then transitions to a Checklist Verifier that executes these checks against the image and question to produce the final judgment. This is formulated as a multi-role reinforcement learning problem that jointly optimizes the planning and verification capabilities, resulting in improved accuracy on benchmarks like VL-RewardBench.
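
To make the two-role flow concrete, here is a minimal sketch of how such plan-and-execute judging could be wired around a single chat-style MLLM. The `query_mllm` helper, prompt wording, and output format are illustrative assumptions, not the paper's actual prompts or implementation.

```python
# Hedged sketch of the plan-and-execute evaluation described above.
# `query_mllm(image, prompt)` is a hypothetical helper that sends an image
# plus a text prompt to one shared MLLM and returns its text output.

def delta_rubric_judge(image, question, response_a, response_b, query_mllm):
    # Role 1: Disagreement Planner -- compare the two candidate responses and
    # emit a neutral, instance-specific checklist of their factual divergences.
    plan_prompt = (
        "Act as a Disagreement Planner. List, as a neutral verification "
        "checklist, the specific visual or factual points on which the two "
        "responses disagree.\n"
        f"Question: {question}\nResponse A: {response_a}\nResponse B: {response_b}"
    )
    checklist = query_mllm(image, plan_prompt)

    # Role 2: Checklist Verifier -- execute each check against the image and
    # question, then output the final grounded preference.
    verify_prompt = (
        "Act as a Checklist Verifier. Verify each checklist item against the "
        "image and question, then answer 'A' or 'B' for the better response.\n"
        f"Question: {question}\nChecklist:\n{checklist}\n"
        f"Response A: {response_a}\nResponse B: {response_b}"
    )
    verdict = query_mllm(image, verify_prompt)
    return checklist, verdict
```

Both roles are served by the same model weights; only the prompt switches the role, which is what makes the single-MLLM formulation possible.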

What carries the argument

The two-role mechanism of Disagreement Planner followed by Checklist Verifier, jointly trained via multi-role reinforcement learning to produce grounded multimodal judgments.

Load-bearing premise

Self-generated checklists stay neutral and accurately capture the key visual and factual differences without adding new biases or needing corrections.

What would settle it

Testing DeltaRubric on a benchmark with deliberately subtle visual discrepancies that checklists might miss would show whether accuracy gains hold or if performance reverts to baseline levels.

Figures

Figures reproduced from arXiv: 2605.09269 by Dian Yu, Haitao Mi, Leoweiliang, Pratap Tokekar, Rui Liu, Runpeng Dai, Tong Zheng, Yucheng Shi, Zhenwen Liang.

Figure 1: Overview of DeltaRubric. Given an input tuple (I, q, yA, yB), a single shared MLLM operates in two sequential roles. As a Disagreement Planner, the model analyzes the candidate responses to identify critical factual divergences and generates a neutral, instance-specific verification checklist. Example generated checklists are shown in the figure.
Figure 2: Planner and Verifier training dynamics. (a) and (b) compare the Verifier's training and validation accuracy (evaluated every 5 steps) against a no-rubric baseline. While both approaches improve, DeltaRubric overall achieves a higher mean accuracy in judging final responses. (c) tracks the Planner probe accuracy, an intermediate proxy measuring the fraction of sampled checklists that successfully guide a lig…
Figure 3: Qualitative comparison of evaluation methods. While the standard no-rubric baseline misses the hallucinated "cars" in Response A, DeltaRubric generates a targeted disagreement checklist. This explicitly enforces visual verification, allowing the model to successfully catch the hallucination and correctly select Response B.
Figure 4: Qualitative comparison of evaluation methods. The standard no-rubric baseline fails to verify fine-grained visual details and incorrectly validates the hallucinated "white shoes" in Response B. In contrast, DeltaRubric generates a targeted checklist that explicitly isolates the conflicting shoe color. By enforcing active visual verification, DeltaRubric successfully catches the hallucination and correctly …
Figure 5: Qualitative comparison of evaluation methods. The standard no-rubric baseline exhibits severe logical inconsistency: it correctly notes the absence of a tree branch but unaccountably still chooses Response A. In contrast, DeltaRubric generates a targeted checklist that systematically verifies the visual evidence (e.g., granular food in an open palm, not a branch), strictly enforcing logical consistency and…
Figure 6: Qualitative comparison of evaluation methods. The standard no-rubric baseline struggles with fine-grained visual attribute binding, confusing the white tissues with the actual color of the box to incorrectly select Response B. In contrast, DeltaRubric generates a targeted checklist that explicitly isolates and verifies the "green exterior," preventing this attribute confusion and correctly selecting Respon…
original abstract

Aligning Multimodal Large Language Models (MLLMs) requires reliable reward models, yet existing single-step evaluators can suffer from lazy judging, exploiting language priors over fine-grained visual verification. While rubric-based evaluation mitigates these biases in text-only settings, extending it to multimodal tasks is bottlenecked by the complexity of visual reasoning. The critical differences between responses often depend on instance-specific visual details. Robust evaluation requires dynamically synthesizing rubrics that isolate spatial and factual discrepancies. To address this, we introduce $\textbf{DeltaRubric}$, an approach that reformulates multimodal preference evaluation as a plan-and-execute process within a single MLLM. DeltaRubric operates in two steps: acting first as a $\textit{Disagreement Planner}$, the model generates a neutral, instance-specific verification checklist. Transitioning into a $\textit{Checklist Verifier}$, it executes these self-generated checks against the image and question to produce the final grounded judgment. We formulate DeltaRubric as a multi-role reinforcement learning problem, jointly optimizing planning and verification capabilities. Validated on Qwen3-VL 4B and 8B Instruct models, DeltaRubric achieves solid empirical gains. For instance, on VL-RewardBench, it improves base model overall accuracy by $\textbf{+22.6}$ (4B) and $\textbf{+18.8}$ (8B) points, largely outperforming standard no-rubric baselines. The results demonstrate that decomposing evaluation into structured, verifiable steps leads to more reliable and generalizable multimodal reward modeling.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. DeltaRubric reformulates multimodal preference evaluation as a plan-and-execute process within a single MLLM. The model first acts as a Disagreement Planner to generate a neutral, instance-specific verification checklist and then as a Checklist Verifier to execute those checks against the image and question, producing a grounded judgment. The approach is cast as a multi-role RL problem that jointly optimizes planning and verification capabilities. Experiments on Qwen3-VL 4B and 8B Instruct models report accuracy gains of +22.6 and +18.8 points on VL-RewardBench relative to base models, outperforming standard no-rubric baselines.

Significance. If the reported gains are robustly attributable to the structured decomposition rather than RL shaping or benchmark artifacts, the work would advance generative multimodal reward modeling by mitigating lazy judging and language priors through dynamic, instance-specific rubrics. The joint optimization of planning and verification roles is a technically interesting contribution that could improve reliability in MLLM alignment pipelines.

major comments (2)
  1. [Experimental Results] The central empirical claim (accuracy gains of +22.6 / +18.8 on VL-RewardBench) is load-bearing for the paper's contribution, yet the manuscript provides no details on baseline implementations, statistical significance, controls for data leakage, or whether no-rubric baselines received equivalent RL training. Without these, it cannot be established that the plan-verify decomposition itself drives the improvement over baselines.
  2. [Method (Disagreement Planner and multi-role RL)] The Disagreement Planner is asserted to produce neutral, instance-specific checklists that isolate visual and factual discrepancies without introducing new biases. Because planning and verification are jointly optimized via multi-role RL on final preference judgments, no external constraint or independent validation ensures neutrality or completeness; checklists could be gamed to the reward signal. Ablation studies or human evaluations of checklist quality are required to support the claim that the decomposition, rather than RL optimization, produces the gains.
minor comments (1)
  1. [Method] The multi-role RL objective would benefit from an explicit equation or pseudocode clarifying how the planner and verifier roles are jointly optimized and how role-specific rewards are balanced.
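
For illustration only, the kind of sketch this comment asks for might look like the following: a single policy samples a checklist (planner role) and then a judgment (verifier role), both segments share a terminal correctness reward, and advantages are computed group-relative in the style of GRPO. The reward terms, the `w_plan` weight, and the baseline choice are assumptions, not the paper's stated objective.

```python
import statistics

# Hedged sketch: joint reward and group-relative advantages for one batch of
# sampled (checklist, judgment) rollouts from the same shared policy.

def trajectory_reward(judgment, gold_label, checklist, w_plan=0.2):
    r_outcome = 1.0 if judgment == gold_label else 0.0  # verifier correctness
    r_plan = 1.0 if checklist.strip() else 0.0          # e.g., a well-formed, non-empty checklist
    return r_outcome + w_plan * r_plan

def group_relative_advantages(rewards):
    # GRPO-style baseline: normalize rewards within the sampled group so the
    # planner and verifier tokens of above-average rollouts are jointly reinforced.
    mu = statistics.mean(rewards)
    sigma = statistics.pstdev(rewards) or 1.0
    return [(r - mu) / sigma for r in rewards]

# Example: rewards for several rollouts of one (image, question, yA, yB) instance.
# advs = group_relative_advantages([trajectory_reward(j, "B", c)
#                                   for j, c in sampled_rollouts])
```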

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their detailed and insightful comments on our manuscript. We address each of the major concerns below and outline the revisions we will make to strengthen the paper.

point-by-point responses
  1. Referee: [Experimental Results] The central empirical claim (accuracy gains of +22.6 / +18.8 on VL-RewardBench) is load-bearing for the paper's contribution, yet the manuscript provides no details on baseline implementations, statistical significance, controls for data leakage, or whether no-rubric baselines received equivalent RL training. Without these, it cannot be established that the plan-verify decomposition itself drives the improvement over baselines.

    Authors: We agree that additional experimental details are necessary to substantiate our claims. In the revised manuscript, we will expand the experimental section to include: detailed specifications of all baseline implementations; results of statistical significance tests (such as bootstrap resampling or McNemar's test) with reported p-values; explicit verification that the VL-RewardBench splits used do not overlap with training data to rule out leakage; and confirmation that the no-rubric baselines were trained with equivalent RL procedures, differing only in the absence of the planning role. These enhancements will allow readers to better attribute the observed gains to the structured decomposition. revision: yes

  2. Referee: [Method (Disagreement Planner and multi-role RL)] The Disagreement Planner is asserted to produce neutral, instance-specific checklists that isolate visual and factual discrepancies without introducing new biases. Because planning and verification are jointly optimized via multi-role RL on final preference judgments, no external constraint or independent validation ensures neutrality or completeness; checklists could be gamed to the reward signal. Ablation studies or human evaluations of checklist quality are required to support the claim that the decomposition, rather than RL optimization, produces the gains.

    Authors: We recognize the importance of validating the quality and neutrality of the generated checklists independently of the final reward signal. While the joint optimization encourages checklists that improve judgment accuracy, we will incorporate ablation experiments that isolate the planning component (e.g., comparing to a verifier-only RL setup) and conduct human evaluations on a subset of checklists to assess their neutrality, completeness, and lack of bias. These additions will provide direct evidence supporting the value of the plan-verify decomposition beyond RL effects alone. revision: yes
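
On point 1, a paired significance test of the kind mentioned above (McNemar's test on per-example correctness of DeltaRubric versus the no-rubric baseline) can be computed with nothing beyond the standard library; the correctness vectors here are hypothetical inputs, not reported data.

```python
from math import comb

def mcnemar_exact(correct_a, correct_b):
    """Two-sided exact McNemar p-value for paired per-example correctness
    of two evaluators judged on the same benchmark items."""
    b = sum(1 for a, c in zip(correct_a, correct_b) if a and not c)   # A right, B wrong
    c_ = sum(1 for a, c in zip(correct_a, correct_b) if not a and c)  # B right, A wrong
    n = b + c_
    if n == 0:
        return 1.0
    k = min(b, c_)
    # Under H0 the n discordant pairs split 50/50; sum the lower tail and double it.
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)

# Hypothetical usage: 1 = correct preference, 0 = incorrect, aligned per item.
# p_value = mcnemar_exact(delta_rubric_correct, no_rubric_correct)
```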
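On point 2, a small human study of checklist quality could be summarized as simple per-criterion rates; the binary criteria (`neutral`, `complete`) and the label format are assumptions made for illustration, not the authors' protocol.

```python
from collections import defaultdict

def summarize_checklist_eval(labels, criteria=("neutral", "complete")):
    # labels: one dict per (checklist, annotator) pair, e.g.
    # {"checklist_id": 3, "neutral": 1, "complete": 0}
    totals = defaultdict(list)
    for row in labels:
        for crit in criteria:
            totals[crit].append(row[crit])
    return {crit: sum(v) / len(v) for crit, v in totals.items()}

# Hypothetical usage:
# summarize_checklist_eval([
#     {"checklist_id": 0, "neutral": 1, "complete": 1},
#     {"checklist_id": 1, "neutral": 1, "complete": 0},
# ])  # -> {"neutral": 1.0, "complete": 0.5}
```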

Circularity Check

0 steps flagged

No circularity: empirical RL method with independent benchmark validation

full rationale

The paper describes DeltaRubric as a two-step plan-and-execute process inside a single MLLM, formulated as multi-role RL and evaluated empirically on VL-RewardBench and other benchmarks. No equations, predictions, or first-principles derivations are presented that reduce to fitted inputs or self-definitions by construction. The central claims rest on reported accuracy gains (+22.6 / +18.8 points) over no-rubric baselines rather than on any self-referential theorem or ansatz. Self-citations, if present, are not load-bearing for the method's validity, and the RL optimization is described as standard joint training without circular reduction to the target judgments.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The central claim rests on the assumption that MLLMs can reliably generate neutral, instance-specific checklists and that RL can jointly optimize the two roles; limited information from the abstract prevents full enumeration of parameters or entities.

free parameters (1)
  • RL role weights and hyperparameters
    Multi-role reinforcement learning for planning and verification likely requires tuned weights and learning rates fitted to achieve the reported gains.
axioms (1)
  • domain assumption: Self-generated checklists can isolate spatial and factual discrepancies without bias
    Invoked in the Disagreement Planner step to enable grounded verification.

pith-pipeline@v0.9.0 · 5607 in / 1274 out tokens · 31911 ms · 2026-05-12T04:38:14.580532+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

60 extracted references · 60 canonical work pages · 12 internal anchors

  1. [1]

    Qwen3-VL Technical Report

    Shuai Bai, Yuxuan Cai, Ruizhe Chen, Keqin Chen, Xionghui Chen, Zesen Cheng, Lianghao Deng, Wei Ding, Chang Gao, Chunjiang Ge, et al. Qwen3-vl technical report.arXiv preprint arXiv:2511.21631, 2025

  2. [2]

    Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback

    Yuntao Bai, Andy Jones, Kamal Ndousse, Amanda Askell, Anna Chen, Nova DasSarma, Dawn Drain, Stanislav Fort, Deep Ganguli, Tom Henighan, et al. Training a helpful and harmless assistant with reinforcement learning from human feedback.arXiv preprint arXiv:2204.05862, 2022

  3. [3]

    Mllm-as-a-judge: Assessing multimodal llm-as-a-judge with vision-language benchmark

    Dongping Chen, Ruoxi Chen, Shilin Zhang, Yaochen Wang, Yinuo Liu, Huichi Zhou, Qihui Zhang, Yao Wan, Pan Zhou, and Lichao Sun. Mllm-as-a-judge: Assessing multimodal llm-as-a-judge with vision-language benchmark. In Forty-first International Conference on Machine Learning, 2024

  4. [4]

    Internvl: Scaling up vision foundation models and aligning for generic visual-linguistic tasks

    Zhe Chen, Jiannan Wu, Wenhai Wang, Weijie Su, Guo Chen, Sen Xing, Muyan Zhong, Qinglong Zhang, Xizhou Zhu, Lewei Lu, et al. Internvl: Scaling up vision foundation models and aligning for generic visual-linguistic tasks. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 24185–24198, 2024

  5. [5]

    Cde: Curiosity-driven exploration for efficient reinforcement learning in large language models

    Runpeng Dai, Linfeng Song, Haolin Liu, Zhenwen Liang, Dian Yu, Haitao Mi, Zhaopeng Tu, Rui Liu, Tong Zheng, Hongtu Zhu, et al. Cde: Curiosity-driven exploration for efficient reinforcement learning in large language models.arXiv preprint arXiv:2509.09675, 2025

  6. [6]

    NVLM: Open frontier-class multimodal LLMs

    Wenliang Dai, Nayeon Lee, Boxin Wang, Zhuolin Yang, Zihan Liu, Jon Barker, Tuomas Rintamaki, Mohammad Shoeybi, Bryan Catanzaro, and Wei Ping. Nvlm: Open frontier-class multimodal llms.arXiv preprint arXiv:2409.11402, 2024

  7. [7]

    Deepseek-v4: Towards highly efficient million-token context intelligence

    DeepSeek-AI. Deepseek-v4: Towards highly efficient million-token context intelligence, 2026

  8. [8]

    Molmo and pixmo: Open weights and open data for state-of-the-art vision-language models

    Matt Deitke, Christopher Clark, Sangho Lee, Rohun Tripathi, Yue Yang, Jae Sung Park, Mohammadreza Salehi, Niklas Muennighoff, Kyle Lo, Luca Soldaini, et al. Molmo and pixmo: Open weights and open data for state-of-the-art vision-language models. InProceedings of the Computer Vision and Pattern Recognition Conference, pages 91–104, 2025

  9. [9]

    Arm-thinker: Reinforcing multimodal generative reward models with agentic tool use and visual reasoning

    Shengyuan Ding, Xinyu Fang, Ziyu Liu, Yuhang Zang, Yuhang Cao, Xiangyu Zhao, Haodong Duan, Xiaoyi Dong, Jianze Liang, Bin Wang, et al. Arm-thinker: Reinforcing multimodal generative reward models with agentic tool use and visual reasoning. arXiv preprint arXiv:2512.05111, 2025

  10. [10]

    Vita-1.5: Towards gpt-4o level real-time vision and speech interaction

    Chaoyou Fu, Haojia Lin, Xiong Wang, Yi-Fan Zhang, Yunhang Shen, Xiaoyu Liu, Haoyu Cao, Zuwei Long, Heting Gao, Ke Li, et al. Vita-1.5: Towards gpt-4o level real-time vision and speech interaction.arXiv preprint arXiv:2501.01957, 2025

  11. [11]

    The Llama 3 Herd of Models

    Aaron Grattafiori, Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Alex Vaughan, et al. The llama 3 herd of models.arXiv preprint arXiv:2407.21783, 2024

  12. [12]

    Rubrics as Rewards: Reinforcement Learning Beyond Verifiable Domains

    Anisha Gunjal, Anthony Wang, Elaine Lau, Vaskar Nath, Yunzhong He, Bing Liu, and Sean Hendryx. Rubrics as rewards: Reinforcement learning beyond verifiable domains.arXiv preprint arXiv:2507.17746, 2025

  13. [13]

    DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning

    Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Peiyi Wang, Qihao Zhu, Runxin Xu, Ruoyu Zhang, Shirong Ma, Xiao Bi, et al. Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning.arXiv preprint arXiv:2501.12948, 2025

  14. [14]

    Opera: Alleviating hallucination in multi-modal large language models via over-trust penalty and retrospection-allocation

    Qidong Huang, Xiaoyi Dong, Pan Zhang, Bin Wang, Conghui He, Jiaqi Wang, Dahua Lin, Weiming Zhang, and Nenghai Yu. Opera: Alleviating hallucination in multi-modal large language models via over-trust penalty and retrospection-allocation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 13418–13427, 2024

  15. [15]

    Reinforcement learning with rubric anchors

    Zenan Huang, Yihong Zhuang, Guoshan Lu, Zeyu Qin, Haokai Xu, Tianyu Zhao, Ru Peng, Jiaqi Hu, Zhanming Shen, Xiaomeng Hu, et al. Reinforcement learning with rubric anchors. arXiv preprint arXiv:2508.12790, 2025

  16. [16]

    AutoRubric: Rubric-Based Generative Rewards for Faithful Multimodal Reasoning

    Mengzhao Jia, Zhihan Zhang, Ignacio Cases, Zheyuan Liu, Meng Jiang, and Peng Qi. Autorubric-r1v: Rubric-based generative rewards for faithful multimodal reasoning.arXiv preprint arXiv:2510.14738, 2025

  17. [17]

    Prometheus: Inducing fine-grained evaluation capability in language models

    Seungone Kim, Jamin Shin, Yejin Cho, Joel Jang, Shayne Longpre, Hwaran Lee, Sangdoo Yun, Seongjin Shin, Sungdong Kim, James Thorne, et al. Prometheus: Inducing fine-grained evaluation capability in language models. InThe Twelfth International Conference on Learning Representations, 2023

  18. [18]

    Reinforcement Learning from Human Feedback

    Nathan Lambert. Reinforcement learning from human feedback.arXiv preprint arXiv:2504.12501, 2025

  19. [19]

    Tulu 3: Pushing Frontiers in Open Language Model Post-Training

    Nathan Lambert, Jacob Morrison, Valentina Pyatkin, Shengyi Huang, Hamish Ivison, Faeze Brahman, Lester James V Miranda, Alisa Liu, Nouha Dziri, Shane Lyu, et al. Tulu 3: Pushing frontiers in open language model post-training.arXiv preprint arXiv:2411.15124, 2024

  20. [20]

    Rewardbench: Evaluating reward models for language modeling

    Nathan Lambert, Valentina Pyatkin, Jacob Morrison, Lester James Validad Miranda, Bill Yuchen Lin, Khyathi Chandu, Nouha Dziri, Sachin Kumar, Tom Zick, Yejin Choi, et al. Rewardbench: Evaluating reward models for language modeling. In Findings of the Association for Computational Linguistics: NAACL 2025, pages 1755–1797, 2025

  21. [21]

    Vl-rewardbench: a challenging benchmark for vision-language generative reward models

    Lei Li, Yuancheng Wei, Zhihui Xie, Xuqing Yang, Yifan Song, Peiyi Wang, Chenxin An, Tianyu Liu, Sujian Li, Bill Yuchen Lin, et al. Vl-rewardbench: a challenging benchmark for vision-language generative reward models. InProceedings of the Computer Vision and Pattern Recognition Conference, pages 24657–24668, 2025

  22. [22]

    Self-Rewarding Vision-Language Model via Reasoning Decomposition

    Zongxia Li, Wenhao Yu, Chengsong Huang, Rui Liu, Zhenwen Liang, Fuxiao Liu, Jingxi Che, Dian Yu, Jordan Boyd-Graber, Haitao Mi, et al. Self-rewarding vision-language model via reasoning decomposition.arXiv preprint arXiv:2508.19652, 2025

  23. [23]

    Stable and efficient single-rollout rl for multimodal reasoning

    Rui Liu, Dian Yu, Lei Ke, Haolin Liu, Yujun Zhou, Zhenwen Liang, Haitao Mi, Pratap Tokekar, and Dong Yu. Stable and efficient single-rollout rl for multimodal reasoning.arXiv preprint arXiv:2512.18215, 2025

  24. [24]

    Vogue: Guiding exploration with visual uncertainty improves multimodal reasoning

    Rui Liu, Dian Yu, Tong Zheng, Runpeng Dai, Zongxia Li, Wenhao Yu, Zhenwen Liang, Linfeng Song, Haitao Mi, Pratap Tokekar, et al. Vogue: Guiding exploration with visual uncertainty improves multimodal reasoning. arXiv preprint arXiv:2510.01444, 2025

  25. [25]

    Openrubrics: Towards scalable synthetic rubric generation for reward modeling and llm alignment

    Tianci Liu, Ran Xu, Tony Yu, Ilgee Hong, Carl Yang, Tuo Zhao, and Haoyu Wang. Openrubrics: Towards scalable synthetic rubric generation for reward modeling and llm alignment.arXiv preprint arXiv:2510.07743, 2025

  26. [26]

    Decoupled weight decay regularization

    Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization. In International Conference on Learning Representations, 2019. URL https://openreview.net/forum?id=Bkg6RiCqY7

  27. [27]

    Training language models to follow instructions with human feedback

    Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, et al. Training language models to follow instructions with human feedback.Advances in neural information processing systems, 35:27730–27744, 2022

  28. [28]

    Direct preference optimization: Your language model is secretly a reward model

    Rafael Rafailov, Archit Sharma, Eric Mitchell, Christopher D Manning, Stefano Ermon, and Chelsea Finn. Direct preference optimization: Your language model is secretly a reward model. Advances in neural information processing systems, 36:53728–53741, 2023

  29. [29]

    Self-critiquing models for assisting human evaluators

    William Saunders, Catherine Yeh, Jeff Wu, Steven Bills, Long Ouyang, Jonathan Ward, and Jan Leike. Self-critiquing models for assisting human evaluators. arXiv preprint arXiv:2206.05802, 2022

  30. [30]

    Dr tulu: Reinforcement learning with evolving rubrics for deep research

    Rulin Shao, Akari Asai, Shannon Zejiang Shen, Hamish Ivison, Varsha Kishore, Jingming Zhuo, Xinran Zhao, Molly Park, Samuel G Finlayson, David Sontag, et al. Dr tulu: Reinforcement learning with evolving rubrics for deep research.arXiv preprint arXiv:2511.19399, 2025

  31. [31]

    DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models

    Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, YK Li, Yang Wu, et al. Deepseekmath: Pushing the limits of mathematical reasoning in open language models.arXiv preprint arXiv:2402.03300, 2024

  32. [32]

    Reinforcing chain-of-thought reasoning with self-evolving rubrics

    Leheng Sheng, Wenchang Ma, Ruixin Hong, Xiang Wang, An Zhang, and Tat-Seng Chua. Reinforcing chain-of-thought reasoning with self-evolving rubrics. arXiv preprint arXiv:2602.10885, 2026

  33. [33]

    A long way to go: Investigating length correlations in RLHF

    Prasann Singhal, Tanya Goyal, Jiacheng Xu, and Greg Durrett. A long way to go: Investigating length correlations in rlhf.arXiv preprint arXiv:2310.03716, 2023

  34. [34]

    Aligning large multimodal models with factually augmented rlhf

    Zhiqing Sun, Sheng Shen, Shengcao Cao, Haotian Liu, Chunyuan Li, Yikang Shen, Chuang Gan, Liangyan Gui, Yu-Xiong Wang, Yiming Yang, et al. Aligning large multimodal models with factually augmented rlhf. InFindings of the Association for Computational Linguistics: ACL 2024, pages 13088–13110, 2024

  35. [35]

    ReFT: Reasoning with reinforced fine-tuning

    Luong Trung, Xinbo Zhang, Zhanming Jie, Peng Sun, Xiaoran Jin, and Hang Li. ReFT: Reasoning with reinforced fine-tuning. In Lun-Wei Ku, Andre Martins, and Vivek Srikumar, editors, Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 7601–7614, Bangkok, Thailand, August

  36. [36]

    doi: 10.18653/v1/2024.acl-long.410

    Association for Computational Linguistics. doi: 10.18653/v1/2024.acl-long.410. URL https://aclanthology.org/2024.acl-long.410/

  37. [37]

    Checklists are better than reward models for aligning language models

    Vijay Viswanathan, Yanchao Sun, Shuang Ma, Xiang Kong, Meng Cao, Graham Neubig, and Tongshuang Wu. Checklists are better than reward models for aligning language models.arXiv preprint arXiv:2507.18624, 2025

  38. [38]

    Msrl: Scaling generative multimodal reward modeling via multi-stage reinforcement learning

    Chenglong Wang, Yifu Huo, Yang Gan, Qiaozhi He, Qi Meng, Bei Li, Yan Wang, Junfu Liu, Tianhua Zhou, Jingbo Zhu, et al. Msrl: Scaling generative multimodal reward modeling via multi-stage reinforcement learning.arXiv preprint arXiv:2603.25108, 2026

  39. [39]

    Unified multimodal chain-of-thought reward model through reinforcement fine-tuning

    Yibin Wang, Zhimin Li, Yuhang Zang, Chunyu Wang, Qinglin Lu, Cheng Jin, and Jiaqi Wang. Unified multimodal chain-of-thought reward model through reinforcement fine-tuning.arXiv preprint arXiv:2505.03318, 2025

  40. [40]

    Chain-of-thought prompting elicits reasoning in large language models

    Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Fei Xia, Ed Chi, Quoc V Le, Denny Zhou, et al. Chain-of-thought prompting elicits reasoning in large language models. Advances in neural information processing systems, 35:24824–24837, 2022

  41. [41]

    Llava-critic: Learning to evaluate multimodal models

    Tianyi Xiong, Xiyao Wang, Dong Guo, Qinghao Ye, Haoqi Fan, Quanquan Gu, Heng Huang, and Chunyuan Li. Llava-critic: Learning to evaluate multimodal models. InProceedings of the Computer Vision and Pattern Recognition Conference, pages 13618–13628, 2025

  42. [42]

    Alternating reinforcement learning for rubric-based reward modeling in non-verifiable llm post-training

    Ran Xu, Tianci Liu, Zihan Dong, Tony Yu, Ilgee Hong, Carl Yang, Linjun Zhang, Tao Zhao, and Haoyu Wang. Alternating reinforcement learning for rubric-based reward modeling in non-verifiable llm post-training.arXiv preprint arXiv:2602.01511, 2026

  43. [43]

    Multimodal rewardbench: Holistic evaluation of reward models for vision language models

    Michihiro Yasunaga, Luke Zettlemoyer, and Marjan Ghazvininejad. Multimodal rewardbench: Holistic evaluation of reward models for vision language models. arXiv preprint arXiv:2502.14191, 2025

  44. [44]

    DAPO: An Open-Source LLM Reinforcement Learning System at Scale

    Qiying Yu, Zheng Zhang, Ruofei Zhu, Yufeng Yuan, Xiaochen Zuo, Yu Yue, Weinan Dai, Tiantian Fan, Gaohong Liu, Lingjun Liu, et al. Dapo: An open-source llm reinforcement learning system at scale.arXiv preprint arXiv:2503.14476, 2025

  45. [45]

    Rlhf-v: Towards trustworthy mllms via behavior alignment from fine-grained correctional human feedback

    Tianyu Yu, Yuan Yao, Haoye Zhang, Taiwen He, Yifeng Han, Ganqu Cui, Jinyi Hu, Zhiyuan Liu, Hai-Tao Zheng, Maosong Sun, et al. Rlhf-v: Towards trustworthy mllms via behavior alignment from fine-grained correctional human feedback. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 13807–13816, 2024

  46. [46]

    Rlaif-v: Open-source ai feedback leads to super gpt-4v trustworthiness

    Tianyu Yu, Haoye Zhang, Qiming Li, Qixin Xu, Yuan Yao, Da Chen, Xiaoman Lu, Ganqu Cui, Yunkai Dang, Taiwen He, et al. Rlaif-v: Open-source ai feedback leads to super gpt-4v trustworthiness. InProceedings of the Computer Vision and Pattern Recognition Conference, pages 19985–19995, 2025

  47. [47]

    Benchmarking large multimodal models against common corruptions

    Jiawei Zhang, Tianyu Pang, Chao Du, Yi Ren, Bo Li, and Min Lin. Benchmarking large multimodal models against common corruptions.arXiv preprint arXiv:2401.11943, 2024

  48. [48]

    R1-reward: Training multimodal reward model through stable reinforcement learning

    Yi-Fan Zhang, Xingyu Lu, Xiao Hu, Chaoyou Fu, Bin Wen, Tianke Zhang, Changyi Liu, Kaiyu Jiang, Kaibing Chen, Kaiyu Tang, et al. R1-reward: Training multimodal reward model through stable reinforcement learning.arXiv preprint arXiv:2505.02835, 2025

  49. [49]

    Mm-rlhf: The next step forward in multimodal llm alignment

    Yi-Fan Zhang, Tao Yu, Haochen Tian, Chaoyou Fu, Peiyan Li, Jianshu Zeng, Wulin Xie, Yang Shi, Huanyu Zhang, Junkang Wu, et al. Mm-rlhf: The next step forward in multimodal llm alignment.arXiv preprint arXiv:2502.10391, 2025

  50. [50]

    Basereward: A strong baseline for multimodal reward model

    YiFan Zhang, Haihua Yang, Huanyu Zhang, Yang Shi, Zezhou Chen, Haochen Tian, Chaoyou Fu, Kai WU, Bo Cui, Xu Wang, Jianfei Pan, Haotian Wang, Zhang Zhang, and Liang Wang. Basereward: A strong baseline for multimodal reward model. In The Fourteenth International Conference on Learning Representations, 2026. URL https://openreview.net/forum?id=EuN5iszF0a

  51. [51]

    Judging llm-as-a-judge with mt-bench and chatbot arena

    Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Siyuan Zhuang, Zhanghao Wu, Yonghao Zhuang, Zi Lin, Zhuohan Li, Dacheng Li, Eric Xing, et al. Judging llm-as-a-judge with mt-bench and chatbot arena.Advances in neural information processing systems, 36:46595–46623, 2023

  52. [52]

    Parallel-r1: Towards parallel thinking via reinforcement learning

    Tong Zheng, Hongming Zhang, Wenhao Yu, Xiaoyang Wang, Runpeng Dai, Rui Liu, Huiwen Bao, Chengsong Huang, Heng Huang, and Dong Yu. Parallel-r1: Towards parallel thinking via reinforcement learning.arXiv preprint arXiv:2509.07980, 2025

  53. [53]

    Easyr1: An efficient, scalable, multi-modality rl training framework

    Yaowei Zheng, Junting Lu, Shenzhi Wang, Zhangchi Feng, Dongdong Kuang, Yuwen Xiong, and Richong Zhang. Easyr1: An efficient, scalable, multi-modality rl training framework. https://github.com/hiyouga/EasyR1, 2025

  54. [54]

    A comprehensive survey of reward models: Taxonomy, applications, challenges, and future

    Jialun Zhong, Wei Shen, Yanzeng Li, Songyang Gao, Hua Lu, Yicheng Chen, Yang Zhang, Wei Zhou, Jinjie Gu, and Lei Zou. A comprehensive survey of reward models: Taxonomy, applications, challenges, and future.arXiv preprint arXiv:2504.12328, 2025

  55. [55]

    InternVL3: Exploring Advanced Training and Test-Time Recipes for Open-Source Multimodal Models

    Jinguo Zhu, Weiyun Wang, Zhe Chen, Zhaoyang Liu, Shenglong Ye, Lixin Gu, Hao Tian, Yuchen Duan, Weijie Su, Jie Shao, et al. Internvl3: Exploring advanced training and test-time recipes for open-source multimodal models. arXiv preprint arXiv:2504.10479, 2025

  56. [56]

    Directly answers the question using the information relevant to the image

  57. [57]

    Makes factual claims that are consistent with the image and avoids unsupported details

  58. [58]

    Correctly identifies important visual information when it matters for the question

  59. [59]

    Uses sound reasoning and logical inference where needed

  60. [60]

    DeltaRubric Evaluation Prompt You are a fair judge

    Gives a clear and complete answer. DeltaRubric Evaluation Prompt You are a fair judge. Decide which response better answers the question below based on the image. Use the verification checklist as a sequence of checks to execute, not as evidence. If a checklist item is irrelevant, too vague, or contradicted by the image/question, ignore that item. Questio...