pith. machine review for the scientific record.

arxiv: 2605.09269 · v1 · submitted 2026-05-10 · 💻 cs.CL · cs.CV

Recognition: 2 theorem links · Lean Theorem

DeltaRubric: Generative Multimodal Reward Modeling via Joint Planning and Verification

Authors on Pith: no claims yet

Pith reviewed 2026-05-12 04:38 UTC · model grok-4.3

classification 💻 cs.CL cs.CV
keywords multimodal reward modeling · generative evaluation · checklist verification · reinforcement learning · vision language models · preference alignment · rubric generation

The pith

Decomposing multimodal preference evaluation into instance-specific checklist generation and verification produces more reliable reward models.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper aims to show that single-step evaluators for multimodal models often rely on language priors instead of actual visual checks, leading to biased rewards. It proposes DeltaRubric, where an MLLM first plans a neutral checklist of potential disagreements for a given image and response pair, then verifies each point against the image to decide the preference. This structured approach is trained with multi-role reinforcement learning that optimizes both planning and verification together. If successful, it would mean better alignment signals for multimodal large language models without needing external rubrics or multiple models.

Core claim

DeltaRubric reformulates multimodal preference evaluation as a plan-and-execute process within a single MLLM. The model first acts as a Disagreement Planner to generate a neutral, instance-specific verification checklist. It then transitions to a Checklist Verifier that executes these checks against the image and question to produce the final judgment. This is formulated as a multi-role reinforcement learning problem that jointly optimizes the planning and verification capabilities, resulting in improved accuracy on benchmarks like VL-RewardBench.
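
To make the two-role flow concrete, here is a minimal sketch of how such plan-and-execute judging could be wired around a single chat-style MLLM. The `query_mllm` helper, prompt wording, and output format are illustrative assumptions, not the paper's actual prompts or implementation.

```python
# Hedged sketch of the plan-and-execute evaluation described above.
# `query_mllm(image, prompt)` is a hypothetical helper that sends an image
# plus a text prompt to one shared MLLM and returns its text output.

def delta_rubric_judge(image, question, response_a, response_b, query_mllm):
    # Role 1: Disagreement Planner -- compare the two candidate responses and
    # emit a neutral, instance-specific checklist of their factual divergences.
    plan_prompt = (
        "Act as a Disagreement Planner. List, as a neutral verification "
        "checklist, the specific visual or factual points on which the two "
        "responses disagree.\n"
        f"Question: {question}\nResponse A: {response_a}\nResponse B: {response_b}"
    )
    checklist = query_mllm(image, plan_prompt)

    # Role 2: Checklist Verifier -- execute each check against the image and
    # question, then output the final grounded preference.
    verify_prompt = (
        "Act as a Checklist Verifier. Verify each checklist item against the "
        "image and question, then answer 'A' or 'B' for the better response.\n"
        f"Question: {question}\nChecklist:\n{checklist}\n"
        f"Response A: {response_a}\nResponse B: {response_b}"
    )
    verdict = query_mllm(image, verify_prompt)
    return checklist, verdict
```

Both roles are served by the same model weights; only the prompt switches the role, which is what makes the single-MLLM formulation possible.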

What carries the argument

The two-role mechanism of Disagreement Planner followed by Checklist Verifier, jointly trained via multi-role reinforcement learning to produce grounded multimodal judgments.

Load-bearing premise

Self-generated checklists stay neutral and accurately capture the key visual and factual differences without adding new biases or needing corrections.

What would settle it

Testing DeltaRubric on a benchmark with deliberately subtle visual discrepancies that checklists might miss would show whether accuracy gains hold or if performance reverts to baseline levels.

Figures

Figures reproduced from arXiv: 2605.09269 by Dian Yu, Haitao Mi, Leoweiliang, Pratap Tokekar, Rui Liu, Runpeng Dai, Tong Zheng, Yucheng Shi, Zhenwen Liang.

Figure 1: Overview of DeltaRubric. Given an input tuple (I, q, yA, yB), a single shared MLLM operates in two sequential roles. As a Disagreement Planner, the model analyzes the candidate responses to identify critical factual divergences and generates a neutral, instance-specific verification checklist. Example generated checklists are shown in the figure.
Figure 2: Planner and Verifier training dynamics. (a) and (b) compare the Verifier's training and validation accuracy (evaluated every 5 steps) against a no-rubric baseline. While both approaches improve, DeltaRubric overall achieves a higher mean accuracy in judging final responses. (c) tracks the Planner probe accuracy, an intermediate proxy measuring the fraction of sampled checklists that successfully guide a lig…
Figure 3: Qualitative comparison of evaluation methods. While the standard no-rubric baseline misses the hallucinated "cars" in Response A, DeltaRubric generates a targeted disagreement checklist. This explicitly enforces visual verification, allowing the model to successfully catch the hallucination and correctly select Response B.
Figure 4: Qualitative comparison of evaluation methods. The standard no-rubric baseline fails to verify fine-grained visual details and incorrectly validates the hallucinated "white shoes" in Response B. In contrast, DeltaRubric generates a targeted checklist that explicitly isolates the conflicting shoe color. By enforcing active visual verification, DeltaRubric successfully catches the hallucination and correctly …
Figure 5: Qualitative comparison of evaluation methods. The standard no-rubric baseline exhibits severe logical inconsistency: it correctly notes the absence of a tree branch but unaccountably still chooses Response A. In contrast, DeltaRubric generates a targeted checklist that systematically verifies the visual evidence (e.g., granular food in an open palm, not a branch), strictly enforcing logical consistency and…
Figure 6: Qualitative comparison of evaluation methods. The standard no-rubric baseline struggles with fine-grained visual attribute binding, confusing the white tissues with the actual color of the box to incorrectly select Response B. In contrast, DeltaRubric generates a targeted checklist that explicitly isolates and verifies the "green exterior," preventing this attribute confusion and correctly selecting Respon…
original abstract

Aligning Multimodal Large Language Models (MLLMs) requires reliable reward models, yet existing single-step evaluators can suffer from lazy judging, exploiting language priors over fine-grained visual verification. While rubric-based evaluation mitigates these biases in text-only settings, extending it to multimodal tasks is bottlenecked by the complexity of visual reasoning. The critical differences between responses often depend on instance-specific visual details. Robust evaluation requires dynamically synthesizing rubrics that isolate spatial and factual discrepancies. To address this, we introduce $\textbf{DeltaRubric}$, an approach that reformulates multimodal preference evaluation as a plan-and-execute process within a single MLLM. DeltaRubric operates in two steps: acting first as a $\textit{Disagreement Planner}$, the model generates a neutral, instance-specific verification checklist. Transitioning into a $\textit{Checklist Verifier}$, it executes these self-generated checks against the image and question to produce the final grounded judgment. We formulate DeltaRubric as a multi-role reinforcement learning problem, jointly optimizing planning and verification capabilities. Validated on Qwen3-VL 4B and 8B Instruct models, DeltaRubric achieves solid empirical gains. For instance, on VL-RewardBench, it improves base model overall accuracy by $\textbf{+22.6}$ (4B) and $\textbf{+18.8}$ (8B) points, largely outperforming standard no-rubric baselines. The results demonstrate that decomposing evaluation into structured, verifiable steps leads to more reliable and generalizable multimodal reward modeling.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. DeltaRubric reformulates multimodal preference evaluation as a plan-and-execute process within a single MLLM. The model first acts as a Disagreement Planner to generate a neutral, instance-specific verification checklist and then as a Checklist Verifier to execute those checks against the image and question, producing a grounded judgment. The approach is cast as a multi-role RL problem that jointly optimizes planning and verification capabilities. Experiments on Qwen3-VL 4B and 8B Instruct models report accuracy gains of +22.6 and +18.8 points on VL-RewardBench relative to base models, outperforming standard no-rubric baselines.

Significance. If the reported gains are robustly attributable to the structured decomposition rather than RL shaping or benchmark artifacts, the work would advance generative multimodal reward modeling by mitigating lazy judging and language priors through dynamic, instance-specific rubrics. The joint optimization of planning and verification roles is a technically interesting contribution that could improve reliability in MLLM alignment pipelines.

major comments (2)
  1. [Experimental Results] The central empirical claim (accuracy gains of +22.6 / +18.8 on VL-RewardBench) is load-bearing for the paper's contribution, yet the manuscript provides no details on baseline implementations, statistical significance, controls for data leakage, or whether no-rubric baselines received equivalent RL training. Without these, it cannot be established that the plan-verify decomposition itself drives the improvement over baselines.
  2. [Method (Disagreement Planner and multi-role RL)] The Disagreement Planner is asserted to produce neutral, instance-specific checklists that isolate visual and factual discrepancies without introducing new biases. Because planning and verification are jointly optimized via multi-role RL on final preference judgments, no external constraint or independent validation ensures neutrality or completeness; checklists could be gamed to the reward signal. Ablation studies or human evaluations of checklist quality are required to support the claim that the decomposition, rather than RL optimization, produces the gains.
minor comments (1)
  1. [Method] The multi-role RL objective would benefit from an explicit equation or pseudocode clarifying how the planner and verifier roles are jointly optimized and how role-specific rewards are balanced.
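
For illustration only, the kind of sketch this comment asks for might look like the following: a single policy samples a checklist (planner role) and then a judgment (verifier role), both segments share a terminal correctness reward, and advantages are computed group-relative in the style of GRPO. The reward terms, the `w_plan` weight, and the baseline choice are assumptions, not the paper's stated objective.

```python
import statistics

# Hedged sketch: joint reward and group-relative advantages for one batch of
# sampled (checklist, judgment) rollouts from the same shared policy.

def trajectory_reward(judgment, gold_label, checklist, w_plan=0.2):
    r_outcome = 1.0 if judgment == gold_label else 0.0  # verifier correctness
    r_plan = 1.0 if checklist.strip() else 0.0          # e.g., a well-formed, non-empty checklist
    return r_outcome + w_plan * r_plan

def group_relative_advantages(rewards):
    # GRPO-style baseline: normalize rewards within the sampled group so the
    # planner and verifier tokens of above-average rollouts are jointly reinforced.
    mu = statistics.mean(rewards)
    sigma = statistics.pstdev(rewards) or 1.0
    return [(r - mu) / sigma for r in rewards]

# Example: rewards for several rollouts of one (image, question, yA, yB) instance.
# advs = group_relative_advantages([trajectory_reward(j, "B", c)
#                                   for j, c in sampled_rollouts])
```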

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their detailed and insightful comments on our manuscript. We address each of the major concerns below and outline the revisions we will make to strengthen the paper.

point-by-point responses
  1. Referee: [Experimental Results] The central empirical claim (accuracy gains of +22.6 / +18.8 on VL-RewardBench) is load-bearing for the paper's contribution, yet the manuscript provides no details on baseline implementations, statistical significance, controls for data leakage, or whether no-rubric baselines received equivalent RL training. Without these, it cannot be established that the plan-verify decomposition itself drives the improvement over baselines.

    Authors: We agree that additional experimental details are necessary to substantiate our claims. In the revised manuscript, we will expand the experimental section to include: detailed specifications of all baseline implementations; results of statistical significance tests (such as bootstrap resampling or McNemar's test) with reported p-values; explicit verification that the VL-RewardBench splits used do not overlap with training data to rule out leakage; and confirmation that the no-rubric baselines were trained with equivalent RL procedures, differing only in the absence of the planning role. These enhancements will allow readers to better attribute the observed gains to the structured decomposition. revision: yes

  2. Referee: [Method (Disagreement Planner and multi-role RL)] The Disagreement Planner is asserted to produce neutral, instance-specific checklists that isolate visual and factual discrepancies without introducing new biases. Because planning and verification are jointly optimized via multi-role RL on final preference judgments, no external constraint or independent validation ensures neutrality or completeness; checklists could be gamed to the reward signal. Ablation studies or human evaluations of checklist quality are required to support the claim that the decomposition, rather than RL optimization, produces the gains.

    Authors: We recognize the importance of validating the quality and neutrality of the generated checklists independently of the final reward signal. While the joint optimization encourages checklists that improve judgment accuracy, we will incorporate ablation experiments that isolate the planning component (e.g., comparing to a verifier-only RL setup) and conduct human evaluations on a subset of checklists to assess their neutrality, completeness, and lack of bias. These additions will provide direct evidence supporting the value of the plan-verify decomposition beyond RL effects alone. revision: yes
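
On point 1, a paired significance test of the kind mentioned above (McNemar's test on per-example correctness of DeltaRubric versus the no-rubric baseline) can be computed with nothing beyond the standard library; the correctness vectors here are hypothetical inputs, not reported data.

```python
from math import comb

def mcnemar_exact(correct_a, correct_b):
    """Two-sided exact McNemar p-value for paired per-example correctness
    of two evaluators judged on the same benchmark items."""
    b = sum(1 for a, c in zip(correct_a, correct_b) if a and not c)   # A right, B wrong
    c_ = sum(1 for a, c in zip(correct_a, correct_b) if not a and c)  # B right, A wrong
    n = b + c_
    if n == 0:
        return 1.0
    k = min(b, c_)
    # Under H0 the n discordant pairs split 50/50; sum the lower tail and double it.
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)

# Hypothetical usage: 1 = correct preference, 0 = incorrect, aligned per item.
# p_value = mcnemar_exact(delta_rubric_correct, no_rubric_correct)
```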
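On point 2, a small human study of checklist quality could be summarized as simple per-criterion rates; the binary criteria (`neutral`, `complete`) and the label format are assumptions made for illustration, not the authors' protocol.

```python
from collections import defaultdict

def summarize_checklist_eval(labels, criteria=("neutral", "complete")):
    # labels: one dict per (checklist, annotator) pair, e.g.
    # {"checklist_id": 3, "neutral": 1, "complete": 0}
    totals = defaultdict(list)
    for row in labels:
        for crit in criteria:
            totals[crit].append(row[crit])
    return {crit: sum(v) / len(v) for crit, v in totals.items()}

# Hypothetical usage:
# summarize_checklist_eval([
#     {"checklist_id": 0, "neutral": 1, "complete": 1},
#     {"checklist_id": 1, "neutral": 1, "complete": 0},
# ])  # -> {"neutral": 1.0, "complete": 0.5}
```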

Circularity Check

0 steps flagged

No circularity: empirical RL method with independent benchmark validation

full rationale

The paper describes DeltaRubric as a two-step plan-and-execute process inside a single MLLM, formulated as multi-role RL and evaluated empirically on VL-RewardBench and other benchmarks. No equations, predictions, or first-principles derivations are presented that reduce to fitted inputs or self-definitions by construction. The central claims rest on reported accuracy gains (+22.6 / +18.8 points) over no-rubric baselines rather than on any self-referential theorem or ansatz. Self-citations, if present, are not load-bearing for the method's validity, and the RL optimization is described as standard joint training without circular reduction to the target judgments.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The central claim rests on the assumption that MLLMs can reliably generate neutral, instance-specific checklists and that RL can jointly optimize the two roles; limited information from the abstract prevents full enumeration of parameters or entities.

free parameters (1)
  • RL role weights and hyperparameters
    Multi-role reinforcement learning for planning and verification likely requires tuned weights and learning rates fitted to achieve the reported gains.
axioms (1)
  • domain assumption: Self-generated checklists can isolate spatial and factual discrepancies without bias
    Invoked in the Disagreement Planner step to enable grounded verification.

pith-pipeline@v0.9.0 · 5607 in / 1274 out tokens · 31911 ms · 2026-05-12T04:38:14.580532+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

60 extracted references · 60 canonical work pages · 12 internal anchors

  1. [1]

    Qwen3-VL Technical Report

    Shuai Bai, Yuxuan Cai, Ruizhe Chen, Keqin Chen, Xionghui Chen, Zesen Cheng, Lianghao Deng, Wei Ding, Chang Gao, Chunjiang Ge, et al. Qwen3-vl technical report.arXiv preprint arXiv:2511.21631, 2025

  2. [2]

    Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback

    Yuntao Bai, Andy Jones, Kamal Ndousse, Amanda Askell, Anna Chen, Nova DasSarma, Dawn Drain, Stanislav Fort, Deep Ganguli, Tom Henighan, et al. Training a helpful and harmless assistant with reinforcement learning from human feedback.arXiv preprint arXiv:2204.05862, 2022

  3. [3]

    Mllm-as-a-judge: Assessing multimodal llm-as-a-judge with vision-language benchmark

    Dongping Chen, Ruoxi Chen, Shilin Zhang, Yaochen Wang, Yinuo Liu, Huichi Zhou, Qihui Zhang, Yao Wan, Pan Zhou, and Lichao Sun. Mllm-as-a-judge: Assessing multimodal llm-as-a-judge with vision-language benchmark. In Forty-first International Conference on Machine Learning, 2024

  4. [4]

    Internvl: Scaling up vision foundation models and aligning for generic visual-linguistic tasks

    Zhe Chen, Jiannan Wu, Wenhai Wang, Weijie Su, Guo Chen, Sen Xing, Muyan Zhong, Qinglong Zhang, Xizhou Zhu, Lewei Lu, et al. Internvl: Scaling up vision foundation models and aligning for generic visual-linguistic tasks. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 24185–24198, 2024

  5. [5]

    Cde: Curiosity-driven exploration for efficient reinforcement learning in large language models

    Runpeng Dai, Linfeng Song, Haolin Liu, Zhenwen Liang, Dian Yu, Haitao Mi, Zhaopeng Tu, Rui Liu, Tong Zheng, Hongtu Zhu, et al. Cde: Curiosity-driven exploration for efficient reinforcement learning in large language models.arXiv preprint arXiv:2509.09675, 2025

  6. [6]

    NVLM: Open frontier-class multimodal LLMs

    Wenliang Dai, Nayeon Lee, Boxin Wang, Zhuolin Yang, Zihan Liu, Jon Barker, Tuomas Rintamaki, Mohammad Shoeybi, Bryan Catanzaro, and Wei Ping. Nvlm: Open frontier-class multimodal llms.arXiv preprint arXiv:2409.11402, 2024

  7. [7]

    Deepseek-v4: Towards highly efficient million-token context intelligence

    DeepSeek-AI. Deepseek-v4: Towards highly efficient million-token context intelligence, 2026

  8. [8]

    Molmo and pixmo: Open weights and open data for state-of-the-art vision-language models

    Matt Deitke, Christopher Clark, Sangho Lee, Rohun Tripathi, Yue Yang, Jae Sung Park, Mohammadreza Salehi, Niklas Muennighoff, Kyle Lo, Luca Soldaini, et al. Molmo and pixmo: Open weights and open data for state-of-the-art vision-language models. InProceedings of the Computer Vision and Pattern Recognition Conference, pages 91–104, 2025

  9. [9]

    Arm-thinker: Reinforcing multimodal generative reward models with agentic tool use and visual reasoning

    Shengyuan Ding, Xinyu Fang, Ziyu Liu, Yuhang Zang, Yuhang Cao, Xiangyu Zhao, Haodong Duan, Xiaoyi Dong, Jianze Liang, Bin Wang, et al. Arm-thinker: Reinforcing multimodal generative reward models with agentic tool use and visual reasoning. arXiv preprint arXiv:2512.05111, 2025

  10. [10]

    Vita-1.5: Towards gpt-4o level real-time vision and speech interaction

    Chaoyou Fu, Haojia Lin, Xiong Wang, Yi-Fan Zhang, Yunhang Shen, Xiaoyu Liu, Haoyu Cao, Zuwei Long, Heting Gao, Ke Li, et al. Vita-1.5: Towards gpt-4o level real-time vision and speech interaction.arXiv preprint arXiv:2501.01957, 2025

  11. [11]

    The Llama 3 Herd of Models

    Aaron Grattafiori, Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Alex Vaughan, et al. The llama 3 herd of models.arXiv preprint arXiv:2407.21783, 2024

  12. [12]

    Rubrics as Rewards: Reinforcement Learning Beyond Verifiable Domains

    Anisha Gunjal, Anthony Wang, Elaine Lau, Vaskar Nath, Yunzhong He, Bing Liu, and Sean Hendryx. Rubrics as rewards: Reinforcement learning beyond verifiable domains.arXiv preprint arXiv:2507.17746, 2025

  13. [13]

    DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning

    Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Peiyi Wang, Qihao Zhu, Runxin Xu, Ruoyu Zhang, Shirong Ma, Xiao Bi, et al. Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning.arXiv preprint arXiv:2501.12948, 2025

  14. [14]

    Opera: Alleviating hallucination in multi-modal large language models via over-trust penalty and retrospection-allocation

    Qidong Huang, Xiaoyi Dong, Pan Zhang, Bin Wang, Conghui He, Jiaqi Wang, Dahua Lin, Weiming Zhang, and Nenghai Yu. Opera: Alleviating hallucination in multi-modal large language models via over-trust penalty and retrospection-allocation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 13418–13427, 2024

  15. [15]

    Reinforcement learning with rubric anchors

    Zenan Huang, Yihong Zhuang, Guoshan Lu, Zeyu Qin, Haokai Xu, Tianyu Zhao, Ru Peng, Jiaqi Hu, Zhanming Shen, Xiaomeng Hu, et al. Reinforcement learning with rubric anchors. arXiv preprint arXiv:2508.12790, 2025

  16. [16]

    AutoRubric: Rubric-Based Generative Rewards for Faithful Multimodal Reasoning

    Mengzhao Jia, Zhihan Zhang, Ignacio Cases, Zheyuan Liu, Meng Jiang, and Peng Qi. Autorubric-r1v: Rubric-based generative rewards for faithful multimodal reasoning.arXiv preprint arXiv:2510.14738, 2025

  17. [17]

    Prometheus: Inducing fine-grained evaluation capability in language models

    Seungone Kim, Jamin Shin, Yejin Cho, Joel Jang, Shayne Longpre, Hwaran Lee, Sangdoo Yun, Seongjin Shin, Sungdong Kim, James Thorne, et al. Prometheus: Inducing fine-grained evaluation capability in language models. InThe Twelfth International Conference on Learning Representations, 2023

  18. [18]

    Reinforcement Learning from Human Feedback

    Nathan Lambert. Reinforcement learning from human feedback.arXiv preprint arXiv:2504.12501, 2025

  19. [19]

    Tulu 3: Pushing Frontiers in Open Language Model Post-Training

    Nathan Lambert, Jacob Morrison, Valentina Pyatkin, Shengyi Huang, Hamish Ivison, Faeze Brahman, Lester James V Miranda, Alisa Liu, Nouha Dziri, Shane Lyu, et al. Tulu 3: Pushing frontiers in open language model post-training.arXiv preprint arXiv:2411.15124, 2024

  20. [20]

    Rewardbench: Evaluating reward models for language modeling

    Nathan Lambert, Valentina Pyatkin, Jacob Morrison, Lester James Validad Miranda, Bill Yuchen Lin, Khyathi Chandu, Nouha Dziri, Sachin Kumar, Tom Zick, Yejin Choi, et al. Rewardbench: Evaluating reward models for language modeling. In Findings of the Association for Computational Linguistics: NAACL 2025, pages 1755–1797, 2025

  21. [21]

    Vl-rewardbench: a challenging benchmark for vision-language generative reward models

    Lei Li, Yuancheng Wei, Zhihui Xie, Xuqing Yang, Yifan Song, Peiyi Wang, Chenxin An, Tianyu Liu, Sujian Li, Bill Yuchen Lin, et al. Vl-rewardbench: a challenging benchmark for vision-language generative reward models. InProceedings of the Computer Vision and Pattern Recognition Conference, pages 24657–24668, 2025

  22. [22]

    Self-Rewarding Vision-Language Model via Reasoning Decomposition

    Zongxia Li, Wenhao Yu, Chengsong Huang, Rui Liu, Zhenwen Liang, Fuxiao Liu, Jingxi Che, Dian Yu, Jordan Boyd-Graber, Haitao Mi, et al. Self-rewarding vision-language model via reasoning decomposition.arXiv preprint arXiv:2508.19652, 2025

  23. [23]

    Stable and efficient single-rollout rl for multimodal reasoning

    Rui Liu, Dian Yu, Lei Ke, Haolin Liu, Yujun Zhou, Zhenwen Liang, Haitao Mi, Pratap Tokekar, and Dong Yu. Stable and efficient single-rollout rl for multimodal reasoning.arXiv preprint arXiv:2512.18215, 2025

  24. [24]

    Vogue: Guiding exploration with visual uncertainty improves multimodal reasoning

    Rui Liu, Dian Yu, Tong Zheng, Runpeng Dai, Zongxia Li, Wenhao Yu, Zhenwen Liang, Linfeng Song, Haitao Mi, Pratap Tokekar, et al. Vogue: Guiding exploration with visual uncertainty improves multimodal reasoning. arXiv preprint arXiv:2510.01444, 2025

  25. [25]

    Openrubrics: Towards scalable synthetic rubric generation for reward modeling and llm alignment

    Tianci Liu, Ran Xu, Tony Yu, Ilgee Hong, Carl Yang, Tuo Zhao, and Haoyu Wang. Openrubrics: Towards scalable synthetic rubric generation for reward modeling and llm alignment.arXiv preprint arXiv:2510.07743, 2025

  26. [26]

    Decoupled weight decay regularization

    Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization. In International Conference on Learning Representations, 2019. URL https://openreview.net/forum?id=Bkg6RiCqY7

  27. [27]

    Training language models to follow instructions with human feedback

    Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, et al. Training language models to follow instructions with human feedback.Advances in neural information processing systems, 35:27730–27744, 2022

  28. [28]

    Direct preference optimization: Your language model is secretly a reward model

    Rafael Rafailov, Archit Sharma, Eric Mitchell, Christopher D Manning, Stefano Ermon, and Chelsea Finn. Direct preference optimization: Your language model is secretly a reward model. Advances in neural information processing systems, 36:53728–53741, 2023

  29. [29]

    Self-critiquing models for assisting human evaluators

    William Saunders, Catherine Yeh, Jeff Wu, Steven Bills, Long Ouyang, Jonathan Ward, and Jan Leike. Self-critiquing models for assisting human evaluators. arXiv preprint arXiv:2206.05802, 2022

  30. [30]

    Dr tulu: Reinforcement learning with evolving rubrics for deep research

    Rulin Shao, Akari Asai, Shannon Zejiang Shen, Hamish Ivison, Varsha Kishore, Jingming Zhuo, Xinran Zhao, Molly Park, Samuel G Finlayson, David Sontag, et al. Dr tulu: Reinforcement learning with evolving rubrics for deep research.arXiv preprint arXiv:2511.19399, 2025

  31. [31]

    DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models

    Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, YK Li, Yang Wu, et al. Deepseekmath: Pushing the limits of mathematical reasoning in open language models.arXiv preprint arXiv:2402.03300, 2024

  32. [32]

    Reinforcing chain-of-thought reasoning with self-evolving rubrics

    Leheng Sheng, Wenchang Ma, Ruixin Hong, Xiang Wang, An Zhang, and Tat-Seng Chua. Reinforcing chain-of-thought reasoning with self-evolving rubrics. arXiv preprint arXiv:2602.10885, 2026

  33. [33]

    A long way to go: Investigating length correlations in RLHF

    Prasann Singhal, Tanya Goyal, Jiacheng Xu, and Greg Durrett. A long way to go: Investigating length correlations in rlhf.arXiv preprint arXiv:2310.03716, 2023

  34. [34]

    Aligning large multimodal models with factually augmented rlhf

    Zhiqing Sun, Sheng Shen, Shengcao Cao, Haotian Liu, Chunyuan Li, Yikang Shen, Chuang Gan, Liangyan Gui, Yu-Xiong Wang, Yiming Yang, et al. Aligning large multimodal models with factually augmented rlhf. InFindings of the Association for Computational Linguistics: ACL 2024, pages 13088–13110, 2024

  35. [35]

    ReFT: Reasoning with reinforced fine-tuning

    Luong Trung, Xinbo Zhang, Zhanming Jie, Peng Sun, Xiaoran Jin, and Hang Li. ReFT: Reasoning with reinforced fine-tuning. In Lun-Wei Ku, Andre Martins, and Vivek Srikumar, editors, Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 7601–7614, Bangkok, Thailand, August

  36. [36]

    doi: 10.18653/v1/2024.acl-long.410

    Association for Computational Linguistics. doi: 10.18653/v1/2024.acl-long.410. URL https://aclanthology.org/2024.acl-long.410/

  37. [37]

    Checklists are better than reward models for aligning language models

    Vijay Viswanathan, Yanchao Sun, Shuang Ma, Xiang Kong, Meng Cao, Graham Neubig, and Tongshuang Wu. Checklists are better than reward models for aligning language models.arXiv preprint arXiv:2507.18624, 2025

  38. [38]

    Msrl: Scaling generative multimodal reward modeling via multi-stage reinforcement learning

    Chenglong Wang, Yifu Huo, Yang Gan, Qiaozhi He, Qi Meng, Bei Li, Yan Wang, Junfu Liu, Tianhua Zhou, Jingbo Zhu, et al. Msrl: Scaling generative multimodal reward modeling via multi-stage reinforcement learning.arXiv preprint arXiv:2603.25108, 2026

  39. [39]

    Unified multimodal chain-of-thought reward model through reinforcement fine-tuning

    Yibin Wang, Zhimin Li, Yuhang Zang, Chunyu Wang, Qinglin Lu, Cheng Jin, and Jiaqi Wang. Unified multimodal chain-of-thought reward model through reinforcement fine-tuning.arXiv preprint arXiv:2505.03318, 2025

  40. [40]

    Chain-of-thought prompting elicits reasoning in large language models

    Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Fei Xia, Ed Chi, Quoc V Le, Denny Zhou, et al. Chain-of-thought prompting elicits reasoning in large language models. Advances in neural information processing systems, 35:24824–24837, 2022

  41. [41]

    Llava-critic: Learning to evaluate multimodal models

    Tianyi Xiong, Xiyao Wang, Dong Guo, Qinghao Ye, Haoqi Fan, Quanquan Gu, Heng Huang, and Chunyuan Li. Llava-critic: Learning to evaluate multimodal models. InProceedings of the Computer Vision and Pattern Recognition Conference, pages 13618–13628, 2025

  42. [42]

    Alternating reinforcement learning for rubric-based reward modeling in non-verifiable llm post-training

    Ran Xu, Tianci Liu, Zihan Dong, Tony Yu, Ilgee Hong, Carl Yang, Linjun Zhang, Tao Zhao, and Haoyu Wang. Alternating reinforcement learning for rubric-based reward modeling in non-verifiable llm post-training.arXiv preprint arXiv:2602.01511, 2026

  43. [43]

    Multimodal rewardbench: Holistic evaluation of reward models for vision language models

    Michihiro Yasunaga, Luke Zettlemoyer, and Marjan Ghazvininejad. Multimodal rewardbench: Holistic evaluation of reward models for vision language models. arXiv preprint arXiv:2502.14191, 2025

  44. [44]

    DAPO: An Open-Source LLM Reinforcement Learning System at Scale

    Qiying Yu, Zheng Zhang, Ruofei Zhu, Yufeng Yuan, Xiaochen Zuo, Yu Yue, Weinan Dai, Tiantian Fan, Gaohong Liu, Lingjun Liu, et al. Dapo: An open-source llm reinforcement learning system at scale.arXiv preprint arXiv:2503.14476, 2025

  45. [45]

    Rlhf-v: Towards trustworthy mllms via behavior alignment from fine-grained correctional human feedback

    Tianyu Yu, Yuan Yao, Haoye Zhang, Taiwen He, Yifeng Han, Ganqu Cui, Jinyi Hu, Zhiyuan Liu, Hai-Tao Zheng, Maosong Sun, et al. Rlhf-v: Towards trustworthy mllms via behavior alignment from fine-grained correctional human feedback. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 13807–13816, 2024

  46. [46]

    Rlaif-v: Open-source ai feedback leads to super gpt-4v trustworthiness

    Tianyu Yu, Haoye Zhang, Qiming Li, Qixin Xu, Yuan Yao, Da Chen, Xiaoman Lu, Ganqu Cui, Yunkai Dang, Taiwen He, et al. Rlaif-v: Open-source ai feedback leads to super gpt-4v trustworthiness. InProceedings of the Computer Vision and Pattern Recognition Conference, pages 19985–19995, 2025

  47. [47]

    Benchmarking large multimodal models against common corruptions

    Jiawei Zhang, Tianyu Pang, Chao Du, Yi Ren, Bo Li, and Min Lin. Benchmarking large multimodal models against common corruptions.arXiv preprint arXiv:2401.11943, 2024

  48. [48]

    R1-reward: Training multimodal reward model through stable reinforcement learning

    Yi-Fan Zhang, Xingyu Lu, Xiao Hu, Chaoyou Fu, Bin Wen, Tianke Zhang, Changyi Liu, Kaiyu Jiang, Kaibing Chen, Kaiyu Tang, et al. R1-reward: Training multimodal reward model through stable reinforcement learning.arXiv preprint arXiv:2505.02835, 2025

  49. [49]

    Mm-rlhf: The next step forward in multimodal llm alignment

    Yi-Fan Zhang, Tao Yu, Haochen Tian, Chaoyou Fu, Peiyan Li, Jianshu Zeng, Wulin Xie, Yang Shi, Huanyu Zhang, Junkang Wu, et al. Mm-rlhf: The next step forward in multimodal llm alignment.arXiv preprint arXiv:2502.10391, 2025

  50. [50]

    Basereward: A strong baseline for multimodal reward model

    YiFan Zhang, Haihua Yang, Huanyu Zhang, Yang Shi, Zezhou Chen, Haochen Tian, Chaoyou Fu, Kai WU, Bo Cui, Xu Wang, Jianfei Pan, Haotian Wang, Zhang Zhang, and Liang Wang. Basereward: A strong baseline for multimodal reward model. In The Fourteenth International Conference on Learning Representations, 2026. URL https://openreview.net/forum?id=EuN5iszF0a

  51. [51]

    Judging llm-as-a-judge with mt-bench and chatbot arena

    Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Siyuan Zhuang, Zhanghao Wu, Yonghao Zhuang, Zi Lin, Zhuohan Li, Dacheng Li, Eric Xing, et al. Judging llm-as-a-judge with mt-bench and chatbot arena.Advances in neural information processing systems, 36:46595–46623, 2023

  52. [52]

    Parallel-r1: Towards parallel thinking via reinforcement learning

    Tong Zheng, Hongming Zhang, Wenhao Yu, Xiaoyang Wang, Runpeng Dai, Rui Liu, Huiwen Bao, Chengsong Huang, Heng Huang, and Dong Yu. Parallel-r1: Towards parallel thinking via reinforcement learning.arXiv preprint arXiv:2509.07980, 2025

  53. [53]

    Easyr1: An efficient, scalable, multi-modality rl training framework

    Yaowei Zheng, Junting Lu, Shenzhi Wang, Zhangchi Feng, Dongdong Kuang, Yuwen Xiong, and Richong Zhang. Easyr1: An efficient, scalable, multi-modality rl training framework. https://github.com/hiyouga/EasyR1, 2025

  54. [54]

    A comprehensive survey of reward models: Taxonomy, applications, challenges, and future

    Jialun Zhong, Wei Shen, Yanzeng Li, Songyang Gao, Hua Lu, Yicheng Chen, Yang Zhang, Wei Zhou, Jinjie Gu, and Lei Zou. A comprehensive survey of reward models: Taxonomy, applications, challenges, and future.arXiv preprint arXiv:2504.12328, 2025

  55. [55]

    InternVL3: Exploring Advanced Training and Test-Time Recipes for Open-Source Multimodal Models

    Jinguo Zhu, Weiyun Wang, Zhe Chen, Zhaoyang Liu, Shenglong Ye, Lixin Gu, Hao Tian, Yuchen Duan, Weijie Su, Jie Shao, et al. Internvl3: Exploring advanced training and test-time recipes for open-source multimodal models. arXiv preprint arXiv:2504.10479, 2025

  56. [56]

    Directly answers the question using the information relevant to the image

  57. [57]

    Makes factual claims that are consistent with the image and avoids unsupported details

  58. [58]

    Correctly identifies important visual information when it matters for the question

  59. [59]

    Uses sound reasoning and logical inference where needed

  60. [60]

    DeltaRubric Evaluation Prompt You are a fair judge

    Gives a clear and complete answer. DeltaRubric Evaluation Prompt You are a fair judge. Decide which response better answers the question below based on the image. Use the verification checklist as a sequence of checks to execute, not as evidence. If a checklist item is irrelevant, too vague, or contradicted by the image/question, ignore that item. Questio...