
arxiv: 2605.29156 · v1 · pith:CHRBFCXSnew · submitted 2026-05-27 · 💻 cs.LG · cs.CL

RUBRIC-ARROW: Alternating Pointwise Rubric Reward Modeling for LLM Post-training in Non-verifiable Domains

Pith reviewed 2026-06-29 13:11 UTC · model grok-4.3

classification 💻 cs.LG cs.CL
keywords reward modeling · LLM post-training · rubric-based evaluation · pairwise preferences · pointwise scoring · non-verifiable domains · GRPO optimization

The pith

RUBRIC-ARROW jointly trains a rubric generator and judge from pairwise data to produce pointwise rewards for subjective LLM tasks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents RUBRIC-ARROW as a method to build pointwise reward models for LLM post-training in domains without clear right answers. It alternates between generating explicit evaluation criteria and training a judge that scores outputs against those criteria, all optimized using only pairwise preference data. The approach adds a probability-based scoring rule to cut down on ties and uses phase-specific rewards inside an alternating GRPO loop. A sympathetic reader would care because pointwise scores are needed for stable policy training, yet most current rubric methods still lean on expensive frontier models or produce inconsistent results.

Core claim

RUBRIC-ARROW is an alternating framework that jointly trains a rubric generator and a rubric-conditioned judge. Its RL stage uses only pairwise preference data, coupling a probability-based scoring rule that reduces ties with phase-specific preference-based rewards and an alternating GRPO scheme that together train the pointwise evaluator.

What carries the argument

Alternating GRPO optimization between a rubric generator and a rubric-conditioned judge, combined with probability-based scoring and phase-specific preference rewards.
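
As a rough illustration of the alternating scheme, the sketch below computes GRPO-style group-relative advantages (each reward standardized against the other samples in its group) and switches the trainable component between phases. The function names, the fixed phase length, the strict two-phase alternation, and the use of the population standard deviation are all assumptions made for this example; the text above does not specify the paper's actual schedule or normalization.

```python
from statistics import mean, pstdev

def group_advantages(rewards, eps=1e-8):
    """GRPO-style advantages: standardize rewards within one sampled group.

    Uses the population standard deviation; real implementations differ
    on this detail (assumption for illustration).
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

def alternating_schedule(num_steps, phase_len):
    """Yield which component is updated at each step (hypothetical schedule)."""
    for step in range(num_steps):
        phase = (step // phase_len) % 2
        yield "rubric_generator" if phase == 0 else "judge"

# Four sampled outputs for one prompt, with hypothetical 0/1 rewards.
advantages = group_advantages([1.0, 0.0, 0.0, 1.0])

# Two steps on the rubric generator, then two on the judge, and so on.
schedule = list(alternating_schedule(num_steps=6, phase_len=2))
```

Under these assumptions the component that is not being updated stays frozen during a phase, so each phase's reward is computed against a fixed counterpart.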

If this is right

  • The trained judge achieves competitive accuracy against existing reward models on standard benchmarks.
  • Downstream policy optimization using the resulting pointwise scores produces consistent performance gains.
  • Probability-based scoring measurably lowers the rate of tied evaluations compared with Boolean aggregation.
  • The method requires only pairwise preference data and does not need frontier LLMs at inference time.
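
The tie-reduction claim can be made concrete with a small sketch. It assumes the judge emits, for each rubric criterion, a probability that the criterion is satisfied; Boolean aggregation thresholds those probabilities before averaging, while the probability-based variant averages them directly. The threshold of 0.5, the plain mean, and the example probabilities are assumptions for illustration, not the paper's scoring rule.

```python
def boolean_score(probs, threshold=0.5):
    """Hard aggregation: fraction of criteria judged satisfied."""
    return sum(p >= threshold for p in probs) / len(probs)

def probability_score(probs):
    """Soft aggregation: mean probability that each criterion is satisfied."""
    return sum(probs) / len(probs)

# Hypothetical per-criterion "satisfied" probabilities for two responses.
response_a = [0.90, 0.80, 0.40]
response_b = [0.60, 0.55, 0.10]

# Both responses pass the same two criteria, so hard aggregation ties them.
tie_under_boolean = boolean_score(response_a) == boolean_score(response_b)

# Soft aggregation keeps the confidence gap and separates the two.
tie_under_probability = probability_score(response_a) == probability_score(response_b)
```

With only a handful of criteria, hard aggregation can take just a few distinct values, which is why ties are common in that setting.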

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same alternating loop could be applied to other preference datasets where absolute scores would help training stability.
  • If the rubric generator learns reusable criteria, the approach might transfer across related subjective tasks without retraining from scratch.
  • Testing the judge on held-out subjective domains such as creative writing or ethical reasoning would reveal whether the pointwise signals generalize beyond the training preferences.

Load-bearing premise

Jointly training the rubric generator and rubric-conditioned judge via alternating GRPO with probability-based scoring and phase-specific preference rewards will produce an effective pointwise evaluator from pairwise data alone in non-verifiable domains.
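
One way pairwise labels could supervise a pointwise scorer is sketched below: the scorer rates the chosen and rejected responses independently, and the reward is 1 only when the chosen response scores strictly higher. The function names, the handling of ties, and the example scores are hypothetical; the paper's phase-specific rewards are not described in enough detail above to reproduce them.

```python
def preference_reward(score_chosen, score_rejected, tie_reward=0.0):
    """Reward a pointwise scorer for ranking the preferred response higher.

    tie_reward is an assumed knob: 0.0 penalizes ties, 0.5 treats them as a coin flip.
    """
    if score_chosen > score_rejected:
        return 1.0
    if score_chosen == score_rejected:
        return tie_reward
    return 0.0

def pairwise_accuracy(pairs):
    """Fraction of (chosen, rejected) score pairs ranked correctly; ties count as wrong."""
    return sum(preference_reward(c, r) for c, r in pairs) / len(pairs)

# Hypothetical pointwise scores for four preference pairs.
pairs = [(0.7, 0.4), (0.5, 0.5), (0.2, 0.6), (0.9, 0.1)]
accuracy = pairwise_accuracy(pairs)  # two of four pairs are ranked correctly
```

Because ties earn no reward in this sketch, a scorer that ties often is penalized directly, which connects the premise to the tie-reduction claim.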

What would settle it

An experiment in which the alternating model shows no accuracy gain over a non-alternating baseline judge or produces more scoring ties than a simple probability threshold would predict.

Figures

Figures reproduced from arXiv: 2605.29156 by Haoxiang Jiang, Haoyu Wang, Linjun Zhang, Ran Xu, Tianci Liu, Tony Yu, Wanying Wang, Zihan Dong.

Figure 1: Best-of-N selection quality on WildBench: each method selects one response from the first N sampled candidates, and we report the mean oracle score. The Oracle curve marks the per-prompt best candidate (upper bound); shaded regions show 95% bootstrap confidence intervals over prompts. Failure to track the Oracle curve at larger N indicates reward-model overoptimization.
Figure 2: Trained-policy comparison on IFBench. Baseline results (except Rubric-RM IterDPO) are from OpenRubrics Liu et al. (2025a); two baselines are omitted for readability (see …)
Figure 3: Trained-policy comparison on Creative Writing Benchmark v3. Baseline results (except Rubric-RM) are from RuscaRL Zhou et al. (2025); two baselines are omitted for readability (see …)
Figure 4: Comparison of trained policy models on IFBench. Results of baselines except Rubric-RM (IterDPO) are from OpenRubrics Liu et al. (2025a). [Bar chart; axis: Score; methods: Qwen2.5-7B-Instruct, RaR, Rubric-based RL-S, RuscaRL, Rubric-RM (DPO), Rubric-ARROW w/o RL (DPO), RIFL (DPO), Rubric-ARROW (DPO), Rubric-RM (IterDPO), Rubric-ARROW w/o RL (IterDPO), RIFL (IterDPO), Rubric-ARROW (IterDPO)]

Abstract

Pointwise reward modeling offers critical signals for LLM post-training, yet struggles with absolute scoring in subjective, non-verifiable settings. Rubric-based methods address this by decomposing evaluation into explicit criteria, but existing approaches typically depend on frontier LLMs and suffer from ties caused by hard Boolean aggregation. We present RUBRIC-ARROW, an alternating framework that jointly trains a rubric generator and a rubric-conditioned judge, with its RL stage using only pairwise preference data. Our method couples a probability-based scoring rule that reduces ties with phase-specific preference-based rewards and an alternating GRPO scheme that together train the pointwise evaluator. Extensive experiments show that RUBRIC-ARROW achieves competitive reward-modeling accuracy and yields consistent gains for downstream policy post-training.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 0 minor

Summary. The paper introduces RUBRIC-ARROW, an alternating framework that jointly trains a rubric generator and a rubric-conditioned judge for pointwise reward modeling in non-verifiable domains. Training uses only pairwise preference data via an alternating GRPO scheme, a probability-based scoring rule to reduce ties, and phase-specific preference rewards. The central claims are that this yields competitive reward-modeling accuracy and consistent gains for downstream policy post-training.

Significance. If the empirical results hold, the work would offer a practical route to pointwise evaluators from pairwise data without frontier LLMs, addressing ties in rubric aggregation for subjective domains. The alternating optimization and probability-based scoring could strengthen reward signals for LLM post-training where absolute scoring is difficult.

major comments (1)
  1. [Abstract] The manuscript consists solely of the abstract and states that 'extensive experiments' demonstrate competitive accuracy and downstream gains, yet provides no datasets, baselines, metrics, ablation studies, or quantitative results. This absence is load-bearing for the central empirical claims and prevents any assessment of whether the alternating GRPO procedure or probability-based scoring delivers the reported benefits.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for their review. We address the major comment below point by point.

  1. Referee: [Abstract] The manuscript consists solely of the abstract and states that 'extensive experiments' demonstrate competitive accuracy and downstream gains, yet provides no datasets, baselines, metrics, ablation studies, or quantitative results. This absence is load-bearing for the central empirical claims and prevents any assessment of whether the alternating GRPO procedure or probability-based scoring delivers the reported benefits.

    Authors: We agree that the manuscript text provided for review consists only of the abstract and contains no datasets, baselines, metrics, ablation studies, or quantitative results. This omission prevents evaluation of the central claims regarding the alternating GRPO scheme and probability-based scoring. The full arXiv version is intended to contain these elements, but to resolve the issue we will revise the submission to include all experimental details, specific datasets, baselines, metrics, ablations on the alternating optimization and scoring rule, and the reported quantitative results. revision: yes

Circularity Check

0 steps flagged

No significant circularity; derivation relies on external pairwise data

The abstract and method description indicate training from external pairwise preference data using alternating GRPO, probability-based scoring, and phase-specific rewards to produce pointwise scores. No equations, self-citations, or fitted parameters are shown reducing the output to inputs by construction. The approach is presented as using independent data sources rather than self-referential definitions or predictions.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Only the abstract is available, so no concrete free parameters, axioms, or invented entities can be extracted; the ledger remains empty.

Reference graph

Works this paper leans on

36 extracted references · 24 canonical work pages · 10 internal anchors

  1. [1]

    Ultraif: Advancing instruction following from the wild

    Kaikai An, Li Sheng, Ganqu Cui, Shuzheng Si, Ning Ding, Yu Cheng, and Baobao Chang. Ultraif: Advancing instruction following from the wild. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pages 18722–18737,

  2. [2]

    Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback

    Yuntao Bai, Andy Jones, Kamal Ndousse, Amanda Askell, Anna Chen, Nova DasSarma, Dawn Drain, Stanislav Fort, Deep Ganguli, Tom Henighan, et al. Training a helpful and harmless assistant with reinforcement learning from human feedback. arXiv preprint arXiv:2204.05862,

  3. [3]

    Judgelrm: Large reasoning models as a judge

    Nuo Chen, Zhiyuan Hu, Qingyun Zou, Jiaying Wu, Qian Wang, Bryan Hooi, and Bingsheng He. Judgelrm: Large reasoning models as a judge. arXiv preprint arXiv:2504.00050, 2025a. Xiusi Chen, Gaotang Li, Ziqi Wang, Bowen Jin, Cheng Qian, Yu Wang, Hongru Wang, Yu Zhang, Denghui Zhang, Tong Zhang, et al. Rm-r1: Reward modelin...

  4. [4]

    Self-play with execution feedback: Improving instruction-following capabilities of large language models

    Guanting Dong, Keming Lu, Chengpeng Li, Tingyu Xia, Bowen Yu, Chang Zhou, and Jingren Zhou. Self-play with execution feedback: Improving instruction-following capabilities of large language models. In International Conference on Learning Representations, volume 2025, pages 39286–39313,

  5. [5]

    Length-Controlled AlpacaEval: A Simple Way to Debias Automatic Evaluators

    Yann Dubois, Balázs Galambosi, Percy Liang, and Tatsunori B Hashimoto. Length-controlled alpacaeval: A simple way to debias automatic evaluators. arXiv preprint arXiv:2404.04475,

  6. [6]

    How to evaluate reward models for rlhf

    Evan Frick, Tianle Li, Connor Chen, Wei-Lin Chiang, Anastasios Angelopoulos, Jiantao Jiao, Banghua Zhu, Joseph E Gonzalez, and Ion Stoica. How to evaluate reward models for rlhf. In International Conference on Learning Representations, volume 2025, pages 18128–18163,

  7. [7]

    Rubrics as Rewards: Reinforcement Learning Beyond Verifiable Domains

    Anisha Gunjal, Anthony Wang, Elaine Lau, Vaskar Nath, Yunzhong He, Bing Liu, and Sean Hendryx. Rubrics as rewards: Reinforcement learning beyond verifiable domains. arXiv preprint arXiv:2507.17746,

  8. [8]

    Advancedif: Rubric-based benchmarking and reinforcement learning for advancing llm instruction following

    Yun He, Wenzhe Li, Hejia Zhang, Songlin Li, Karishma Mandyam, Sopan Khosla, Yuanhao Xiong, Nanshu Wang, Xiaoliang Peng, Beibin Li, et al. Advancedif: Rubric-based benchmarking and reinforcement learning for advancing llm instruction following. arXiv preprint arXiv:2511.10507,

  9. [9]

    Open rubric system: Scaling reinforcement learning with pairwise adaptive rubric

    Ruipeng Jia, Yunyi Yang, Yuxin Wu, Yongbo Gai, Siyuan Tao, Mengyu Zhou, Jianhe Lin, Xiaoxi Jiang, and Guanjun Jiang. Open rubric system: Scaling reinforcement learning with pairwise adaptive rubric. arXiv preprint arXiv:2602.14069,

  10. [10]

    Rewardbench: Evaluating reward models for language modeling

    Nathan Lambert, Valentina Pyatkin, Jacob Morrison, LJ Miranda, Bill Yuchen Lin, Khyathi Chandu, Nouha Dziri, Sachin Kumar, Tom Zick, Yejin Choi, et al. Rewardbench: Evaluating reward models for language modeling. In Findings of the Association for Computational Linguistics: NAACL 2025, pages 1755–1797,

  11. [11]

    Checkeval: A reliable llm-as-a-judge framework for evaluating text generation using checklists

    Yukyung Lee, Joonghoon Kim, Jaehee Kim, Hyowon Cho, Jaewook Kang, Pilsung Kang, and Najoung Kim. Checkeval: A reliable llm-as-a-judge framework for evaluating text generation using checklists. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pages 15782–15809,

  12. [12]

    From generation to judgment: Opportunities and challenges of llm-as-a-judge

    Dawei Li, Bohan Jiang, Liangjie Huang, Alimohammad Beigi, Chengshuai Zhao, Zhen Tan, Amrita Bhattacharjee, Yuxuan Jiang, Canyu Chen, Tianhao Wu, et al. From generation to judgment: Opportunities and challenges of llm-as-a-judge. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pages 2757–2791,

  13. [13]

    LLMs-as-Judges: A Comprehensive Survey on LLM-based Evaluation Methods

    Haitao Li, Qian Dong, Junjie Chen, Huixue Su, Yujia Zhou, Qingyao Ai, Ziyi Ye, and Yiqun Liu. Llms-as-judges: a comprehensive survey on llm-based evaluation methods. arXiv preprint arXiv:2412.05579, 2024a. Sunzhu Li, Jiale Zhao, Miteto Wei, Huimin Ren, Yang Zhou, Jingwen Yang, Shunyu Liu, Kaike Zhang, and Wei Chen. Rubrichub: A comprehensive and highly di...

  14. [14]

    From Crowdsourced Data to High-Quality Benchmarks: Arena-Hard and BenchBuilder Pipeline

    Tianle Li, Wei-Lin Chiang, Evan Frick, Lisa Dunlap, Tianhao Wu, Banghua Zhu, Joseph E Gonzalez, and Ion Stoica. From crowdsourced data to high-quality benchmarks: Arena-hard and benchbuilder pipeline. arXiv preprint arXiv:2406.11939, 2024b. Bill Yuchen Lin, Yuntian Deng, Khyathi Chandu, Abhilasha Ravichander, Valentina Pyatkin, Nouha Dziri, Ronan Le Bras, ...

  15. [15]

    Skywork-Reward: Bag of Tricks for Reward Modeling in LLMs

    Chris Yuhao Liu, Liang Zeng, Jiacai Liu, Rui Yan, Jujie He, Chaojie Wang, Shuicheng Yan, Yang Liu, and Yahui Zhou. Skywork-reward: Bag of tricks for reward modeling in llms. arXiv preprint arXiv:2410.18451,

  16. [16]

    Openrubrics: Towards scalable synthetic rubric generation for reward modeling and llm alignment

    Tianci Liu, Ran Xu, Tony Yu, Ilgee Hong, Carl Yang, Tuo Zhao, and Haoyu Wang. Openrubrics: Towards scalable synthetic rubric generation for reward modeling and llm alignment. arXiv preprint arXiv:2510.07743, 2025a. Yang Liu, Dan Iter, Yichong Xu, Shuohang Wang, Ruochen Xu, and Chenguang Zhu. G-eval: Nlg evaluation using gpt-4 with better human alignment. I...

  17. [17]

    Rm-bench: Benchmarking reward models of language models with subtlety and style

    Yantao Liu, Zijun Yao, Rui Min, Yixin Cao, Lei Hou, and Juanzi Li. Rm-bench: Benchmarking reward models of language models with subtlety and style. In International Conference on Learning Representations, volume 2025, pages 44323–44355, 2025b. Dakota Mahan, Duy Van Phung, Rafael Rafailov, Chase Blagden, Nathan Lile, L...

  18. [18]

    RewardBench 2: Advancing Reward Model Evaluation

    Saumya Malik, Valentina Pyatkin, Sander Land, Jacob Morrison, Noah A Smith, Hannaneh Hajishirzi, and Nathan Lambert. Rewardbench 2: Advancing reward model evaluation. arXiv preprint arXiv:2506.01937,

  19. [19]

    Rubriceval: A rubric-level meta-evaluation benchmark for llm judges in instruction following

    Tianjun Pan, Xuan Lin, Wenyan Yang, Qianyu He, Shisong Chen, Licai Qi, Wanqing Xu, Hongwei Feng, Bo Xu, and Yanghua Xiao. Rubriceval: A rubric-level meta-evaluation benchmark for llm judges in instruction following. arXiv preprint arXiv:2603.25133,

  20. [20]

    Infobench: Evaluating instruction following ability in large language models

    Yiwei Qin, Kaiqiang Song, Yebowen Hu, Wenlin Yao, Sangwoo Cho, Xiaoyang Wang, Xuansheng Wu, Fei Liu, Pengfei Liu, and Dong Yu. Infobench: Evaluating instruction following ability in large language models. In Findings of the Association for Computational Linguistics: ACL 2024, pages 13025–13048,

  21. [21]

    Online rubrics elicitation from pairwise comparisons

    MohammadHossein Rezaei, Robert Vacareanu, Zihao Wang, Clinton Wang, Bing Liu, Yunzhong He, and Afra Feyza Akyürek. Online rubrics elicitation from pairwise comparisons. arXiv preprint arXiv:2510.07284,

  22. [22]

    DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models

    Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, YK Li, Yang Wu, et al. Deepseekmath: Pushing the limits of mathematical reasoning in open language models. arXiv preprint arXiv:2402.03300,

  23. [23]

    Interpretable preferences via multi-objective reward modeling and mixture-of-experts

    Haoxiang Wang, Wei Xiong, Tengyang Xie, Han Zhao, and Tong Zhang. Interpretable preferences via multi-objective reward modeling and mixture-of-experts. In Findings of the Association for Computational Linguistics: EMNLP 2024, pages 10582–10592,

  24. [24]

    Drift: Learning from abundant user dissatisfaction in real-world preference learning

    Yifan Wang, Bolian Li, Junlin Wu, Zhaoxuan Tan, Zheli Liu, Ruqi Zhang, Ananth Grama, and Qingkai Zeng. Drift: Learning from abundant user dissatisfaction in real-world preference learning. arXiv preprint arXiv:2510.02341,

  25. [25]

    Auto-rubric: Learning from implicit weights to explicit rubrics for reward modeling

    Lipeng Xie, Sen Huang, Zhuo Zhang, Anni Zou, Yunpeng Zhai, Dingchao Ren, Kezun Zhang, Haoyuan Hu, Boyin Liu, Haoran Chen, et al. Auto-rubric: Learning from implicit weights to explicit rubrics for reward modeling. arXiv preprint arXiv:2510.17314,

  26. [26]

    Incentivizing agentic reasoning in llm judges via tool-integrated reinforcement learning

    Ran Xu, Jingjing Chen, Jiayu Ye, Yu Wu, Jun Yan, Carl Yang, and Hongkun Yu. Incentivizing agentic reasoning in llm judges via tool-integrated reinforcement learning. arXiv preprint arXiv:2510.23038,

  27. [27]

    Alternating reinforcement learning for rubric-based reward modeling in non-verifiable llm post-training

    Ran Xu, Tianci Liu, Zihan Dong, Tony Yu, Ilgee Hong, Carl Yang, Linjun Zhang, Tao Zhao, and Haoyu Wang. Alternating reinforcement learning for rubric-based reward modeling in non-verifiable llm post-training. arXiv preprint arXiv:2602.01511,

  28. [28]

    Qwen2.5 Technical Report

    An Yang, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chengyuan Li, Dayiheng Liu, Fei Huang, Haoran Wei, Huan Lin, Jian Yang, Jianhong Tu, Jianwei Zhang, Jianxin Yang, Jiaxi Yang, Jingren Zhou, Junyang Lin, Kai Dang, Keming Lu, Keqin Bao, Kexin Yang, Le Yu, Mei Li, Mingfeng Xue, Pei Zhang, Qin Zhu, Rui Men, Runji Lin, Tianhao Li, Tingyu X...

  29. [29]

    Qwen3 Technical Report

    An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, et al. Qwen3 technical report. arXiv preprint arXiv:2505.09388,

  30. [30]

    Beyond correctness: Evaluating subjective writing preferences across cultures

    Shuangshuang Ying, Yunwen Li, Xingwei Qu, Xin Li, Sheng Jin, Minghao Liu, Zhoufutu Wen, Xeron Du, Tianyu Zheng, Yichi Zhang, et al. Beyond correctness: Evaluating subjective writing preferences across cultures. arXiv preprint arXiv:2510.14616,

  31. [31]

    Self-generated critiques boost reward modeling for language models

    Yue Yu, Zhengxing Chen, Aston Zhang, Liang Tan, Chenguang Zhu, Richard Yuanzhe Pang, Yundi Qian, Xuewei Wang, Suchin Gururangan, Chao Zhang, Melanie Kambadur, Dhruv Mahajan, and Rui Hou. Self-generated critiques boost reward modeling for language models. In Luis Chiruzzo, Alan Ritter, and Lu Wang, editors, Proceedings of the 2025 Conference of the Nations ...

  32. [32]

    ISBN 979-8-89176-189-6

    Association for Computational Linguistics. ISBN 979-8-89176-189-6. doi: 10.18653/v1/2025.naacl-long.573. URL https://aclanthology.org/2025.naacl-long.573/. Junkai Zhang, Zihao Wang, Lin Gui, Swarnashree Mysore Sathyendra, Jaehwan Jeong, Victor Veitch, Wei Wang, Yunzhong He, Bing Liu, and Lifeng Jin. Chasing the tail: Effective rubric-based reward modelin...

  33. [33]

    Breaking the exploration bottleneck: Rubric-scaffolded reinforcement learning for general llm reasoning

    Yang Zhou, Sunzhu Li, Shunyu Liu, Wenkai Fang, Kongcheng Zhang, Jiale Zhao, Jingwen Yang, Yihe Zhou, Jianwei Lv, Tongya Zheng, et al. Breaking the exploration bottleneck: Rubric-scaffolded reinforcement learning for general llm reasoning. arXiv preprint arXiv:2508.16949,

  34. [34]

    Experiment A.1

    A. Experiment. A.1. Implementation Details. Tables 9 and 10 summarize the hyperparameters used for Rubric-ARROW and policy model training, respectively. In rubric-based scoring, we assign a weight of 3 to each Hard Rule and a weight of 1 to each Principle when computing the final score. We train the GRPO models using the ms-swift library (Zhao et al., 2025), an...

  35. [35]

    For baseline methods, we use the sampling configurations specified in their official implementations or original papers. A.2. Evaluation Details. Pairwise construction for FollowBench and InfoBench. For these two benchmarks, we convert the original single-response evaluation into a pairwise comparison setting: for each prompt, two responses are sampled from...

  36. [36]

    We also compare against rubric-based reward models Rubric-RM, Rubric-ARROW w/o RL, and RIFL

    is a checklist-based reward signal that converts explicit constraints into a scalar reward. We also compare against rubric-based reward models Rubric-RM, Rubric-ARROW w/o RL, and RIFL. (Footnote 4: https://www.anthropic.com/news/claude-3-5-sonnet; footnote 5: https://openai.com/index/gpt-4-1/) Table 10: Hyper-parameters used in policy m...
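
The implementation excerpt in entry 34 above states that each Hard Rule carries a weight of 3 and each Principle a weight of 1 when the final score is computed. A minimal sketch of such a weighted aggregation follows. The two weights come from that excerpt; the normalization by total weight, the use of per-criterion probabilities as inputs, and all names are assumptions for illustration.

```python
# Weights quoted in the appendix excerpt: 3 per Hard Rule, 1 per Principle.
WEIGHTS = {"hard_rule": 3.0, "principle": 1.0}

def weighted_rubric_score(items):
    """Weighted mean of per-criterion scores.

    items: list of (kind, value) pairs, where kind is "hard_rule" or
    "principle" and value is in [0, 1]. Dividing by the total weight is
    an assumption so that scores are comparable across rubrics of
    different lengths.
    """
    total_weight = sum(WEIGHTS[kind] for kind, _ in items)
    return sum(WEIGHTS[kind] * value for kind, value in items) / total_weight

# Hypothetical rubric with two hard rules and two principles.
rubric = [
    ("hard_rule", 1.0),
    ("hard_rule", 0.0),
    ("principle", 0.5),
    ("principle", 1.0),
]
score = weighted_rubric_score(rubric)  # (3 + 0 + 0.5 + 1) / 8
```

Under this weighting a single failed hard rule costs as much as three failed principles, so explicit constraints dominate the score.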