Outcome accuracy is not enough: Aligning the reasoning process of reward models

Binghai Wang, Yantao Liu, Yuxuan Liu, Tianyi Tang, Shenzhi Wang, Chang Gao, Chujie Zheng, Yichang Zhang, Le Yu, Shixuan Liu, Tao Gui, Qi Zhang, Xuanjing Huang, Bowen Yu, Fei Huang, Junyang Lin · 2026 · arXiv 2602.04649

4 Pith papers cite this work. Polarity classification is still indexing.

4 Pith papers citing it

read on arXiv browse 4 citing papers

citation-role summary

method 1

citation-polarity summary

use method 1

representative citing papers

From Answers to States: Verifiable Process-Level Evaluation of Chemical Reasoning in Large Language Models

cs.AI · 2026-06-02 · unverdicted · novelty 7.0

ChemCoTBench-V2 is a new rule-verifiable benchmark with 5,620 samples across 18 tasks that evaluates LLM chemical reasoning traces using deterministic chemistry rules and reference traces rather than final answers alone.

Skill-RM: Unifying Heterogeneous Evaluation Criteria via Agent Skill

cs.LG · 2026-06-02 · unverdicted · novelty 6.0

Skill-RM unifies heterogeneous reward criteria by modeling reward computation as dynamic execution of a reusable Reward-Evaluation Skill within an agent framework.

Rubric-Guided Process Reward for Stepwise Model Routing

cs.AI · 2026-05-28 · unverdicted · novelty 6.0

RoRo uses alternating optimization of a Rubricor and Judge to create process rewards from outcome-cost-process preference pairs, then combines them with outcome rewards via GRPO to train stepwise model routers that outperform baselines on five reasoning benchmarks.

Reward Hacking in the Era of Large Models: Mechanisms, Emergent Misalignment, Challenges

cs.LG · 2026-04-15 · unverdicted · novelty 5.0

The paper introduces the Proxy Compression Hypothesis as a unifying framework explaining reward hacking in RLHF as an emergent result of compressing high-dimensional human objectives into proxy reward signals under optimization pressure.

citing papers explorer

Showing 4 of 4 citing papers.

From Answers to States: Verifiable Process-Level Evaluation of Chemical Reasoning in Large Language Models cs.AI · 2026-06-02 · unverdicted · none · ref 74
ChemCoTBench-V2 is a new rule-verifiable benchmark with 5,620 samples across 18 tasks that evaluates LLM chemical reasoning traces using deterministic chemistry rules and reference traces rather than final answers alone.
Skill-RM: Unifying Heterogeneous Evaluation Criteria via Agent Skill cs.LG · 2026-06-02 · unverdicted · none · ref 48
Skill-RM unifies heterogeneous reward criteria by modeling reward computation as dynamic execution of a reusable Reward-Evaluation Skill within an agent framework.
Rubric-Guided Process Reward for Stepwise Model Routing cs.AI · 2026-05-28 · unverdicted · none · ref 3
RoRo uses alternating optimization of a Rubricor and Judge to create process rewards from outcome-cost-process preference pairs, then combines them with outcome rewards via GRPO to train stepwise model routers that outperform baselines on five reasoning benchmarks.
Reward Hacking in the Era of Large Models: Mechanisms, Emergent Misalignment, Challenges cs.LG · 2026-04-15 · unverdicted · none · ref 137
The paper introduces the Proxy Compression Hypothesis as a unifying framework explaining reward hacking in RLHF as an emergent result of compressing high-dimensional human objectives into proxy reward signals under optimization pressure.

Outcome accuracy is not enough: Aligning the reasoning process of reward models

citation-role summary

citation-polarity summary

fields

years

verdicts

roles

polarities

representative citing papers

citing papers explorer