
Outcome accuracy is not enough: Aligning the reasoning process of reward models

4 Pith papers cite this work. Polarity classification is still indexing.

citation-role summary: method 1

citation-polarity summary: use method 1

fields: cs.AI 2, cs.LG 2

years: 2026 4

verdicts: UNVERDICTED 4

representative citing papers

Rubric-Guided Process Reward for Stepwise Model Routing

cs.AI · 2026-05-28 · unverdicted · novelty 6.0

RoRo uses alternating optimization of a Rubricor and Judge to create process rewards from outcome-cost-process preference pairs, then combines them with outcome rewards via GRPO to train stepwise model routers that outperform baselines on five reasoning benchmarks.
