Rubric-Guided Process Reward for Stepwise Model Routing

Jian Yang; Shenghao Ye; Shuangwu Chen; Yu Guo; Zhengheng Li

arxiv: 2605.29310 · v1 · pith:IWBAFUTYnew · submitted 2026-05-28 · 💻 cs.AI · cs.CL

Pith reviewed 2026-06-29 07:56 UTC · model grok-4.3

keywords: stepwise model routing · process reward · rubric-guided evaluation · large reasoning models · preference pairs · GRPO · alternating optimization

The pith

RoRo trains a Rubricor and Judge via alternating optimization to score routing trajectories with query-specific rubrics, supplying process rewards that outperform outcome-only supervision.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that stepwise model routing currently relies on final-answer rewards that provide no signal about the quality of individual routing choices. RoRo fixes this by first gathering diverse trajectories, forming preference pairs on outcome, cost, and process quality, then alternating between a Rubricor that creates a query-specific rubric and a Judge that scores trajectories against it. The resulting process rewards are added to outcome rewards and used to optimize the router with GRPO. If the approach holds, routing decisions become directly supervised at each step, producing better accuracy and lower cost across model families on reasoning tasks.

Core claim

RoRo collects routing trajectories and builds preference pairs from outcome, cost, and process quality. It then uses alternating optimization to train a Rubricor that generates query-specific evaluation rubrics and a Judge that scores the trajectories under those rubrics. The resulting process rewards, combined with outcome rewards, train a routing policy via GRPO that outperforms baselines on five reasoning benchmarks under same-family and cross-family settings.
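To make the last step of that claim concrete, here is a minimal sketch of how a process reward could be added to an outcome reward and turned into GRPO-style group-relative advantages. The mixing weight `process_weight`, the function names, and the numbers are assumptions for illustration; the paper's actual scaling of the process term is not given in the material shown here.

```python
from statistics import mean, pstdev

def combined_rewards(outcome, process, process_weight=0.5):
    """Add a weighted process reward to each outcome reward.

    process_weight is a hypothetical mixing coefficient, not a value from the paper.
    """
    return [o + process_weight * p for o, p in zip(outcome, process)]

def group_relative_advantages(rewards, eps=1e-8):
    """Standardize rewards within one group of rollouts for the same query."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

# One query, four sampled routing trajectories (placeholder numbers).
outcome = [1.0, 1.0, 0.0, 0.0]   # final-answer correctness
process = [0.8, 0.4, 0.6, 0.2]   # Judge scores under the rubric
rewards = combined_rewards(outcome, process)
advantages = group_relative_advantages(rewards)
```

With outcome rewards alone, the two correct trajectories would receive identical advantages; the process term is what separates them in this sketch.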

What carries the argument

Rubricor-Judge alternating optimization that produces query-specific rubrics and scores routing trajectories to generate process rewards.

If this is right

Intermediate routing decisions receive direct supervision instead of only final-answer correctness.
The router achieves higher accuracy and better cost trade-offs on five reasoning benchmarks in both same-family and cross-family model settings.
Process rewards derived from rubric scoring generalize the supervision signal beyond outcome-only methods.
Alternating optimization between rubric generation and trajectory scoring produces usable rewards for GRPO policy optimization.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same rubric-generation loop could be tested on sequential decisions outside model routing, such as tool-use chains or multi-step planning.
If rubric quality can be maintained without human validation, the method reduces dependence on manually designed evaluation criteria in other RL settings.
Cross-family gains suggest the rubric approach may help when routing between models whose internal representations differ substantially.

Load-bearing premise

Preference pairs built from outcome, cost, and process quality can be scored reliably by the learned Judge under a rubric generated by Rubricor without introducing new biases or needing human checks on rubric quality.

What would settle it

Run a held-out human rating study on scored trajectories and check whether Judge scores under the generated rubrics show low correlation with human judgments on process quality; if correlation is near zero while performance gains vanish on new benchmarks, the rubric-guided process reward claim is falsified.
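The correlation check in that test can be sketched as a Spearman rank correlation between Judge scores and human ratings. This is an illustrative sketch only: `judge_scores` and `human_ratings` are hypothetical aligned lists with made-up values, and the choice of Spearman over another agreement measure is an assumption.

```python
from statistics import mean

def average_ranks(values):
    """Rank values from 1..n, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def pearson(x, y):
    """Pearson correlation; undefined (division by zero) if either input is constant."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / (vx * vy) ** 0.5

def spearman(x, y):
    """Spearman correlation is the Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))

# Placeholder data: one Judge score and one human rating per trajectory.
judge_scores = [0.9, 0.7, 0.4, 0.2, 0.5]
human_ratings = [5, 4, 2, 1, 3]
rho = spearman(judge_scores, human_ratings)
```

A value of `rho` near zero on held-out trajectories would be the warning sign described above; a value near one would support the Judge as a proxy for human process judgments.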

Figures

Figures reproduced from arXiv: 2605.29310 by Jian Yang, Shenghao Ye, Shuangwu Chen, Yu Guo, Zhengheng Li.

**Figure 3.** Overview of the RoRo pipeline. Stage 1 constructs route preference data from routing policies. Stage 2 …

**Figure 4.** Accuracy-FLOPs trade-off curves under same …

**Figure 7.** Routing trajectory of RoRo on Case 1 (omnimath_01261). RoRo concentrates LRM calls on the first 9 steps for problem formulation and reaches the correct answer with 13 LRM calls

**Figure 9.** Routing trajectory of RoRo on Case 2 (omnimath_01924). RoRo invokes the LRM at the early critical stage and reaches the correct answer with only 10 LRM calls

Original abstract

Stepwise model routing improves the efficiency of Large Reasoning Models (LRMs) by assigning each reasoning step to a suitable model. Recent methods formulate routing as a sequential decision process and train the router with reinforcement learning. However, although they model routing as a process, they still supervise the router with outcome rewards. Such rewards only reflect final answer correctness and fail to evaluate intermediate routing decisions, which can weaken performance and generalization. To address this gap, we propose RoRo, a rubric-guided process reward framework for stepwise model routing. RoRo first collects diverse routing trajectories and constructs preference pairs based on outcome, cost, and process quality. It then trains a Rubricor to generate a query-specific evaluation rubric and a Judge to score routing trajectories under this rubric through alternating optimization. The resulting process rewards are combined with outcome rewards to optimize the routing policy via GRPO. Experiments on five reasoning benchmarks under both same-family and cross-family settings show that RoRo consistently outperforms strong baselines and achieves better accuracy and cost trade-offs.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

RoRo adds query-specific rubrics and alternating Rubricor-Judge training to generate process rewards for routing, but without human validation those scores may largely echo the outcome and cost signals already used.

The main takeaway is that the paper gives a concrete pipeline for process-level supervision in stepwise routing: collect trajectories, build preference pairs from outcome/cost/process quality, then alternate between a Rubricor that writes query-specific rubrics and a Judge that scores under them, and finally mix the resulting process rewards into GRPO.

What is actually new is the combination of query-dependent rubric generation with alternating optimization and the use of those scores as an additive term in the router's RL objective. Prior routing work stayed with outcome rewards; this tries to close that gap directly.

The approach is reasonable on paper. Routing decisions are intermediate, so a reward that looks at the step rather than only the final answer makes sense in principle, and the five-benchmark setup (same-family and cross-family) is a fair test bed.

The soft spot is exactly the one the stress-test flags. The process reward depends on Judge scores produced under Rubricor rubrics, yet the abstract and summary give no evidence that those scores were checked against human process judgments, or that the alternating optimization avoids simply rediscovering patterns already present in the outcome and cost data. If the added term is mostly redundant, the reported accuracy-cost gains could come from data construction, hyperparameter choices, or baseline differences rather than genuine process supervision. Smaller issues are the absence, in the provided material, of statistical significance tests and of an ablation on rubric quality.

This is for people working on inference-time routing and process rewards for reasoning models. A reader already following that literature will see a clear next step even if the validation is missing.

It deserves peer review because the idea is well-motivated and the experimental claims are specific enough to be checked. I would send it, with the main referee questions focused on whether the learned process scores actually capture independent routing quality.

Referee Report

3 major / 2 minor

Summary. The paper proposes RoRo, a rubric-guided process reward framework for stepwise model routing in Large Reasoning Models. It collects diverse routing trajectories, constructs preference pairs based on outcome, cost, and process quality, then trains a Rubricor to generate query-specific rubrics and a Judge to score trajectories via alternating optimization. The resulting process rewards are combined with outcome rewards to optimize the routing policy using GRPO. Experiments on five reasoning benchmarks under same-family and cross-family settings report that RoRo consistently outperforms strong baselines with improved accuracy-cost trade-offs.

Significance. If the process rewards derived from the learned Judge under Rubricor rubrics genuinely reflect intermediate routing quality independent of outcome and cost signals, the framework would address a clear gap in outcome-only supervision for sequential routing decisions, potentially enabling more efficient and generalizable inference in LRMs.

major comments (3)

[Experiments] The central claim that process rewards improve routing over outcome-only baselines rests on the Judge producing scores that capture genuine process quality. However, the manuscript provides no human validation, inter-annotator agreement, or correlation analysis between Judge scores and human process judgments (Experiments section), leaving open the possibility that gains arise from rubric artifacts or reproduction of outcome/cost patterns rather than new information.
[Method] The alternating optimization between Rubricor and Judge is presented as producing stable, query-specific rubrics, but no convergence diagnostics, stability metrics, or ablation on whether rubrics remain non-circular with the preference-construction signals (outcome, cost, process quality) are reported (Method section). This is load-bearing for attributing performance gains to the process term in GRPO.
[Experiments] Table or figure reporting the five-benchmark results does not include statistical significance tests, variance across runs, or controls for GRPO hyperparameter sensitivity and baseline implementation details, which are required to substantiate the 'consistent outperformance' claim given the low-visibility dataset construction process.

minor comments (2)

[Introduction] The abstract and method description introduce 'Rubricor' and 'Judge' without a dedicated related-work subsection contrasting them against prior rubric or LLM-as-judge approaches.
[Method] Notation for the combined reward (process + outcome) inside GRPO is introduced without an explicit equation reference, making it harder to trace how the process term is scaled.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive feedback. We address each major comment below and will revise the manuscript to incorporate the requested validation, diagnostics, and statistical details.

Referee: [Experiments] The central claim that process rewards improve routing over outcome-only baselines rests on the Judge producing scores that capture genuine process quality. However, the manuscript provides no human validation, inter-annotator agreement, or correlation analysis between Judge scores and human process judgments (Experiments section), leaving open the possibility that gains arise from rubric artifacts or reproduction of outcome/cost patterns rather than new information.

Authors: We agree that explicit human validation would provide stronger evidence that the Judge scores reflect genuine process quality. The preference pairs explicitly incorporate process-quality signals during construction, but to address the concern we will add a small-scale human correlation study (with inter-annotator agreement) on a held-out subset of trajectories in the revised manuscript.
revision: yes

Referee: [Method] The alternating optimization between Rubricor and Judge is presented as producing stable, query-specific rubrics, but no convergence diagnostics, stability metrics, or ablation on whether rubrics remain non-circular with the preference-construction signals (outcome, cost, process quality) are reported (Method section). This is load-bearing for attributing performance gains to the process term in GRPO.

Authors: We will add convergence curves for the alternating optimization, stability metrics across iterations, and an ablation demonstrating that the learned rubrics do not collapse to outcome/cost signals alone. These additions will be placed in the Method section and appendix of the revision.
revision: yes

Referee: [Experiments] Table or figure reporting the five-benchmark results does not include statistical significance tests, variance across runs, or controls for GRPO hyperparameter sensitivity and baseline implementation details, which are required to substantiate the 'consistent outperformance' claim given the low-visibility dataset construction process.

Authors: We acknowledge the need for statistical rigor. The revised manuscript will report means and standard deviations over multiple random seeds, include paired t-test p-values against baselines, and provide additional details on GRPO hyperparameter ranges and baseline re-implementation choices.
revision: yes
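The seed-level reporting promised in that last response can be sketched as below. This is an illustration only: the accuracy lists are placeholder values, not results from the paper, and the helper names are hypothetical. The sketch stops at the paired t statistic and its degrees of freedom; a p-value would then come from a t-distribution table or a library routine such as `scipy.stats.ttest_rel`.

```python
from math import sqrt
from statistics import mean, stdev

def summarize(values):
    """Mean and sample standard deviation over random seeds."""
    return mean(values), stdev(values)

def paired_t_statistic(a, b):
    """Paired t statistic and degrees of freedom for per-seed scores a vs. b."""
    diffs = [x - y for x, y in zip(a, b)]
    n = len(diffs)
    t = mean(diffs) / (stdev(diffs) / sqrt(n))
    return t, n - 1

# Placeholder per-seed accuracies for one benchmark (same seeds, same order).
method_acc = [0.71, 0.73, 0.70, 0.74, 0.72]
baseline_acc = [0.68, 0.70, 0.69, 0.70, 0.69]

method_mean, method_std = summarize(method_acc)
t_stat, dof = paired_t_statistic(method_acc, baseline_acc)
```

Pairing by seed matters here: it removes seed-to-seed variance that both systems share, which an unpaired comparison of the two means would not.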

Circularity Check

0 steps flagged

No significant circularity detected; derivation relies on external preference construction and benchmark evaluation.

The abstract describes an external data collection step to build preference pairs from outcome, cost, and process quality signals, followed by alternating optimization to train Rubricor and Judge, with the resulting scores added to outcome rewards inside GRPO. This structure does not reduce the process reward term to a quantity defined by the router policy itself or to fitted parameters by construction. No self-definitional equations, fitted-input predictions, or load-bearing self-citations are identifiable from the provided text. The performance claims rest on experiments across five benchmarks rather than internal redefinitions, making the chain self-contained against external evaluation.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 2 invented entities

Abstract-only; no explicit free parameters, axioms, or invented entities are stated beyond the introduction of Rubricor and Judge as trainable components. The framework implicitly assumes that process quality can be captured by rubric-based scoring without circular dependence on the router being trained.

invented entities (2)

Rubricor (no independent evidence)
purpose: Generates query-specific evaluation rubrics for scoring routing trajectories
Introduced as a trainable model in the alternating optimization loop; no independent evidence provided in abstract.

Judge (no independent evidence)
purpose: Scores routing trajectories under the generated rubric
Introduced as a trainable model in the alternating optimization loop; no independent evidence provided in abstract.

pith-pipeline@v0.9.1-grok · 5712 in / 1330 out tokens · 18515 ms · 2026-06-29T07:56:23.529818+00:00 · methodology

Reference graph

Works this paper leans on

15 extracted references · 4 canonical work pages · 3 internal anchors

[1]

DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning

Rubrics as rewards: Reinforcement learning beyond verifiable domains. In International Conference on Learning Representations. Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Peiyi Wang, Qihao Zhu, Runxin Xu, Ruoyu Zhang, Shirong Ma, Xiao Bi, and 1 others. 2025. Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning. arXiv p...
[2]

Let's Verify Step by Step

Let's verify step by step. arXiv preprint arXiv:2305.20050. Tianci Liu, Ran Xu, Tony Yu, Ilgee Hong, Carl Yang, Tuo Zhao, and Haoyu Wang. 2025. Openrubrics: Towards scalable synthetic rubric generation for reward modeling and LLM alignment. arXiv preprint arXiv:2510.07743. Qianli Ma, Haotian Zhou, Tingkai Liu, Jianbo Yuan, Pengfei Liu, Yang You, and Hongx...
[3]

Solving math word problems with process-based and outcome-based feedback. Binghai Wang, Yantao Liu, Yuxuan Liu, Tianyi Tang, Shenzhi Wang, Chang Gao, Chujie Zheng, Yichang Zhang, Le Yu, Shixuan Liu, Tao Gui, Qi Zhang, Xuanjing Huang, Bowen Yu, Fei Huang, and Junyang Lin. 2026. Outcome accuracy is not enough: Aligning the reasoning process of reward m...
[4]

Qwen3 Technical Report

Qwen3 technical report. arXiv preprint arXiv:2505.09388. Wenhao Zeng, Xuteng Zhang, Yuling Shi, Chao Hu, Yuting Chen, Beijun Shen, and Xiaodong Gu. 2026. Glimprouter: Efficient collaborative inference by glimpsing one token of thoughts. arXiv preprint arXiv:2601.05110. Haozhen Zhang, Tao Feng, and Jiaxuan You. 2026. Router-r1: Teaching llms multi-round rout...
[5]

The routing policy is a 2-layer MLP with 128 hidden units, following TRIM (Kapoor et al., 2026)

for cross-family collaboration. The routing policy is a 2-layer MLP with 128 hidden units, following TRIM (Kapoor et al., 2026). It takes a 5-dimensional state vector as input, including three uncertainty-derived features (the current-step uncertainty, the minimum and average uncertainty over the prefix), the step token count normalized by a fixed constan...

2026
[6]

We use the subset with difficulty levels 1–10 for our evaluation

The benchmark spans a wide range of mathematical topics and requires advanced problem-solving strategies. We use the subset with difficulty levels 1–10 for our evaluation. GSM8K (Cobbe et al., 2021). GSM8K is a benchmark of 8,792 grade-school math word problems that require multi-step arithmetic reasoning. Although the individual reasoning steps are s...

2021
[7]

It evaluates the routing process rather than final answer correctness
[8]

It is label-agnostic: it must not refer to trajectory IDs, final correctness, reference answers, or which trajectory is preferred
[9]

It should be applicable beyond this specific problem, while still being relevant to the routing challenges shown in the trajectory pool
[10]

It should capture whether the route prevents, repairs, or verifies high-impact reasoning errors
[11]

always use LRM in later steps

It must not collapse into trivial heuristics such as "always use LRM in later steps", "always minimize LRM calls", "always avoid switching", or "use LRM whenever the solution is long". ### Possible Aspects: Possible aspects include, but are not limited to: intervention before error propagation; timeliness of escalation under uncertainty; avoiding LRM call...
[12]

Decide whether the trajectory satisfies the criterion
[13]

satisfied

Set "satisfied" to true if the routing behavior clearly satisfies the criterion
[14]

satisfied

Set "satisfied" to false if the routing behavior clearly violates the criterion or lacks evidence for satisfying it
[15]

criterion_judgments

Use the criterion and score_guidance to make the decision. Compute the final process score as: final_score = sum of weight * score * indicator where indicator = 1 if satisfied is true and 0 otherwise. ### Output Format: Return only a valid JSON object: {"criterion_judgments": [{"criterion": "the original criterion text", "score": 0.5, "satisfied": true} ....
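The scoring rule quoted in the Judge prompt excerpt above (final_score = sum of weight * score * indicator) can be sketched as follows. The JSON field names `criterion_judgments`, `criterion`, `score`, and `satisfied` come from the excerpt; where the per-criterion weight lives is not visible in the truncated text, so the `weights` mapping keyed by criterion text, the function name, and the example criteria are all assumptions.

```python
import json

def final_process_score(judge_output: str, weights: dict) -> float:
    """Sum weight * score over criteria the Judge marked as satisfied.

    weights maps criterion text to its rubric weight (hypothetical layout).
    """
    judgments = json.loads(judge_output)["criterion_judgments"]
    total = 0.0
    for j in judgments:
        indicator = 1.0 if j["satisfied"] else 0.0
        total += weights[j["criterion"]] * j["score"] * indicator
    return total

# Made-up rubric weights and Judge output for illustration.
weights = {
    "escalates before errors propagate": 0.6,
    "avoids redundant LRM calls": 0.4,
}
judge_output = json.dumps({
    "criterion_judgments": [
        {"criterion": "escalates before errors propagate", "score": 0.5, "satisfied": True},
        {"criterion": "avoids redundant LRM calls", "score": 1.0, "satisfied": False},
    ]
})
score = final_process_score(judge_output, weights)
```

Because of the indicator, an unsatisfied criterion contributes nothing regardless of its score, so only the first criterion counts in this example.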
