arxiv: 2605.29310 · v1 · pith:IWBAFUTYnew · submitted 2026-05-28 · 💻 cs.AI · cs.CL

Rubric-Guided Process Reward for Stepwise Model Routing

Pith reviewed 2026-06-29 07:56 UTC · model grok-4.3

classification 💻 cs.AI cs.CL
keywords stepwise model routing · process reward · rubric-guided evaluation · large reasoning models · preference pairs · GRPO · alternating optimization

The pith

RoRo trains a Rubricor and Judge via alternating optimization to score routing trajectories with query-specific rubrics, supplying process rewards that outperform outcome-only supervision.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that stepwise model routing currently relies on final-answer rewards that provide no signal about the quality of individual routing choices. RoRo fixes this by first gathering diverse trajectories, forming preference pairs on outcome, cost, and process quality, then alternating between a Rubricor that creates a query-specific rubric and a Judge that scores trajectories against it. The resulting process rewards are added to outcome rewards and used to optimize the router with GRPO. If the approach holds, routing decisions become directly supervised at each step, producing better accuracy and lower cost across model families on reasoning tasks.

Core claim

RoRo collects routing trajectories, builds preference pairs from outcome, cost, and process quality, and uses alternating optimization to train Rubricor to generate query-specific evaluation rubrics and a Judge to score the trajectories under those rubrics; the resulting process rewards, when combined with outcome rewards, train a routing policy via GRPO that outperforms baselines on five reasoning benchmarks under same-family and cross-family settings.
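
As a rough illustration of how a process reward could enter GRPO alongside the outcome reward, the sketch below adds a scaled process score to each trajectory's outcome reward and then standardizes rewards within one sampled group. The additive form, the `process_weight` value, and all reward numbers are assumptions made for illustration; the summary above does not state how the paper scales the process term.

```python
from statistics import mean, pstdev


def combined_rewards(outcome, process, process_weight=0.5):
    """Add a scaled process reward to each outcome reward.

    process_weight is a hypothetical scaling factor, not a value from the paper.
    """
    if len(outcome) != len(process):
        raise ValueError("outcome and process must have the same length")
    return [o + process_weight * p for o, p in zip(outcome, process)]


def group_relative_advantages(rewards, eps=1e-8):
    """Standardize rewards within one group of trajectories for the same query.

    Uses the population standard deviation; implementations differ on this detail.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four routing trajectories sampled for one query (made-up numbers).
outcome = [1.0, 1.0, 0.0, 0.0]   # final-answer correctness
process = [0.8, 0.2, 0.6, 0.1]   # rubric-based process scores

outcome_only = group_relative_advantages(outcome)
with_process = group_relative_advantages(combined_rewards(outcome, process))
```

With outcome rewards alone, the two correct trajectories receive the same advantage. Once the process term is added, they are separated by how well their routing was scored, which is the extra signal the core claim refers to.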

What carries the argument

Rubricor-Judge alternating optimization that produces query-specific rubrics and scores routing trajectories to generate process rewards.

If this is right

  • Intermediate routing decisions receive direct supervision instead of only final-answer correctness.
  • The router achieves higher accuracy and better cost trade-offs on five reasoning benchmarks in both same-family and cross-family model settings.
  • Process rewards derived from rubric scoring generalize the supervision signal beyond outcome-only methods.
  • Alternating optimization between rubric generation and trajectory scoring produces usable rewards for GRPO policy optimization.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same rubric-generation loop could be tested on sequential decisions outside model routing, such as tool-use chains or multi-step planning.
  • If rubric quality can be maintained without human validation, the method reduces dependence on manually designed evaluation criteria in other RL settings.
  • Cross-family gains suggest the rubric approach may help when routing between models whose internal representations differ substantially.

Load-bearing premise

Preference pairs built from outcome, cost, and process quality can be scored reliably by the learned Judge under a rubric generated by Rubricor without introducing new biases or needing human checks on rubric quality.

What would settle it

Run a held-out human rating study on scored trajectories and measure how well Judge scores under the generated rubrics correlate with human judgments of process quality. If the correlation is near zero and the performance gains vanish on new benchmarks, the rubric-guided process reward claim is falsified.
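
One way such a check could be run is with a rank correlation between Judge scores and human ratings. The sketch below computes Spearman's rho using only the standard library. The choice of Spearman correlation and the example scores are assumptions for illustration; no such study is reported in the material summarized here.

```python
from math import sqrt


def average_ranks(values):
    """Rank values from 1..n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1  # mean of the 1-based positions i+1..j+1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / sqrt(vx * vy)


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the average ranks."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    return pearson(average_ranks(x), average_ranks(y))


# Made-up example: Judge scores and human ratings for five trajectories.
judge_scores = [0.9, 0.4, 0.7, 0.1, 0.6]
human_ratings = [5, 2, 4, 1, 3]
rho = spearman(judge_scores, human_ratings)
```

In this toy data the two rankings agree exactly, so `rho` is 1. A value near zero on real held-out trajectories would be the warning sign described above.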

Figures

Figures reproduced from arXiv: 2605.29310 by Jian Yang, Shenghao Ye, Shuangwu Chen, Yu Guo, Zhengheng Li.

Figure 1: Budgeted accuracy during router optimization.
Figure 3: Overview of the RoRo pipeline. Stage 1 constructs route preference data from routing policies. Stage 2 …
Figure 4: Accuracy-FLOPs trade-off curves under same …
Figure 7: Routing trajectory of RoRo on Case 1 (omnimath_01261). RoRo concentrates LRM calls on the first 9 steps for problem formulation and reaches the correct answer with 13 LRM calls.
Figure 9: Routing trajectory of RoRo on Case 2 (omnimath_01924). RoRo invokes the LRM at the early critical stage and reaches the correct answer with only 10 LRM calls.
Original abstract

Stepwise model routing improves the efficiency of Large Reasoning Models (LRMs) by assigning each reasoning step to a suitable model. Recent methods formulate routing as a sequential decision process and train the router with reinforcement learning. However, although they model routing as a process, they still supervise the router with outcome rewards. Such rewards only reflect final answer correctness and fail to evaluate intermediate routing decisions, which can weaken performance and generalization. To address this gap, we propose RoRo, a rubric-guided process reward framework for stepwise model routing. RoRo first collects diverse routing trajectories and constructs preference pairs based on outcome, cost, and process quality. It then trains a Rubricor to generate a query-specific evaluation rubric and a Judge to score routing trajectories under this rubric through alternating optimization. The resulting process rewards are combined with outcome rewards to optimize the routing policy via GRPO. Experiments on five reasoning benchmarks under both same-family and cross-family settings show that RoRo consistently outperforms strong baselines and achieves better accuracy and cost trade-offs.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper proposes RoRo, a rubric-guided process reward framework for stepwise model routing in Large Reasoning Models. It collects diverse routing trajectories, constructs preference pairs based on outcome, cost, and process quality, then trains a Rubricor to generate query-specific rubrics and a Judge to score trajectories via alternating optimization. The resulting process rewards are combined with outcome rewards to optimize the routing policy using GRPO. Experiments on five reasoning benchmarks under same-family and cross-family settings report that RoRo consistently outperforms strong baselines with improved accuracy-cost trade-offs.

Significance. If the process rewards derived from the learned Judge under Rubricor rubrics genuinely reflect intermediate routing quality independent of outcome and cost signals, the framework would address a clear gap in outcome-only supervision for sequential routing decisions, potentially enabling more efficient and generalizable inference in LRMs.

major comments (3)
  1. [Experiments] The central claim that process rewards improve routing over outcome-only baselines rests on the Judge producing scores that capture genuine process quality. However, the manuscript provides no human validation, inter-annotator agreement, or correlation analysis between Judge scores and human process judgments (Experiments section), leaving open the possibility that gains arise from rubric artifacts or reproduction of outcome/cost patterns rather than new information.
  2. [Method] The alternating optimization between Rubricor and Judge is presented as producing stable, query-specific rubrics, but no convergence diagnostics, stability metrics, or ablation on whether rubrics remain non-circular with the preference-construction signals (outcome, cost, process quality) are reported (Method section). This is load-bearing for attributing performance gains to the process term in GRPO.
  3. [Experiments] Table or figure reporting the five-benchmark results does not include statistical significance tests, variance across runs, or controls for GRPO hyperparameter sensitivity and baseline implementation details, which are required to substantiate the 'consistent outperformance' claim given the low-visibility dataset construction process.
minor comments (2)
  1. [Introduction] The abstract and method description introduce 'Rubricor' and 'Judge' without a dedicated related-work subsection contrasting them against prior rubric or LLM-as-judge approaches.
  2. [Method] Notation for the combined reward (process + outcome) inside GRPO is introduced without an explicit equation reference, making it harder to trace how the process term is scaled.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive feedback. We address each major comment below and will revise the manuscript to incorporate the requested validation, diagnostics, and statistical details.

Point-by-point responses
  1. Referee: [Experiments] The central claim that process rewards improve routing over outcome-only baselines rests on the Judge producing scores that capture genuine process quality. However, the manuscript provides no human validation, inter-annotator agreement, or correlation analysis between Judge scores and human process judgments (Experiments section), leaving open the possibility that gains arise from rubric artifacts or reproduction of outcome/cost patterns rather than new information.

    Authors: We agree that explicit human validation would provide stronger evidence that the Judge scores reflect genuine process quality. The preference pairs explicitly incorporate process-quality signals during construction, but to address the concern we will add a small-scale human correlation study (with inter-annotator agreement) on a held-out subset of trajectories in the revised manuscript. revision: yes

  2. Referee: [Method] The alternating optimization between Rubricor and Judge is presented as producing stable, query-specific rubrics, but no convergence diagnostics, stability metrics, or ablation on whether rubrics remain non-circular with the preference-construction signals (outcome, cost, process quality) are reported (Method section). This is load-bearing for attributing performance gains to the process term in GRPO.

    Authors: We will add convergence curves for the alternating optimization, stability metrics across iterations, and an ablation demonstrating that the learned rubrics do not collapse to outcome/cost signals alone. These additions will be placed in the Method section and appendix of the revision. revision: yes

  3. Referee: [Experiments] Table or figure reporting the five-benchmark results does not include statistical significance tests, variance across runs, or controls for GRPO hyperparameter sensitivity and baseline implementation details, which are required to substantiate the 'consistent outperformance' claim given the low-visibility dataset construction process.

    Authors: We acknowledge the need for statistical rigor. The revised manuscript will report means and standard deviations over multiple random seeds, include paired t-test p-values against baselines, and provide additional details on GRPO hyperparameter ranges and baseline re-implementation choices. revision: yes
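
The statistics promised in the third response can be sketched as follows: per-method mean and standard deviation across seeds, plus a paired t statistic on the per-seed differences. The accuracy values are placeholders, not results from the paper, and the function returns only the statistic and degrees of freedom; turning these into a p-value needs a t-distribution table or a statistics library.

```python
from math import sqrt
from statistics import mean, stdev


def paired_t_statistic(a, b):
    """Return (t, degrees_of_freedom) for paired samples a and b."""
    if len(a) != len(b) or len(a) < 2:
        raise ValueError("need two paired samples of equal length >= 2")
    diffs = [x - y for x, y in zip(a, b)]
    sd = stdev(diffs)  # sample standard deviation of the differences
    if sd == 0:
        raise ValueError("all paired differences are identical")
    t = mean(diffs) / (sd / sqrt(len(diffs)))
    return t, len(diffs) - 1


# Placeholder accuracies over five seeds (illustrative only, not paper results).
method_a = [0.62, 0.64, 0.61, 0.65, 0.63]
method_b = [0.60, 0.61, 0.60, 0.62, 0.61]

summary_a = (mean(method_a), stdev(method_a))
summary_b = (mean(method_b), stdev(method_b))
t_stat, dof = paired_t_statistic(method_a, method_b)
```

Pairing by seed is appropriate only if both methods are run with the same seeds and data splits; otherwise an unpaired test is the safer choice.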

Circularity Check

0 steps flagged

No significant circularity detected; derivation relies on external preference construction and benchmark evaluation.

full rationale

The abstract describes an external data collection step to build preference pairs from outcome, cost, and process quality signals, followed by alternating optimization to train Rubricor and Judge, with the resulting scores added to outcome rewards inside GRPO. This structure does not reduce the process reward term to a quantity defined by the router policy itself or to fitted parameters by construction. No self-definitional equations, fitted-input predictions, or load-bearing self-citations are identifiable from the provided text. The performance claims rest on experiments across five benchmarks rather than internal redefinitions, making the chain self-contained against external evaluation.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 2 invented entities

Abstract-only; no explicit free parameters, axioms, or invented entities are stated beyond the introduction of Rubricor and Judge as trainable components. The framework implicitly assumes that process quality can be captured by rubric-based scoring without circular dependence on the router being trained.

invented entities (2)
  • Rubricor (no independent evidence)
    purpose: Generates query-specific evaluation rubrics for scoring routing trajectories
    Introduced as a trainable model in the alternating optimization loop; no independent evidence provided in abstract.
  • Judge (no independent evidence)
    purpose: Scores routing trajectories under the generated rubric
    Introduced as a trainable model in the alternating optimization loop; no independent evidence provided in abstract.

pith-pipeline@v0.9.1-grok · 5712 in / 1330 out tokens · 18515 ms · 2026-06-29T07:56:23.529818+00:00 · methodology


Reference graph

Works this paper leans on

15 extracted references · 4 canonical work pages · 3 internal anchors

  1. [1]

    DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning

    Rubrics as rewards: Reinforcement learning beyond verifiable domains. In International Conference on Learning Representations. Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Peiyi Wang, Qihao Zhu, Runxin Xu, Ruoyu Zhang, Shirong Ma, Xiao Bi, and 1 others. 2025. Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning. arXiv p...

  2. [2]

    Let's Verify Step by Step

    Let’s verify step by step. arXiv preprint arXiv:2305.20050. Tianci Liu, Ran Xu, Tony Yu, Ilgee Hong, Carl Yang, Tuo Zhao, and Haoyu Wang. 2025. Openrubrics: Towards scalable synthetic rubric generation for reward modeling and LLM alignment. arXiv preprint arXiv:2510.07743. Qianli Ma, Haotian Zhou, Tingkai Liu, Jianbo Yuan, Pengfei Liu, Yang You, and Hongx...

  3. [3]

    Solving math word problems with process-based and outcome-based feedback. Binghai Wang, Yantao Liu, Yuxuan Liu, Tianyi Tang, Shenzhi Wang, Chang Gao, Chujie Zheng, Yichang Zhang, Le Yu, Shixuan Liu, Tao Gui, Qi Zhang, Xuanjing Huang, Bowen Yu, Fei Huang, and Junyang Lin. 2026. Outcome accuracy is not enough: Aligning the reasoning process of reward m...

  4. [4]

    Qwen3 Technical Report

    Qwen3 technical report. arXiv preprint arXiv:2505.09388. Wenhao Zeng, Xuteng Zhang, Yuling Shi, Chao Hu, Yuting Chen, Beijun Shen, and Xiaodong Gu. 2026. Glimprouter: Efficient collaborative inference by glimpsing one token of thoughts. arXiv preprint arXiv:2601.05110. Haozhen Zhang, Tao Feng, and Jiaxuan You. 2026. Router-r1: Teaching llms multi-round rout...

  5. [5]

    The routing policy is a 2-layer MLP with 128 hidden units, following TRIM (Kapoor et al., 2026)

    for cross-family collaboration. The routing policy is a 2-layer MLP with 128 hidden units, following TRIM (Kapoor et al., 2026). It takes a 5-dimensional state vector as input, including three uncertainty-derived features (the current-step uncertainty, the minimum and average uncertainty over the prefix), the step token count normalized by a fixed constan...

  6. [6]

    We use the subset with difficulty levels 1–10 for our evaluation

    The benchmark spans a wide range of mathematical topics and requires advanced problem-solving strategies. We use the subset with difficulty levels 1–10 for our evaluation. GSM8K (Cobbe et al., 2021). GSM8K is a benchmark of 8,792 grade-school math word problems that require multi-step arithmetic reasoning. Although the individual reasoning steps are s...

  7. [7]

    It evaluates the routing process rather than final answer correctness

  8. [8]

    It is label-agnostic: it must not refer to trajectory IDs, final correctness, reference answers, or which trajectory is preferred

  9. [9]

    It should be applicable beyond this specific problem, while still being relevant to the routing challenges shown in the trajectory pool

  10. [10]

    It should capture whether the route prevents, repairs, or verifies high-impact reasoning errors

  11. [11]

    always use LRM in later steps

    It must not collapse into trivial heuristics such as "always use LRM in later steps", "always minimize LRM calls", "always avoid switching", or "use LRM whenever the solution is long". ### Possible Aspects: Possible aspects include, but are not limited to: intervention before error propagation; timeliness of escalation under uncertainty; avoiding LRM call...

  12. [12]

    Decide whether the trajectory satisfies the criterion

  13. [13]

    satisfied

    Set "satisfied" to true if the routing behavior clearly satisfies the criterion

  14. [14]

    satisfied

    Set "satisfied" to false if the routing behavior clearly violates the criterion or lacks evidence for satisfying it

  15. [15]

    criterion_judgments

    Use the criterion and score_guidance to make the decision. Compute the final process score as: final_score = sum of weight * score * indicator where indicator = 1 if satisfied is true and 0 otherwise. ### Output Format: Return only a valid JSON object: {"criterion_judgments": [{"criterion": "the original criterion text", "score": 0.5, "satisfied": true} ....
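
The scoring rule quoted in the last extracted fragment (final_score = sum of weight * score * indicator) can be sketched as below. The JSON keys `criterion_judgments`, `criterion`, `score`, and `satisfied` come from the quoted output format. Where the weights come from is not visible in the truncated text, so passing them as a separate mapping keyed by criterion text is an assumption, as are the example criteria and numbers.

```python
import json


def final_process_score(judge_output, weights):
    """Sum weight * score over the criteria the Judge marked as satisfied.

    judge_output: JSON string in the quoted output format.
    weights: hypothetical mapping from criterion text to its rubric weight.
    """
    judgments = json.loads(judge_output)["criterion_judgments"]
    total = 0.0
    for judgment in judgments:
        indicator = 1.0 if judgment["satisfied"] else 0.0
        total += weights[judgment["criterion"]] * judgment["score"] * indicator
    return total


# Made-up rubric weights and Judge output for one trajectory.
weights = {
    "escalates before errors propagate": 0.6,
    "avoids redundant LRM calls": 0.4,
}
judge_output = json.dumps({
    "criterion_judgments": [
        {"criterion": "escalates before errors propagate", "score": 0.5, "satisfied": True},
        {"criterion": "avoids redundant LRM calls", "score": 1.0, "satisfied": False},
    ]
})
score = final_process_score(judge_output, weights)
```

Only the first criterion is satisfied here, so the second contributes nothing despite its higher score, and the result is 0.6 * 0.5 = 0.3.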