pith. machine review for the scientific record.

arxiv: 2605.06116 · v1 · submitted 2026-05-07 · 💻 cs.AI

Recognition: unknown

Policy-Guided Stepwise Model Routing for Cost-Effective Reasoning

Insup Lee, Osbert Bastani, Wenwen Si

Authors on Pith: no claims yet

Pith reviewed 2026-05-08 10:25 UTC · model grok-4.3

classification 💻 cs.AI
keywords model routing · reinforcement learning · chain-of-thought reasoning · cost-efficient inference · large language models · math reasoning benchmarks · stepwise decision making

The pith

A small reinforcement learning policy routes chain-of-thought steps between large and small models to improve accuracy per unit cost.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper frames the choice of which model to invoke for each next reasoning step as a constrained optimization problem. It solves this by training a compact policy with reinforcement learning and then calibrating decision thresholds to balance correctness against total inference spend. Experiments on GSM8K, MATH500, and OmniMath show the resulting routing curve lies above handcrafted baselines and matches methods that train far larger process-reward models. The approach requires only a small additional network and works for both open-weight and closed API models.
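As a rough illustration of the mechanism, the sketch below shows one way such a stepwise routing loop could look, assuming hypothetical interfaces: policy(state) returns an escalation score, and small_model / large_model each map the question and partial chain of thought to the next step. The costs, threshold, and stopping rule are placeholders, not the paper's implementation.

```python
# Minimal sketch of a stepwise routing loop under the assumptions stated above.

def route_chain_of_thought(question, policy, small_model, large_model,
                           threshold=0.5, cost_small=1.0, cost_large=10.0,
                           max_steps=16):
    """Generate a chain of thought step by step, choosing a model per step."""
    steps, total_cost = [], 0.0
    for _ in range(max_steps):
        state = (question, tuple(steps))           # what the routing policy observes
        use_large = policy(state) >= threshold     # calibrated decision threshold
        model = large_model if use_large else small_model
        step = model(question, steps)
        total_cost += cost_large if use_large else cost_small
        steps.append(step)
        if step.strip().startswith("Answer:"):     # illustrative stopping rule
            break
    return steps, total_cost
```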

Core claim

We formulate stepwise model routing as a constrained decision-making problem, which we solve by training a small control policy using reinforcement learning in conjunction with threshold calibration to tune the performance-efficiency tradeoff. Our method consistently improves the accuracy-cost tradeoff compared to handcrafted approaches, while achieving a comparable tradeoff to methods that require training large process reward models.

What carries the argument

A compact reinforcement-learning policy that outputs routing actions at each intermediate chain-of-thought state, paired with post-training threshold calibration that sets the cost-performance operating point.
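One plausible reading of that calibration step is sketched below, under assumptions: a frozen policy, a hypothetical run_at_threshold(threshold, question) helper that runs the routed pipeline end to end and returns (is_correct, cost), and a selection rule that maximizes accuracy subject to a mean-cost budget. The paper may calibrate differently.

```python
# Minimal sketch of post-training threshold calibration under the stated assumptions.

def calibrate_threshold(run_at_threshold, calibration_questions,
                        candidate_thresholds, cost_budget):
    """Pick the threshold with the best accuracy whose mean cost fits the budget."""
    best = None  # (threshold, accuracy, mean_cost)
    for t in candidate_thresholds:
        results = [run_at_threshold(t, q) for q in calibration_questions]
        accuracy = sum(1 for correct, _ in results if correct) / len(results)
        mean_cost = sum(cost for _, cost in results) / len(results)
        if mean_cost <= cost_budget and (best is None or accuracy > best[1]):
            best = (t, accuracy, mean_cost)
    return best  # None if no threshold meets the budget
```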

If this is right

  • The method delivers higher accuracy for any given inference budget on the three evaluated math benchmarks.
  • No large process-reward model needs to be trained or stored at inference time.
  • The same small policy can be reused across both open-weight and proprietary model pairs.
  • Threshold calibration provides a direct knob to move along the accuracy-cost frontier after training.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same constrained-RL formulation could be applied to routing decisions that also respect latency or energy limits rather than cost alone.
  • If the policy learns general features of when a small model suffices, the approach may transfer to non-math domains such as code generation or scientific question answering.
  • Deploying only the small policy plus the two base models removes the memory and compute overhead of maintaining a separate large reward model.

Load-bearing premise

A small policy learned through reinforcement learning can discover routing decisions that remain reliable when applied to new problems and to model families it has not seen during training.

What would settle it

On a held-out math or reasoning benchmark, the learned policy produces an accuracy-cost curve that lies strictly below the curve obtained from a simple handcrafted rule such as 'use the large model for the first three steps, then switch to the small model.' That outcome would refute the premise; a curve that consistently sits above the handcrafted one would support it.
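A minimal sketch of that comparison, assuming each method is summarized by a list of (cost, accuracy) operating points collected at different budgets; the handcrafted rule and the dominance check below are illustrative, not taken from the paper.

```python
# Handcrafted baseline rule plus a generic check that one accuracy-cost curve
# lies strictly below another, under the assumptions stated above.

def first_k_steps_large(step_index, k=3):
    """Handcrafted baseline: route the first k steps to the large model."""
    return step_index < k  # True -> use the large model for this step

def lies_strictly_below(candidate_curve, baseline_curve):
    """True if, at every baseline cost level, the candidate's best accuracy
    achievable at that cost or less is strictly worse than the baseline's."""
    def best_accuracy_within(curve, budget):
        feasible = [acc for cost, acc in curve if cost <= budget]
        return max(feasible) if feasible else float("-inf")
    return all(best_accuracy_within(candidate_curve, cost) < acc
               for cost, acc in baseline_curve)
```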

Figures

Figures reproduced from arXiv: 2605.06116 by Insup Lee, Osbert Bastani, Wenwen Si.

Figure 1: Illustrations of policy-guided stepwise model routing.
Original abstract

Inference-time computation has greatly enhanced the performance of large language models (LLMs) on challenging reasoning tasks, but this strategy can incur high inference costs. One solution is to route intermediate chain-of-thought (CoT) states to language models of different sizes; however, existing approaches rely on handcrafted routing strategies that limit performance, or on training large process reward models that may be infeasible in many applications. We formulate stepwise model routing as a constrained decision-making problem, which we solve by training a small control policy using reinforcement learning in conjunction with threshold calibration to tune the performance-efficiency tradeoff. We validate our method on three math benchmarks (GSM8K, MATH500, and OmniMath) on both open and closed models. Our method consistently improves the accuracy-cost tradeoff compared to handcrafted approaches, while achieving a comparable tradeoff to methods that require training large process reward models.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper formulates stepwise model routing for LLM chain-of-thought reasoning as a constrained decision-making problem and solves it by training a small control policy with reinforcement learning together with threshold calibration. It evaluates the approach on GSM8K, MATH500, and OmniMath using both open and closed models, claiming consistent improvements in the accuracy-cost tradeoff over handcrafted baselines and performance comparable to methods that train large process reward models.

Significance. If the empirical results and generalization claims hold, the work offers a practical route to cost-effective inference-time reasoning that avoids the computational overhead of large process reward models. The combination of RL-based policy training with simple threshold tuning is a lightweight alternative to existing routing strategies and could broaden access to high-performance reasoning systems.

major comments (2)
  1. [Abstract] Abstract and experimental sections: the central claim of consistent accuracy-cost improvements and comparability to large-PRM methods is stated without quantitative results, error bars, ablation details on reward design or policy size, or cross-family transfer metrics. This leaves the load-bearing assumption—that a small RL policy discovers reliable per-step routing logic—unverified from the available information.
  2. [Method] The formulation treats routing as a constrained decision-making problem solved by RL plus threshold calibration, yet no derivation or pseudocode shows how the reward function avoids reducing to final-answer correctness alone (which would yield only coarse heuristics rather than stepwise decisions).
minor comments (2)
  1. [Method] Notation for the state representation and action space of the control policy should be defined explicitly with an equation or diagram to clarify what information is available to the small policy at each step.
  2. [Experiments] The three benchmarks are listed but no table or figure caption indicates the exact model families, sizes, or cost metrics (e.g., tokens or latency) used for the open/closed comparisons.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed and constructive feedback. We address the major comments point by point below and are happy to revise the manuscript to strengthen the presentation of our results and method.

Point-by-point responses
  1. Referee: [Abstract] Abstract and experimental sections: the central claim of consistent accuracy-cost improvements and comparability to large-PRM methods is stated without quantitative results, error bars, ablation details on reward design or policy size, or cross-family transfer metrics. This leaves the load-bearing assumption—that a small RL policy discovers reliable per-step routing logic—unverified from the available information.

    Authors: The experimental sections report quantitative accuracy-cost tradeoffs on GSM8K, MATH500, and OmniMath across open and closed models, with direct comparisons to handcrafted baselines and large-PRM methods. To make these claims more verifiable, we will add error bars from repeated runs, ablations varying reward components and policy sizes, and cross-family transfer results in the revised version. These additions will better document the per-step routing decisions learned by the small policy. revision: yes

  2. Referee: [Method] The formulation treats routing as a constrained decision-making problem solved by RL plus threshold calibration, yet no derivation or pseudocode shows how the reward function avoids reducing to final-answer correctness alone (which would yield only coarse heuristics rather than stepwise decisions).

    Authors: The reward combines a terminal correctness signal with per-step quality indicators derived from the constrained formulation, so that the policy learns to route at individual CoT steps rather than only at the end. We agree the manuscript would benefit from an explicit derivation and pseudocode; we will insert both in the revised Method section to clarify how the reward encourages fine-grained stepwise decisions. revision: yes
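For a concrete picture of the reward shape described in this response, here is a minimal sketch that adds a terminal correctness bonus to per-step quality and cost terms; the weights and the notion of step quality are illustrative assumptions, not the paper's definition.

```python
# Minimal sketch of an episode reward combining per-step shaping with a terminal bonus.

def episode_reward(step_qualities, step_costs, final_correct,
                   quality_weight=0.1, cost_weight=0.01, terminal_weight=1.0):
    """Combine per-step shaping terms with a terminal correctness bonus.

    step_qualities: floats in [0, 1], one per routed CoT step
    step_costs:     inference cost charged for each step
    final_correct:  whether the finished chain reaches the right answer
    """
    shaping = sum(quality_weight * q - cost_weight * c
                  for q, c in zip(step_qualities, step_costs))
    return shaping + terminal_weight * float(final_correct)
```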

Circularity Check

0 steps flagged

No circularity: standard RL formulation with independent empirical validation

Full rationale

The paper formulates stepwise model routing as a constrained decision-making problem and solves it by training a small control policy via reinforcement learning plus threshold calibration. This is a direct application of RL to the routing task with no equations or claims that reduce the reported accuracy-cost improvements to fitted parameters by construction, self-definitional loops, or load-bearing self-citations. The abstract and method description present the RL policy as an independent solver whose performance is validated externally on GSM8K, MATH500, and OmniMath across model families, without renaming known results or smuggling ansatzes via prior work. No load-bearing step collapses to its inputs.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

Based solely on the abstract, the approach rests on standard assumptions about model size differences and RL applicability rather than new invented entities or heavily fitted parameters.

free parameters (1)
  • routing thresholds
    Calibrated to control the performance-efficiency tradeoff; exact values not specified in abstract.
axioms (1)
  • domain assumption: Language models of different sizes exhibit distinct accuracy and cost profiles that can be exploited by routing decisions during reasoning.
    Implicit foundation for the stepwise routing strategy.

pith-pipeline@v0.9.0 · 5447 in / 1266 out tokens · 80931 ms · 2026-05-08T10:25:00.834103+00:00 · methodology


Reference graph

Works this paper leans on

17 extracted references · 17 canonical work pages · 8 internal anchors

  1. [1]

    Online Conformal Prediction with Decaying Step Sizes

    Anastasios N Angelopoulos, Rina Foygel Barber, and Stephen Bates. Online conformal prediction with decaying step sizes. arXiv preprint arXiv:2402.01139.

  2. [2]

    Large Language Monkeys: Scaling Inference Compute with Repeated Sampling

    Bradley Brown, Jordan Juravsky, Ryan Ehrlich, Ronald Clark, Quoc V Le, Christopher Ré, and Azalia Mirhoseini. Large language monkeys: Scaling inference compute with repeated sampling. arXiv preprint arXiv:2407.21787.

  3. [3]

    Accelerating Large Language Model Decoding with Speculative Sampling

    Charlie Chen, Sebastian Borgeaud, Geoffrey Irving, Jean-Baptiste Lespiau, Laurent Sifre, and John Jumper. Accelerating large language model decoding with speculative sampling. arXiv preprint arXiv:2302.01318, 2023.

  4. [4]

    FrugalGPT: How to Use Large Language Models While Reducing Cost and Improving Performance

    Lingjiao Chen, Matei Zaharia, and James Zou. FrugalGPT: How to use large language models while reducing cost and improving performance. arXiv preprint arXiv:2305.05176, 2023.

  5. [5]

    Omni-MATH: A Universal Olympiad Level Mathematic Benchmark for Large Language Models

    Bofei Gao, Feifan Song, Zhe Yang, Zefan Cai, Yibo Miao, Qingxiu Dong, Lei Li, Chenghao Ma, Liang Chen, Runxin Xu, et al. Omni-MATH: A universal olympiad level mathematic benchmark for large language models. arXiv preprint arXiv:2410.07985.

  6. [6]

    Measuring Mathematical Problem Solving with the MATH Dataset

    Dan Hendrycks, Collin Burns, Saurav Kadavath, Akul Arora, Steven Basart, Eric Tang, Dawn Song, and Jacob Steinhardt. Measuring mathematical problem solving with the MATH dataset. arXiv preprint arXiv:2103.03874.

  7. [7]

    Deep Learning Scaling is Predictable, Empirically

    Joel Hestness, Sharan Narang, Newsha Ardalani, Gregory Diamos, Heewoo Jun, Hassan Kianinejad, Md Mostofa Ali Patwary, Yang Yang, and Yanqi Zhou. Deep learning scaling is predictable, empirically. arXiv preprint arXiv:1712.00409.

  8. [8]

    Scaling Laws for Neural Language Models

    Jared Kaplan, Sam McCandlish, Tom Henighan, Tom B Brown, Benjamin Chess, Rewon Child, Scott Gray, Alec Radford, Jeffrey Wu, and Dario Amodei. Scaling laws for neural language models. arXiv preprint arXiv:2001.08361.

  9. [9]

    Reasoning with Sampling: Your Base Model is Smarter Than You Think

    Aayush Karan and Yilun Du. Reasoning with sampling: Your base model is smarter than you think. arXiv preprint arXiv:2510.14901.

  10. [10]

    Reward-Guided Speculative Decoding for Efficient LLM Reasoning

    Baohao Liao, Yuhui Xu, Hanze Dong, Junnan Li, Christof Monz, Silvio Savarese, Doyen Sahoo, and Caiming Xiong. Reward-guided speculative decoding for efficient LLM reasoning. arXiv preprint arXiv:2501.19324.

  11. [11]

    KOR-Bench: Benchmarking Language Models on Knowledge-Orthogonal Reasoning Tasks

    Kaijing Ma, Xinrun Du, Yunran Wang, Haoran Zhang, Zhoufutu Wen, Xingwei Qu, Jian Yang, Jiaheng Liu, Minghao Liu, Xiang Yue, et al. KOR-Bench: Benchmarking language models on knowledge-orthogonal reasoning tasks. arXiv preprint arXiv:2410.06526.

  12. [12]

    RouteLLM: Learning to Route LLMs with Preference Data

    Isaac Ong, Amjad Almahairi, Vincent Wu, Wei-Lin Chiang, Tianhao Wu, Joseph E Gonzalez, M Waleed Kadous, and Ion Stoica. RouteLLM: Learning to route LLMs with preference data. arXiv preprint arXiv:2406.18665.

  13. [13]

    SpecReason: Fast and Accurate Inference-Time Compute via Speculative Reasoning

    Rui Pan, Yinwei Dai, Zhihao Zhang, Gabriele Oliaro, Zhihao Jia, and Ravi Netravali. SpecReason: Fast and accurate inference-time compute via speculative reasoning. arXiv preprint arXiv:2504.07891.

  14. [14]

    Robots That Ask For Help: Uncertainty Alignment for Large Language Model Planners

    Allen Z Ren, Anushri Dixit, Alexandra Bodrova, Sumeet Singh, Stephen Tu, Noah Brown, Peng Xu, Leila Takayama, Fei Xia, Jake Varley, et al. Robots that ask for help: Uncertainty alignment for large language model planners. arXiv preprint arXiv:2307.01928.

  15. [15]

    Beyond Chinchilla-Optimal: Accounting for Inference in Language Model Scaling Laws

    Nikhil Sardana, Jacob Portes, Sasha Doubov, and Jonathan Frankle. Beyond chinchilla-optimal: Accounting for inference in language model scaling laws. arXiv preprint arXiv:2401.00448.

  16. [16]

    Self-Consistency Improves Chain of Thought Reasoning in Language Models

    Xuezhi Wang, Jason Wei, Dale Schuurmans, Quoc Le, Ed Chi, Sharan Narang, Aakanksha Chowdhery, and Denny Zhou. Self-consistency improves chain of thought reasoning in language models. arXiv preprint arXiv:2203.11171.

  17. [17]

    Qwen2.5-Math Technical Report: Toward Mathematical Expert Model via Self-Improvement

    An Yang, Beichen Zhang, Binyuan Hui, Bofei Gao, Bowen Yu, Chengpeng Li, Dayiheng Liu, Jianhong Tu, Jingren Zhou, Junyang Lin, et al. Qwen2.5-Math technical report: Toward mathematical expert model via self-improvement. arXiv preprint arXiv:2409.12122.