arxiv: 2605.16604 · v1 · pith:PNF2PRQ2new · submitted 2026-05-15 · 💻 cs.LG

R2V Agent: Teaching SLMs When to Ask for Help

Pith reviewed 2026-05-20 20:20 UTC · model grok-4.3

classification 💻 cs.LG
keywords R2V-Agent · SLM-LLM routing · risk calibration · process verifier · interactive agents · reliability-cost frontier · step-level router · agentic systems

The pith

A calibrated router lets small language models run interactive agents and escalates to large models only on steps where failure is likely.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows how to build agents that mostly use cheap small language models but still reach high reliability by calling a large model at risky moments during execution. Difficulty changes during a task after tool calls or errors, so the system first trains a small policy, then learns a router that spots that policy's residual failures using a verifier and risk-aware training. Experiments on coding, text adventure, and terminal tasks report higher success rates at lower escalation costs than previous routing methods. Readers should care because this makes capable agents practical without calling an expensive model at every step.

Core claim

R2V-Agent is a risk-calibrated SLM-LLM routing framework for interactive agents. It first trains a stable small language model policy through behavioral cloning on teacher trajectories followed by verifier-guided direct preference optimization. A lightweight process verifier then scores candidate actions at each step, and a step-level router is trained on the fixed policy's residual failures using Brier-calibrated probability estimation and a CVaR-constrained objective. This produces escalation decisions that improve the reliability-cost frontier across HumanEval+, TextWorld, and TerminalBench.
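The inference-time behaviour described above can be sketched as a short loop. This is a minimal illustration, not the paper's implementation: `run_episode`, `StepRecord`, the callables, and the threshold `tau` are hypothetical names, and the observations are a fixed list here, whereas a real agent would obtain each one from the environment after acting.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class StepRecord:
    action: str      # action finally executed at this step
    risk: float      # router's estimated failure risk for the SLM action
    escalated: bool  # True if the teacher LLM replaced the SLM action


def run_episode(
    observations: List[str],
    slm_policy: Callable[[str], str],
    teacher_policy: Callable[[str], str],
    risk_estimator: Callable[[str, str], float],
    tau: float,
) -> List[StepRecord]:
    """Act with the SLM by default; call the teacher only when risk exceeds tau."""
    records: List[StepRecord] = []
    for obs in observations:
        action = slm_policy(obs)            # cheap default proposal
        risk = risk_estimator(obs, action)  # hypothetical stand-in for the router
        escalated = risk > tau
        if escalated:
            action = teacher_policy(obs)    # expensive call, only on risky steps
        records.append(StepRecord(action, risk, escalated))
    return records


def escalation_fraction(records: List[StepRecord]) -> float:
    """Share of steps on which the teacher LLM was invoked."""
    if not records:
        return 0.0
    return sum(r.escalated for r in records) / len(records)
```

Raising `tau` lowers the escalation fraction at the price of more unassisted SLM steps, which is the trade-off the reliability-cost frontier measures.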

What carries the argument

The calibrated step-level router, which estimates residual failure risk for the fixed small policy at each step and escalates to the teacher LLM only when that estimate exceeds a routing threshold. Its risk estimates are trained with Brier-calibrated probability estimation and a CVaR-constrained objective.
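For orientation, these are the textbook definitions of the two quantities named here. This page does not reproduce the paper's exact router objective, so the symbols ($p_i$ for a predicted failure probability, $y_i$ for the observed binary outcome, $L$ for a loss, $\alpha$ for the tail level) are assumptions for illustration only.

```latex
% Brier score of predicted failure probabilities p_i against binary outcomes y_i
\mathrm{BS} = \frac{1}{N} \sum_{i=1}^{N} (p_i - y_i)^2, \qquad y_i \in \{0, 1\}

% CVaR at level \alpha of a loss L (Rockafellar-Uryasev form)
\mathrm{CVaR}_{\alpha}(L) = \min_{\nu \in \mathbb{R}}
  \left\{ \nu + \frac{1}{1 - \alpha}\, \mathbb{E}\big[(L - \nu)_{+}\big] \right\}
```

A lower Brier score means the predicted probabilities track observed failure frequencies more closely, while CVaR summarizes the average loss in the worst $(1 - \alpha)$ tail rather than the overall mean.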

Load-bearing premise

The lightweight process verifier can accurately score how likely the small model is to fail on candidate actions so the router produces reliable escalation decisions that generalize beyond the training perturbations.

What would settle it

Test the complete R2V system on a new interactive task whose failure modes differ from those generated by the perturbation seeds used to train the router, and check whether success rates and escalation fractions remain close to the reported values.

Figures

Figures reproduced from arXiv: 2605.16604 by Humaira Firdowse Mohammed, Raghu Vamshi Hemadri, Rishabh Maheshwary, Sagar Davasam, Sai Rajeswar, Srinivas Sunkara, Srivatsava Daruru, Vikas Yadav.

Figure 1: Traditional agentic workflows versus R2V-Agent. Monolithic frontier-LLM execution is …

Figure 2: R2V-Agent pipeline. Phase I: Teacher trajectories are perturbed to train a BC-initialized SLM with verifier-guided DPO and consistency regularization; verifier and policy features then train a Brier-calibrated and CVaR-calibrated router. Phase II: At inference, the SLM acts by default, while the teacher LLM is invoked only when the router's residual-risk estimate exceeds $\tau^{*}_{\text{route}}$.

Figure 3: Cost-performance Pareto frontier. Each R2V point corresponds to one SLM backbone with 95% bootstrap confidence intervals. R2V gives near-free gains on HumanEval+, closely tracks the oracle on TextWorld, and recovers substantial SR for weaker TerminalBench backbones while remaining below heuristic-router cost.
Original abstract

Efficient agentic systems should incur expensive frontier-model costs only on decisions where a cheaper local model is likely to fail. Existing LLM cascades usually route whole queries before execution, but task difficulty shifts mid-trajectory (after flaky tool calls, truncated observations, or compounding local errors), making pre-execution routing brittle. We introduce R2V-Agent, a risk-calibrated SLM-LLM routing framework for interactive agents. R2V combines four components: a distilled small language model (SLM) policy, a stronger teacher LLM, a lightweight process verifier that scores candidate actions at each step, and a calibrated step-level router. The router is our central contribution: after the SLM is trained, it estimates residual failure risk at each step and escalates only when teacher intervention is warranted. To make the routing problem well-defined, we first train a stable local SLM using a standard offline pipeline: behavioral cloning (BC) on teacher trajectories, followed by verifier-guided Direct Preference Optimization (DPO) with consistency regularization. The router is then trained on this fixed policy's residual failures using Brier-calibrated probability estimation and a Conditional Value-at-Risk (CVaR)-constrained objective that penalizes worst-case failures across perturbation seeds. Across HumanEval+, TextWorld, and TerminalBench with four SLM backbones, R2V improves the reliability-cost frontier: it achieves 94.3% HumanEval+ success with 0.60% LLM escalation, recovers TextWorld from 64.6% SLM-only success to 98.2% at 41.7% escalation, and reaches 93.3% TerminalBench success at 33.9% LLM calls, roughly half the heuristic-router cost.
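The two statistics the abstract names for router training can be computed empirically as follows. This is a sketch under stated assumptions rather than the paper's training objective: `brier_score` and `empirical_cvar` are hypothetical helper names, and the CVaR estimator simply averages the worst fraction of per-seed losses.

```python
import math
from typing import Sequence


def brier_score(probs: Sequence[float], outcomes: Sequence[int]) -> float:
    """Mean squared gap between predicted failure probability and 0/1 outcome."""
    if len(probs) != len(outcomes) or len(probs) == 0:
        raise ValueError("probs and outcomes must be non-empty and equal length")
    return sum((p - y) ** 2 for p, y in zip(probs, outcomes)) / len(probs)


def empirical_cvar(losses: Sequence[float], alpha: float) -> float:
    """Average of the worst ceil((1 - alpha) * n) losses (upper tail)."""
    if len(losses) == 0 or not 0.0 <= alpha < 1.0:
        raise ValueError("need at least one loss and 0 <= alpha < 1")
    # round() guards against float noise such as 3.0000000000000004
    k = max(1, math.ceil(round((1.0 - alpha) * len(losses), 9)))
    worst = sorted(losses, reverse=True)[:k]
    return sum(worst) / k
```

As a usage note, with illustrative per-seed failure rates `[0.1, 0.2, 0.3, 0.4, 1.0]` and `alpha = 0.6`, the estimator averages the two worst seeds, so a single catastrophic seed dominates the value even though the mean over all seeds is much lower.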

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 2 minor

Summary. The paper introduces R2V-Agent, a risk-calibrated SLM-LLM routing framework for interactive agents. After training a stable SLM policy via behavioral cloning followed by verifier-guided DPO with consistency regularization, a lightweight process verifier scores candidate actions and a step-level router is trained on the fixed SLM policy's residual failures using Brier-calibrated probability estimation and a CVaR-constrained objective over perturbation seeds. The central claim is that this produces reliable escalation decisions, yielding improved reliability-cost frontiers: 94.3% success on HumanEval+ at 0.60% LLM escalation, recovery of TextWorld to 98.2% success at 41.7% escalation, and 93.3% success on TerminalBench at 33.9% LLM calls (roughly half heuristic-router cost) across four SLM backbones.

Significance. If the empirical results and generalization hold, the work provides a concrete mechanism for dynamic, mid-trajectory routing that addresses limitations of static query-level cascades. The combination of a frozen SLM policy, process verifier, and CVaR-regularized router training offers a reproducible template for cost-efficient agent deployment; the reported quantitative gains on three distinct benchmarks constitute a falsifiable prediction that can be directly tested by other groups.

major comments (1)
  1. [Router training and deployment procedure] Router training (described after the SLM policy is frozen): the CVaR objective and perturbation seeds are drawn exclusively from SLM-only trajectories. In deployment, an LLM escalation produces a new observation that becomes input to the next SLM step, creating mixed trajectories absent from the training distribution. This covariate shift is not controlled by the reported perturbation procedure, so the calibration of escalation decisions (and thus the reported escalation rates such as 41.7% on TextWorld) rests on an untested assumption.
minor comments (2)
  1. [Abstract] The abstract states the heuristic-router comparison yields 'roughly half' the cost but does not define the heuristic or report the exact baseline escalation percentages and success rates for direct comparison.
  2. [Evaluation] No error bars, standard deviations, or number of random seeds are mentioned for the success and escalation figures; adding these would clarify whether the frontier improvements are statistically distinguishable from the SLM-only and heuristic baselines.
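One common way to attach the uncertainty estimates requested in minor comment 2 is a percentile bootstrap over per-episode outcomes. The sketch below assumes a list of 0/1 success flags; `bootstrap_ci` and its parameters are hypothetical, and nothing here reproduces the paper's own evaluation code.

```python
import random
from typing import Sequence, Tuple


def bootstrap_ci(
    outcomes: Sequence[int],
    n_resamples: int = 2000,
    level: float = 0.95,
    seed: int = 0,
) -> Tuple[float, float]:
    """Percentile-bootstrap confidence interval for a success rate."""
    if len(outcomes) == 0:
        raise ValueError("outcomes must be non-empty")
    rng = random.Random(seed)  # fixed seed keeps the interval reproducible
    n = len(outcomes)
    # Resample episodes with replacement and record each resample's mean.
    means = sorted(
        sum(rng.choices(outcomes, k=n)) / n for _ in range(n_resamples)
    )
    lo_idx = int((1.0 - level) / 2.0 * n_resamples)
    hi_idx = min(n_resamples - 1, int((1.0 + level) / 2.0 * n_resamples))
    return means[lo_idx], means[hi_idx]
```

Non-overlapping intervals between a router and a baseline would suggest the gap is not an artifact of episode sampling, although seed-level variance would need a separate resampling over seeds.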

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive and detailed review of our manuscript. We address the major comment below regarding the router training procedure and have incorporated revisions to clarify and strengthen the relevant sections.

Point-by-point responses
  1. Referee: [Router training and deployment procedure] Router training (described after the SLM policy is frozen): the CVaR objective and perturbation seeds are drawn exclusively from SLM-only trajectories. In deployment, an LLM escalation produces a new observation that becomes input to the next SLM step, creating mixed trajectories absent from the training distribution. This covariate shift is not controlled by the reported perturbation procedure, so the calibration of escalation decisions (and thus the reported escalation rates such as 41.7% on TextWorld) rests on an untested assumption.

    Authors: We acknowledge the referee's observation about the training distribution for the router. The CVaR objective and perturbations are indeed derived from SLM-only trajectories to capture variability in observations and action outcomes. However, all reported performance metrics (including success rates, escalation percentages, and cost tradeoffs on HumanEval+, TextWorld, and TerminalBench) are measured in full end-to-end deployment, where LLM escalations naturally occur and generate mixed trajectories. These empirical results therefore already reflect the router's behavior under the actual deployment distribution. To further address the concern, we have revised the manuscript to include an explicit discussion of this training-deployment mismatch in Section 4.3 and added an ablation study that incorporates a small number of mixed trajectories into router training data; the ablation shows that the reported calibration and escalation rates remain stable. We believe these changes strengthen the presentation without changing the core claims or methodology.

    revision: yes

Circularity Check

0 steps flagged

No significant circularity; derivation remains self-contained

full rationale

The paper first trains and freezes a stable SLM policy via behavioral cloning followed by verifier-guided DPO. The router is then trained separately on that fixed policy's residual failures using Brier calibration and a CVaR objective over perturbation seeds. Reported metrics (e.g., 94.3% success at 0.60% escalation) are measured outcomes on evaluation trajectories, not quantities that reduce by construction to the training fit itself. No equations equate a prediction directly to its input parameters, and no load-bearing self-citations or uniqueness theorems are invoked in the described chain. The central claim therefore rests on empirical evaluation rather than definitional equivalence.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 1 invented entities

The framework rests on standard supervised and preference-optimization pipelines plus two new modeling choices: a lightweight verifier and a CVaR objective over perturbation seeds. Only the abstract was available, so the background assumptions cannot be listed exhaustively.

axioms (1)
  • domain assumption: The SLM policy remains fixed after its BC+DPO training when the router is trained on its residual failures.
    Abstract states 'after the SLM is trained, it estimates residual failure risk' and 'the router is then trained on this fixed policy's residual failures'.
invented entities (1)
  • Calibrated step-level router (no independent evidence)
    purpose: Estimates per-step residual failure risk to decide LLM escalation.
    Presented as the central contribution; no external validation or prior equivalent cited in abstract.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

  • IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear

    Relation between the paper passage and the cited Recognition theorem.

    The router is then trained on this fixed policy's residual failures using Brier-calibrated probability estimation and a Conditional Value-at-Risk (CVaR)-constrained objective that penalizes worst-case failures across perturbation seeds.

  • IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction · unclear

    Relation between the paper passage and the cited Recognition theorem.

    R2V factorizes execution into four components: an efficient SLM policy $\pi_\theta$, a stronger teacher LLM $\pi_T$, a lightweight process verifier $V_\phi(x_t, a_t)$ that scores candidate actions, and a router $r_\psi(f_t)$ that estimates whether the current step should be escalated.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
