RLM-Cascade: Response-Level Speculative Decoding for Cost-Efficient LLM API Serving

Fangbo Tu; Haifeng Wu; Jian Wan; Junhua Zhao; Srinivasan Manoharan

arxiv: 2606.22840 · v1 · pith:QFMTYWN7new · submitted 2026-06-22 · 💻 cs.LG

Pith reviewed 2026-06-26 09:08 UTC · model grok-4.3

classification 💻 cs.LG

keywords: speculative decoding, LLM API serving, cost reduction, agentic workloads, proxy layer, complexity router, response-level decoding, tool-call strategy

The pith

A proxy applies response-level speculative decoding to cut LLM API costs by 45.8% while also reducing median latency on agentic coding workloads.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces RLM-Cascade, a proxy-layer system that lets an inexpensive draft model generate a full candidate response first, after which a capable verify model accepts it, enhances it, or is bypassed entirely according to a lightweight complexity router. This response-level approach works without model architecture access or shared vocabulary and targets cost reduction for LLM API serving. On 125 real production requests from an agentic coding workload, the system reaches an 88.8% draft-use rate, delivers a 45.8% cost reduction versus direct use of the strong model, and achieves a 1.83X median latency speedup because the skip path dominates. Quality matches or exceeds the baseline, with a 100% pass rate on a 20-task benchmark versus 95% for the native strong model.

Core claim

RLM-Cascade applies speculative decoding at the response level via a proxy that generates a candidate with an inexpensive draft model and then uses a capable verify model to accept, enhance, or bypass based on a lightweight complexity router. On a real-world agentic coding workload, this yields an 88.8% draft-use rate, 45.8% cost reduction versus direct use of the strong model, and a 1.83X median latency speedup because the skip path is used most often. Quality is maintained or improved, with 100% pass rate on benchmarks versus 95% for the baseline.

What carries the argument

The response-level speculative decoding proxy, in which a draft model proposes a complete response and a verify model decides acceptance, enhancement, or full bypass via a complexity router that selects skip paths for simple turns.
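
The control flow described here can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: the outcome labels (SKIPPED, ACCEPTED, ENHANCED) and the `USE_DRAFT` verdict come from the paper's Figure 1 caption, while `cascade`, `Route`, and the four callables are hypothetical names standing in for the router and the model clients.

```python
from enum import Enum
from typing import Callable, Tuple


class Route(Enum):
    SKIP = "skip"      # draft model only, no verify call
    VERIFY = "verify"  # draft first, then the verify model judges the draft
    DIRECT = "direct"  # bypass the speculative pipeline, strong model only


# Verdict string the verifier returns to accept the draft unchanged.
# How the real system encodes this verdict is an assumption here.
USE_DRAFT = "USE_DRAFT"


def cascade(
    prompt: str,
    route: Callable[[str], Route],
    draft: Callable[[str], str],
    verify: Callable[[str, str], str],
    strong: Callable[[str], str],
) -> Tuple[str, str]:
    """Serve one request and return (outcome_label, response)."""
    decision = route(prompt)
    if decision is Route.DIRECT:
        # Schema-critical turns never touch the draft model.
        return "DIRECT", strong(prompt)

    candidate = draft(prompt)
    if decision is Route.SKIP:
        # Simple turn: return the draft without calling the verifier.
        return "SKIPPED", candidate

    verdict = verify(prompt, candidate)
    if verdict.strip() == USE_DRAFT:
        return "ACCEPTED", candidate
    # Any other verifier output is treated as the rewritten response.
    return "ENHANCED", verdict


if __name__ == "__main__":
    # Stub models, for illustration only.
    outcome = cascade(
        "list files",
        route=lambda p: Route.SKIP if len(p) < 20 else Route.VERIFY,
        draft=lambda p: f"draft({p})",
        verify=lambda p, c: USE_DRAFT,
        strong=lambda p: f"strong({p})",
    )
    print(outcome)
```

Passing the router and models in as callables keeps the sketch independent of any particular API client, which matches the paper's claim that the proxy needs no access to model internals.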

If this is right

API costs fall by nearly half on agentic coding workloads due to the high rate of draft acceptance and skip paths.
Median end-to-end latency drops by a factor of 1.83 because most turns avoid the expensive verify model entirely.
Output quality stays at or above the direct strong-model baseline on code, math, and instruct tasks.
The system requires no model internals or shared vocabulary, allowing it to sit in front of any existing LLM API pair.
Production use is supported by rule-based routing for tool calls and open-source deployment with live metrics.
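
To see why the mix of paths drives the savings, a blended per-request cost can be written as a share-weighted sum over paths. The sketch below is illustrative only: the path shares and relative costs are placeholder values, not figures from the paper, and the function name `expected_cost` is hypothetical. Costs are expressed in units of one direct strong-model call.

```python
from typing import Dict


def expected_cost(shares: Dict[str, float], costs: Dict[str, float]) -> float:
    """Expected per-request cost: sum over paths of share * cost."""
    if abs(sum(shares.values()) - 1.0) > 1e-9:
        raise ValueError("path shares must sum to 1")
    return sum(shares[path] * costs[path] for path in shares)


# Placeholder numbers, NOT taken from the paper.
shares = {"skipped": 0.60, "accepted": 0.20, "enhanced": 0.10, "direct": 0.10}
costs = {
    "skipped": 0.05,   # draft call only
    "accepted": 0.35,  # draft call plus a short verify call
    "enhanced": 1.15,  # draft call plus a full rewrite by the verifier
    "direct": 1.00,    # strong model only
}

blended = expected_cost(shares, costs)
savings = 1.0 - blended
print(f"blended cost {blended:.3f} of baseline, savings {savings:.1%}")
```

Under this model, the enhanced path costs more than the baseline, so savings depend on the skipped and accepted shares staying high.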

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same proxy pattern could be tested on non-coding agentic workloads to check whether the router generalizes without retraining.
Enterprises with high-volume LLM usage might combine this layer with existing caching or batching to compound savings.
If the skip path remains dominant, overall energy use for serving would decline even though two models are involved in some turns.
A controlled experiment swapping the draft model for alternatives would show how sensitive the 88.8% acceptance rate is to draft capability.

Load-bearing premise

The lightweight complexity router can correctly identify when a simple agentic turn can be handled by the draft model alone or when to bypass the pipeline for schema-critical cases without degrading response quality.
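
The paper does not publish the router's rules, so the following is only a guess at what a rule-based router of this kind might look like. It uses the three feature types named in the simulated rebuttal further down (token count, presence of tool calls, lexical indicators of complexity). Every threshold, keyword, and name here (`Turn`, `route_turn`, `MAX_SIMPLE_TOKENS`, `COMPLEX_HINTS`, `needs_tool_choice`) is hypothetical.

```python
import re
from dataclasses import dataclass

# Hypothetical values chosen for illustration only.
MAX_SIMPLE_TOKENS = 200
COMPLEX_HINTS = re.compile(
    r"\b(refactor|architecture|prove|optimi[sz]e|debug)\b", re.IGNORECASE
)


@dataclass
class Turn:
    text: str
    # Hypothetical flag: the turn must select a tool and fill in its schema.
    needs_tool_choice: bool = False


def route_turn(turn: Turn) -> str:
    """Return 'direct', 'verify', or 'skip' for one agentic turn."""
    if turn.needs_tool_choice:
        # Schema-critical turns bypass the speculative pipeline.
        return "direct"
    n_tokens = len(turn.text.split())  # whitespace split as a crude token count
    if n_tokens > MAX_SIMPLE_TOKENS or COMPLEX_HINTS.search(turn.text):
        return "verify"
    return "skip"


if __name__ == "__main__":
    for t in (Turn("list the files"), Turn("please refactor this module")):
        print(t.text, "->", route_turn(t))
```

Rules this simple are cheap to evaluate, but they also show why the premise is load-bearing: a long or keyword-free request that is actually hard would be routed to the skip path with no verifier to catch a bad draft.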

What would settle it

Running the deployed system on a fresh set of 100 production agentic requests and finding either a pass rate below 95% on the benchmark tasks or realized cost savings below 30% would falsify the central performance claims.
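
The falsification test above reduces to two threshold checks, shown here as a small predicate. The thresholds (95% pass rate, 30% cost savings) are the ones stated in the paragraph; the function name and signature are hypothetical.

```python
def central_claims_falsified(
    pass_rate: float,
    cost_savings: float,
    min_pass_rate: float = 0.95,
    min_savings: float = 0.30,
) -> bool:
    """True if a fresh run misses either the quality or the savings threshold."""
    return pass_rate < min_pass_rate or cost_savings < min_savings


if __name__ == "__main__":
    # The paper's reported figures clear both thresholds.
    print(central_claims_falsified(pass_rate=1.00, cost_savings=0.458))
```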

Figures

Figures reproduced from arXiv: 2606.22840 by Fangbo Tu, Haifeng Wu, Jian Wan, Junhua Zhao, Srinivasan Manoharan.

**Figure 1.** RLM-Cascade end-to-end architecture. SKIPPED (64–70% of requests): simple turns are routed to DeepSeek only and returned directly to the client with no Opus call (top rail). Draft+Verify: complex turns go to DeepSeek, then the draft is validated by Opus, which emits ACCEPTED (USE_DRAFT) or ENHANCED (rewritten response), arriving at the client via out.west. Direct: tool-selection turns bypass the pipeline a…

Original abstract

We present RLM-Cascade, a proxy-layer system that applies speculative decoding at the response level to reduce LLM API costs without requiring model architecture access or a shared vocabulary. A fast, inexpensive draft model generates a candidate response; a capable verify model accepts, enhances, or is bypassed entirely depending on a lightweight complexity router. On a real-world agentic coding workload (Claude Code), RLM-Cascade achieves a draft-use rate of 88.8% across 125 production requests, reducing API cost by 45.8% relative to a direct Opus baseline. Counter-intuitively, the proxy also reduces end-to-end latency: median response time is 2,026 ms versus 3,698 ms for Native Opus -- a 1.83X speedup at p50 -- because the SKIPPED path (DeepSeek only, no Opus call) dominates the workload distribution. Quality matches or exceeds the Opus baseline: 100% pass rate on a 20-task Code/Math/Instruct benchmark versus 95% for Native Opus. We further describe a rule-based complexity router that selects the SKIPPED path for simple agentic turns and a hybrid tool-call strategy that bypasses the speculative pipeline for schema-critical tool-selection turns. RLM-Cascade is deployed in production as an enterprise AI infrastructure component and published as open source with a live metrics dashboard and Prometheus endpoint.
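
The headline figures in the abstract can be cross-checked with simple arithmetic. The snippet below uses only numbers stated in the abstract; the variable names are illustrative.

```python
# Figures quoted in the abstract.
total_requests = 125
draft_use_rate = 0.888
p50_cascade_ms = 2026
p50_native_ms = 3698
benchmark_tasks = 20

# 88.8% of 125 requests served from the draft.
draft_requests = round(draft_use_rate * total_requests)

# Median speedup and absolute time saved at p50.
speedup = p50_native_ms / p50_cascade_ms
saved_ms = p50_native_ms - p50_cascade_ms

# A 95% pass rate on 20 tasks corresponds to one failed task.
native_passed = round(0.95 * benchmark_tasks)

print(draft_requests, round(speedup, 2), saved_ms, native_passed)
# prints: 111 1.83 1672 19
```

The quality gap between 100% and 95% on this benchmark is therefore a single task out of 20.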

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

Response-level speculative decoding via a proxy cuts API costs on agentic workloads, with reported 45.8% savings and latency wins, but the high-level router description leaves the 88.8% draft-use rate hard to verify beyond this workload.

The key takeaway is that RLM-Cascade runs a cheap draft model to generate full responses, then routes via a lightweight complexity router to accept, enhance with a strong model, or skip entirely, plus a bypass for tool calls. On 125 real production requests from an agentic coding workload it hits 88.8% draft use, 45.8% cost reduction versus direct Opus, and 1.83X median latency improvement because the skip path dominates.

What stands out as new is the response-level framing instead of token-by-token speculative decoding, done as an external proxy with no model architecture access or shared vocabulary required. The rule-based router and hybrid tool-call handling are practical additions for agentic settings.

The paper does well by grounding claims in production traffic rather than synthetic benchmarks, showing the counter-intuitive latency win, hitting 100% on their 20-task Code/Math/Instruct set, and shipping open-source code with a live dashboard.

The soft spot is the router. It is described only as rule-based and selecting the skip path for simple turns, with no rules, features, thresholds, or per-turn error breakdown given. Without that, the high draft-use rate could reflect workload bias more than robust routing, and it is unclear how quality would hold if misclassifications rise on different traffic. The benchmark size is also small.

This is for engineers building LLM serving layers or enterprise API cost controls. It deserves a serious referee because the real-workload numbers are concrete and the system is already deployed, even if the methods section will need expansion for reproducibility.

Recommendation: send to peer review.

Referee Report

2 major / 2 minor

Summary. The paper presents RLM-Cascade, a proxy-layer system for response-level speculative decoding that uses a fast draft model (DeepSeek) to generate candidate responses, a verify model (Opus), and a lightweight rule-based complexity router to select among ACCEPT, ENHANCE, or SKIPPED paths (with a hybrid tool-call bypass for schema-critical cases). On 125 production requests from a Claude Code agentic coding workload, it reports an 88.8% draft-use rate yielding 45.8% API cost reduction versus direct Opus, a 1.83X p50 latency reduction (2026 ms vs 3698 ms), and 100% pass rate on a 20-task Code/Math/Instruct benchmark (vs 95% for Native Opus). The system is deployed in production and released open-source with a metrics dashboard.

Significance. If the reported metrics are reproducible and the router decisions do not degrade quality on misclassified turns, the work demonstrates a practical, architecture-agnostic way to apply speculative decoding at the response level for cost and latency savings in production LLM serving. The open-source release and production deployment provide concrete evidence of deployability and enable external verification, which strengthens the contribution beyond typical empirical LLM papers.

major comments (2)

[Abstract and §3 (system description)] The rule-based complexity router is described only at a high level as selecting the SKIPPED path for 'simple agentic turns' and using a 'hybrid tool-call strategy' for schema-critical cases, with no explicit rules, input features, thresholds, decision tree, or per-turn classification accuracy reported. This is load-bearing for the central 88.8% draft-use rate, 45.8% cost reduction, and quality claims, because the results could arise from workload bias rather than router reliability; without these details or an ablation on router error rate, the empirical gains cannot be assessed or reproduced.
[§4 (evaluation) and Table 1 (or equivalent results table)] The 125-request production workload and 20-task benchmark results are presented without error bars, confidence intervals, per-category breakdown (e.g., simple vs complex turns), or router misclassification analysis. This undermines the claim that quality 'matches or exceeds' the Opus baseline, as even a modest router error rate on non-trivial turns could produce the observed aggregate numbers.
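
To make the concern about missing error bars concrete, a Wilson score interval can be computed for the two benchmark pass rates. This is a standard binomial interval applied here as an illustration; the paper reports no such analysis. The counts 20/20 and 19/20 follow from the reported 100% and 95% on 20 tasks, and the function name is hypothetical.

```python
import math
from typing import Tuple


def wilson_interval(successes: int, n: int, z: float = 1.96) -> Tuple[float, float]:
    """Wilson score interval for a binomial proportion (z=1.96 for about 95%)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = successes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    # Clamp to [0, 1] to absorb floating-point drift at the boundaries.
    return max(0.0, center - half), min(1.0, center + half)


cascade_ci = wilson_interval(20, 20)  # RLM-Cascade: 100% of 20 tasks
native_ci = wilson_interval(19, 20)   # Native Opus: 95% of 20 tasks

print(f"RLM-Cascade: {cascade_ci[0]:.3f} to {cascade_ci[1]:.3f}")
print(f"Native Opus: {native_ci[0]:.3f} to {native_ci[1]:.3f}")
```

With only 20 tasks, the two intervals overlap substantially (roughly 0.84 to 1.00 versus 0.76 to 0.99), so the benchmark supports "matches" more safely than "exceeds".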

minor comments (2)

[§4] The latency explanation (SKIPPED path dominating) is plausible but would benefit from a cumulative distribution or breakdown of path frequencies to make the counter-intuitive speedup fully transparent.
[Abstract] The abstract states the system is 'published as open source' with a Prometheus endpoint; the manuscript should include the exact repository URL and commit hash for immediate reproducibility.
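
The path-frequency breakdown and cumulative latency distribution requested in the first minor comment could be produced from per-request logs along the following lines. The records below are made-up values for illustration, not data from the paper, and the function names and record layout are assumptions.

```python
from collections import Counter
from typing import Dict, List, Sequence, Tuple


def path_shares(records: Sequence[Tuple[str, float]]) -> Dict[str, float]:
    """Fraction of requests served by each path."""
    counts = Counter(path for path, _ in records)
    total = len(records)
    return {path: n / total for path, n in counts.items()}


def empirical_cdf(latencies_ms: Sequence[float]) -> List[Tuple[float, float]]:
    """Sorted (latency, cumulative fraction of requests at or below it) pairs."""
    ordered = sorted(latencies_ms)
    n = len(ordered)
    return [(x, (i + 1) / n) for i, x in enumerate(ordered)]


# Made-up log records: (path label, end-to-end latency in ms).
records = [
    ("SKIPPED", 900.0),
    ("SKIPPED", 1100.0),
    ("SKIPPED", 1300.0),
    ("ACCEPTED", 4200.0),
    ("ENHANCED", 6100.0),
]

shares = path_shares(records)
cdf = empirical_cdf([latency for _, latency in records])
print(shares)
print(cdf)
```

A plot of this CDF per path would show directly whether the median falls inside the skip-path mass, which is the mechanism the paper offers for the speedup.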

Simulated Author's Rebuttal

2 responses · 0 unresolved

We appreciate the referee's detailed review and recognition of the practical contributions of RLM-Cascade. We respond to the major comments point-by-point below, proposing specific revisions to address the concerns about router details and evaluation reporting.

Referee: The rule-based complexity router is described only at a high level as selecting the SKIPPED path for 'simple agentic turns' and using a 'hybrid tool-call strategy' for schema-critical cases, with no explicit rules, input features, thresholds, decision tree, or per-turn classification accuracy reported. This is load-bearing for the central 88.8% draft-use rate, 45.8% cost reduction, and quality claims, because the results could arise from workload bias rather than router reliability; without these details or an ablation on router error rate, the empirical gains cannot be assessed or reproduced.

Authors: We agree the current description is high-level. The router employs deterministic rules based on input features including token count, presence of tool calls, and lexical indicators of complexity. We will revise the manuscript to include the full set of rules, thresholds, and a decision tree diagram in §3. Additionally, we will report the per-path routing statistics on the 125-request workload and perform a manual audit of router decisions to quantify any misclassification rate. This will allow readers to assess reliability independent of workload bias. revision: yes
Referee: The 125-request production workload and 20-task benchmark results are presented without error bars, confidence intervals, per-category breakdown (e.g., simple vs complex turns), or router misclassification analysis. This undermines the claim that quality 'matches or exceeds' the Opus baseline, as even a modest router error rate on non-trivial turns could produce the observed aggregate numbers.

Authors: We acknowledge these omissions limit the strength of the empirical claims. In the revision, we will augment Table 1 with bootstrap-derived confidence intervals for cost and latency metrics. We will also add a per-category breakdown of results based on router-assigned complexity and include a misclassification analysis by sampling and reviewing router outputs against ground-truth complexity labels. These changes will provide a more rigorous evaluation of whether quality is preserved across turn types. revision: yes
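
The bootstrap-derived confidence intervals promised in the second response could be computed with a percentile bootstrap such as the one sketched below. This is one common variant, offered as an assumption about what the revision might do. The latencies are synthetic values generated for the example, and `bootstrap_median_ci` is a hypothetical name.

```python
import random
import statistics
from typing import Sequence, Tuple


def bootstrap_median_ci(
    samples: Sequence[float],
    n_resamples: int = 2000,
    alpha: float = 0.05,
    seed: int = 0,
) -> Tuple[float, float]:
    """Percentile-bootstrap confidence interval for the median."""
    if not samples:
        raise ValueError("samples must be non-empty")
    rng = random.Random(seed)
    n = len(samples)
    # Resample with replacement and record the median of each resample.
    medians = sorted(
        statistics.median(rng.choices(samples, k=n)) for _ in range(n_resamples)
    )
    lo = medians[int((alpha / 2) * n_resamples)]
    hi = medians[int((1 - alpha / 2) * n_resamples) - 1]
    return lo, hi


# Synthetic latencies in ms, NOT the paper's data: 125 log-normal draws.
gen = random.Random(42)
latencies = [gen.lognormvariate(7.6, 0.5) for _ in range(125)]

lo, hi = bootstrap_median_ci(latencies)
print(f"median {statistics.median(latencies):.0f} ms, 95% CI {lo:.0f} to {hi:.0f} ms")
```

The same routine applies to per-request cost, and running it separately per router-assigned category would give the per-category breakdown the authors also propose.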

Circularity Check

0 steps flagged

No circularity; results are direct empirical measurements on production workload and benchmark

The paper reports measured outcomes (88.8% draft-use rate on 125 requests, 45.8% cost reduction, 1.83X p50 latency, 100% benchmark pass rate) from running the deployed system. No equations, fitted parameters renamed as predictions, self-definitional steps, or load-bearing self-citations appear in the provided text. The rule-based router and hybrid strategy are described at high level but function as implementation choices whose performance is externally validated by the reported metrics rather than forced by construction. The derivation chain is therefore self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 1 invented entity

Abstract-only review yields no explicit free parameters, axioms, or invented entities beyond the high-level system components; the complexity router and SKIPPED path are described as rule-based but without parameter details or independent evidence.

invented entities (1)

complexity router (no independent evidence)
purpose: Selects between SKIPPED, draft-only, or verify paths based on query complexity
Introduced as a lightweight rule-based component to decide path usage; no independent evidence or falsifiable handle outside the system is provided in the abstract.

pith-pipeline@v0.9.1-grok · 5790 in / 1353 out tokens · 31204 ms · 2026-06-26T09:08:20.010902+00:00 · methodology

Reference graph

Works this paper leans on

12 extracted references · 5 canonical work pages · 3 internal anchors

[1] Y. Leviathan, M. Kalman, and Y. Matias, “Fast inference from transformers via speculative decoding,” in Proc. 40th Int. Conf. Machine Learning (ICML), PMLR, 2023.
[2] C. Chen, S. Borgeaud, G. Irving, J.-B. Lespiau, L. Sifre, and J. Jumper, “Accelerating large language model decoding with speculative sampling,” arXiv preprint arXiv:2302.01318, 2023.
[3] L. Chen, M. Zaharia, and J. Zou, “FrugalGPT: How to use large language models while reducing cost and improving performance,” arXiv preprint arXiv:2305.05176, 2023.
[4] X. Yue et al., “Large language model cascades with mixture of thoughts representations for cost-efficient reasoning,” arXiv preprint arXiv:2310.03094, 2023.
[5] X. Wang et al., “Self-consistency improves chain of thought reasoning in language models,” in Proc. 11th Int. Conf. Learning Representations (ICLR), 2023.
[6] W. Kwon et al., “Efficient memory management for large language model serving with PagedAttention,” in Proc. 29th ACM Symp. Operating Systems Principles (SOSP), ACM, 2023.
[7] G. Yu et al., “Orca: A distributed serving system for Transformer-based generative models,” in Proc. 16th USENIX Symp. Operating Systems Design and Implementation (OSDI), 2022.
[8] T. Cai et al., “Medusa: Simple LLM inference acceleration framework with multiple decoding heads,” in Proc. 41st Int. Conf. Machine Learning (ICML), PMLR, 2024.
[9] Y. Li et al., “EAGLE: Speculative sampling requires rethinking feature uncertainty,” arXiv preprint arXiv:2401.15077, 2024.
[10] D. Xu et al., “Big-little transformer decoder for optimal inference-time cost,” arXiv preprint arXiv:2302.07030, 2023.
[11] L. Zheng et al., “Judging LLM-as-a-judge with MT-Bench and Chatbot Arena,” in Proc. 37th Annu. Conf. Neural Information Processing Systems (NeurIPS), 2023.
[12] A. Agrawal et al., “Taming throughput-latency tradeoff in LLM inference with Sarathi-Serve,” in Proc. 18th USENIX Symp. Operating Systems Design and Implementation (OSDI), 2024.
