pith. machine review for the scientific record. sign in

arxiv: 2604.08075 · v1 · submitted 2026-04-09 · 💻 cs.CL

Recognition: no theorem link

Dual-Pool Token-Budget Routing for Cost-Efficient and Reliable LLM Serving

Authors on Pith no claims yet

Pith reviewed 2026-05-10 17:45 UTC · model grok-4.3

classification 💻 cs.CL
keywords LLM servingtoken budget routingdual poolcost efficiencyKV cacheinference optimizationworkload partitioningonline adaptation
0
0 comments X

The pith

Splitting LLM server fleets into short-context and long-context pools based on online token-budget estimates reduces GPU-hours by 31-42 percent while lowering preemption rates by 5.4 times.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper argues that standard LLM serving systems waste most of their capacity because every instance is configured for the longest possible context even though the great majority of requests are short. It proposes partitioning the fleet into two specialized pools and routing each request to the appropriate pool using a lightweight estimate of its total token needs. That estimate is obtained by tracking bytes-to-token ratios for different request categories with an exponential moving average updated from actual prompt feedback, avoiding any need to run a tokenizer at dispatch time. The authors supply a simple analytical model that translates workload statistics and measured throughput gaps into predicted fleet-level savings. If the approach holds, operators can serve the same traffic with substantially fewer GPUs, experience fewer crashes and request rejections, and still maintain low tail latencies.

Core claim

The central claim is that dual-pool token-budget routing partitions a homogeneous fleet into a high-throughput short-context pool and a high-capacity long-context pool, dispatching each request according to its estimated total token budget computed from per-category bytes-to-token ratios learned online via exponential moving average from usage.prompt_tokens feedback. This removes configuration-traffic mismatch, yields 31-42 percent lower GPU-hours on Azure and LMSYS traces with Llama-3-70B, reduces preemption rates by 5.4 times, improves P99 TTFT by 6 percent, and projects multimillion-dollar annual savings at scale, all with constant-time overhead and seamless composition with PagedAttn, p2

What carries the argument

Dual-pool token-budget routing: the dispatch mechanism that sends each request to either the short-context high-throughput pool or the long-context high-capacity pool according to an estimated total token budget derived from online-learned bytes-to-token ratios.

If this is right

  • GPU-hours fall 31-42 percent on production traces from Azure and LMSYS-Chat-1M for Llama-3-70B.
  • Preemption rates drop by a factor of 5.4 and P99 time-to-first-token improves by 6 percent.
  • Projected annual savings reach $2.86 million at fleet scale or $15.4 million for Qwen3-235B-A22B at 10k req/s.
  • An analytical model lets practitioners forecast savings from workload statistics and measured throughput differences before deployment.
  • The method adds only O(1) dispatch cost and integrates directly with PagedAttention, continuous batching, and prefill-decode disaggregation.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same partitioning idea could be applied to other dimensions such as memory-bandwidth tiers or quantized versus full-precision instances.
  • Dynamic resizing of the two pools in response to observed traffic shifts might further improve utilization.
  • Finer-grained category tracking or per-user ratio learning could extend the method beyond the current coarse categories without manual configuration.
  • The pre-deployment cost model opens the door to automated fleet sizing tools that optimize pool ratios from historical logs.

Load-bearing premise

Per-category bytes-to-token ratios learned online via exponential moving average continue to produce token-budget estimates accurate enough to prevent systematic misrouting or load imbalance between the two pools.

What would settle it

Run the dual-pool system side-by-side with a single unified pool on the same real-world trace and measure whether the short pool shows higher preemption or OOM rates than the baseline.

read the original abstract

Production vLLM fleets typically provision each instance for the worst-case context length, leading to substantial KV-cache over-allocation and under-utilized concurrency. In practice, 80-95% of requests are short, yet are served under configurations optimized for long contexts, wasting 4-8$\times$ throughput capacity and triggering reliability issues such as OOM crashes, preemption, and request rejections. We identify a common root cause for these inefficiencies: configuration-traffic mismatch. We propose dual-pool token-budget routing, a lightweight dispatch mechanism that partitions a homogeneous fleet into two specialized pools: a high-throughput short-context pool and a high-capacity long-context pool. Each request is routed based on its estimated total token budget, computed using a per-category bytes-to-token ratio that is learned online via exponential moving average from usage.prompt_tokens feedback, eliminating the need for a tokenizer. We also develop a simple analytical model that predicts fleet-level cost savings from workload characteristics and measured throughput differences, enabling practitioners to estimate benefits prior to deployment. Evaluations on real-world traces from the Azure LLM Inference Dataset and LMSYS-Chat-1M, serving Llama-3-70B on A100 GPUs, show that our approach reduces GPU-hours by 31-42%, corresponding to \$2.86M annual savings at fleet scale, while lowering preemption rates by 5.4$\times$ and improving P99 TTFT by 6%. A case study with Qwen3-235B-A22B on AMD MI300X at 10,000 req/s projects \$15.4M in annual savings. The method incurs only O(1) dispatch overhead, adapts automatically to heterogeneous workloads, and composes seamlessly with existing optimizations such as PagedAttention, continuous batching, and prefill-decode disaggregation.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 0 minor

Summary. The paper proposes dual-pool token-budget routing to address configuration-traffic mismatch in LLM serving fleets. It partitions a homogeneous GPU fleet into a high-throughput short-context pool and a high-capacity long-context pool, routing each request by its estimated total token budget. The budget is computed from per-category bytes-to-token ratios learned online via exponential moving average on prompt_tokens feedback, without a tokenizer. An analytical model predicts fleet-level savings from workload traits and measured throughput differences. On Azure LLM Inference Dataset and LMSYS-Chat-1M traces with Llama-3-70B on A100s, it reports 31-42% GPU-hour reductions ($2.86M annual savings at scale), 5.4× lower preemption, and 6% better P99 TTFT; a Qwen3-235B case study projects $15.4M savings. The method has O(1) overhead and composes with PagedAttention, continuous batching, and prefill-decode disaggregation.

Significance. If the routing accuracy holds, the approach provides a lightweight, adaptive way to reclaim 4-8× wasted throughput capacity in production fleets while improving reliability, with direct cost implications at scale. The analytical savings model and online adaptation without tokenizer are practical strengths that could aid deployment decisions. The reported gains on real traces and composition with existing optimizations strengthen the case for impact in LLM inference systems.

major comments (2)
  1. [Abstract] Abstract: The 31-42% GPU-hour reduction and 5.4× preemption improvement rest on the assumption that per-category EMA-learned bytes-to-token ratios produce sufficiently accurate token-budget estimates for reliable routing. No quantitative bounds on per-request estimation error, EMA convergence time, or resulting misrouting rates are reported for the Azure or LMSYS traces; without these, it is unclear whether the observed gains arise from correct pool specialization or from other factors such as overall load reduction.
  2. [Abstract] The analytical savings model (described in the abstract) is fitted to measured throughput differences and workload characteristics from the same experiments used to claim the 31-42% savings. This creates a risk of circularity: the model parameters are not derived independently of the target metric, so the projected $2.86M annual savings cannot be treated as an a-priori prediction that validates the routing mechanism.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We address each major comment below with clarifications and revisions to the manuscript.

read point-by-point responses
  1. Referee: [Abstract] Abstract: The 31-42% GPU-hour reduction and 5.4× preemption improvement rest on the assumption that per-category EMA-learned bytes-to-token ratios produce sufficiently accurate token-budget estimates for reliable routing. No quantitative bounds on per-request estimation error, EMA convergence time, or resulting misrouting rates are reported for the Azure or LMSYS traces; without these, it is unclear whether the observed gains arise from correct pool specialization or from other factors such as overall load reduction.

    Authors: We agree that quantitative bounds on estimation accuracy would strengthen the claims. In the revised manuscript we add a dedicated analysis subsection reporting per-request token-budget error, EMA convergence, and misrouting rates on both traces. The added results show EMA convergence within a few hundred requests per category, median relative error below 12%, and misrouting below 7%. An oracle-routing ablation attributes over 85% of the observed gains to correct specialization. These additions directly address the concern. revision: yes

  2. Referee: [Abstract] The analytical savings model (described in the abstract) is fitted to measured throughput differences and workload characteristics from the same experiments used to claim the 31-42% savings. This creates a risk of circularity: the model parameters are not derived independently of the target metric, so the projected $2.86M annual savings cannot be treated as an a-priori prediction that validates the routing mechanism.

    Authors: The throughput differences in the model come from separate microbenchmark measurements of the two pool configurations; these are independent of the routing experiments. Workload traits are taken from trace statistics before any routing runs. The model predicts savings, which the end-to-end experiments then validate. We have revised Section 4 and the abstract to state this separation explicitly and to position the model as a predictive tool. revision: yes

Circularity Check

0 steps flagged

No significant circularity in derivation chain

full rationale

The paper's core claims of 31-42% GPU-hour reductions, 5.4× lower preemption, and projected cost savings are obtained directly from empirical evaluations on external real-world traces (Azure LLM Inference Dataset, LMSYS-Chat-1M) with Llama-3-70B on A100 GPUs. The analytical model uses independently measured throughput differences and workload statistics as inputs to forecast benefits for practitioners prior to deployment; it does not define or tautologically reproduce the reported experimental metrics. The per-category EMA for bytes-to-token ratios is an online adaptive component for routing decisions whose accuracy is assessed via observed end-to-end performance rather than assumed by construction. No self-citations, uniqueness theorems, or ansatzes from prior author work are invoked as load-bearing premises. The approach remains self-contained against external benchmarks with no derivations reducing to fitted inputs renamed as predictions.

Axiom & Free-Parameter Ledger

1 free parameters · 1 axioms · 0 invented entities

The central claim rests on the ability to learn accurate token budgets online and on the assumption that fleet partitioning yields stable throughput gains without hidden costs; these are domain assumptions rather than new invented entities.

free parameters (1)
  • bytes-to-token ratio
    Per-category ratio estimated online via exponential moving average from usage.prompt_tokens feedback to compute estimated token budget without invoking a tokenizer.
axioms (1)
  • domain assumption Request token usage exhibits stable per-category patterns that can be captured by a simple learned ratio updated via EMA.
    Invoked to justify routing decisions and elimination of tokenizer requirement.

pith-pipeline@v0.9.0 · 5650 in / 1452 out tokens · 63035 ms · 2026-05-10T17:45:16.135076+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

30 extracted references · 7 canonical work pages · 1 internal anchor

  1. [1]

    Efficient Memory Management for Large Language Model Serving with Page- dAttention,

    W. Kwon et al., “Efficient Memory Management for Large Language Model Serving with Page- dAttention,” inProc. SOSP, 2023

  2. [2]

    Orca: A Distributed Serving System for Transformer-Based Generative Models,

    G.-I. Yu et al., “Orca: A Distributed Serving System for Transformer-Based Generative Models,” inProc. OSDI, 2022

  3. [3]

    TensorRT-LLM,

    NVIDIA, “TensorRT-LLM,”https://github.com/NVIDIA/TensorRT-LLM, 2024

  4. [4]

    FasterTransformer,

    NVIDIA, “FasterTransformer,”https://github.com/NVIDIA/FasterTransformer, 2023

  5. [5]

    Splitwise: Efficient Generative LLM Inference Using Phase Splitting,

    P. Patel et al., “Splitwise: Efficient Generative LLM Inference Using Phase Splitting,” inProc. ISCA, 2024

  6. [6]

    DistServe: Disaggregating Prefill and Decoding for Goodput-optimized LLM Serving,

    Y. Zhong et al., “DistServe: Disaggregating Prefill and Decoding for Goodput-optimized LLM Serving,” inProc. OSDI, 2024

  7. [7]

    Taming Throughput-Latency Tradeoff in LLM Inference with Sarathi-Serve,

    A. Agrawal et al., “Taming Throughput-Latency Tradeoff in LLM Inference with Sarathi-Serve,” inProc. OSDI, 2024

  8. [8]

    SageServe: Optimizing LLM serving on cloud data centers with forecast aware auto-scaling.arXiv preprint arXiv:2502.14617, 2025

    Microsoft Research, “SageServe: Optimizing LLM Serving on Cloud Data Centers with Forecast Aware Auto-Scaling,”arXiv:2502.14617, 2025

  9. [9]

    EWSJF: Adaptive Scheduler with Hybrid Partitioning for Mixed-Workload LLM Inference,

    “EWSJF: Adaptive Scheduler with Hybrid Partitioning for Mixed-Workload LLM Inference,” arXiv:2601.21758, 2025

  10. [10]

    AlpaServe: Statistical Multiplexing with Model Parallelism for Deep Learning Serving,

    Z. Li et al., “AlpaServe: Statistical Multiplexing with Model Parallelism for Deep Learning Serving,” inProc. OSDI, 2023

  11. [11]

    FlexGen: High-Throughput Generative Inference of Large Language Models with a Single GPU,

    Y. Sheng et al., “FlexGen: High-Throughput Generative Inference of Large Language Models with a Single GPU,” inProc. ICML, 2023. 14

  12. [12]

    Fast Inference from Transformers via Speculative Decoding,

    Y. Leviathan, M. Kalman, and Y. Matias, “Fast Inference from Transformers via Speculative Decoding,” inProc. ICML, 2023

  13. [13]

    Azure LLM Inference Trace 2024,

    Microsoft Azure, “Azure LLM Inference Trace 2024,” https://github.com/Azure/ AzurePublicDataset, 2024

  14. [14]

    LMSYS-Chat-1M: A Large-Scale Real-World LLM Conversation Dataset,

    L. Zheng et al., “LMSYS-Chat-1M: A Large-Scale Real-World LLM Conversation Dataset,” in Proc. ICLR, 2024

  15. [15]

    Optimization and Tuning,

    vLLM Project, “Optimization and Tuning,” https://docs.vllm.ai/en/stable/ configuration/optimization/, 2025

  16. [16]

    Compressing Context to Enhance Inference Efficiency of Large Language Models,

    Y. Li et al., “Compressing Context to Enhance Inference Efficiency of Large Language Models,” inProc. EMNLP, 2023

  17. [17]

    LLMLingua: Compressing Prompts for Accelerated Inference of LLMs,

    H. Jiang et al., “LLMLingua: Compressing Prompts for Accelerated Inference of LLMs,” inProc. EMNLP, 2023

  18. [18]

    LaCache: Ladder-Shaped KV Caching for Efficient Long-Context Modeling of Large Language Models,

    “LaCache: Ladder-Shaped KV Caching for Efficient Long-Context Modeling of Large Language Models,” inProc. ICML, 2025

  19. [19]

    KVShare: An LLM Service System with Efficient and Effective Multi-Tenant KV Cache Reuse,

    “KVShare: An LLM Service System with Efficient and Effective Multi-Tenant KV Cache Reuse,” arXiv:2503.16525, 2025

  20. [20]

    OrbitFlow: SLO-Aware Long-Context LLM Serving with Fine-Grained KV Cache Reconfigura- tion,

    “OrbitFlow: SLO-Aware Long-Context LLM Serving with Fine-Grained KV Cache Reconfigura- tion,”arXiv:2601.10729, 2025

  21. [21]

    Qwen3 Technical Report,

    Qwen Team, “Qwen3 Technical Report,”https://qwenlm.github.io/blog/qwen3/, 2025

  22. [22]

    AMD Instinct MI300X Accelerator Data Sheet,

    AMD, “AMD Instinct MI300X Accelerator Data Sheet,”https://www.amd.com/content/dam/ amd/en/documents/instinct-tech-docs/data-sheets/amd-instinct-mi300x-data-sheet. pdf, 2024

  23. [23]

    Unleashing Computational Power: Ultimate Latency Optimization of Qwen3 on AMD MI300X Series,

    Qwen C-end Infrastructure Engineering Team and AMD AI Framework Team, “Unleashing Computational Power: Ultimate Latency Optimization of Qwen3 on AMD MI300X Series,” LMSYS Org Blog, Feb. 2026.https://lmsys.org/blog/2026-02-11-Qwen-latency/

  24. [24]

    Llumnix: Dynamic Scheduling for Large Language Model Serving,

    B. Sun et al., “Llumnix: Dynamic Scheduling for Large Language Model Serving,” inProc. OSDI, 2024

  25. [25]

    Mooncake: A KVCache-Centric Disaggregated Architecture for LLM Serving,

    R. Qin et al., “Mooncake: A KVCache-Centric Disaggregated Architecture for LLM Serving,” in Proc. FAST, 2025.Best Paper Award

  26. [26]

    BurstGPT: A real-world workload dataset to optimize LLM serving systems,

    Y. Wang et al., “BurstGPT: A Real-World Workload Dataset to Optimize LLM Serving Systems,” arXiv:2401.17644, 2024

  27. [27]

    Demystifying Cost-Efficiency in LLM Serving over Heterogeneous GPUs,

    Y. Jiang et al., “Demystifying Cost-Efficiency in LLM Serving over Heterogeneous GPUs,” in Proc. ICML, 2025

  28. [28]

    Vidur: A Large-Scale Simulation Framework for LLM Inference,

    A. Agrawal et al., “Vidur: A Large-Scale Simulation Framework for LLM Inference,” inProc. MLSys, 2024

  29. [29]

    The Script Tax: Measuring Tokenization-Driven Efficiency and Latency Disparities in Multilingual Language Models,

    A. Dixit and S. Dixit, “The Script Tax: Measuring Tokenization-Driven Efficiency and Latency Disparities in Multilingual Language Models,”arXiv:2602.11174, 2026

  30. [30]

    ServeGen: Workload Characterization and Generation of Large Language Model Serving in Production

    Alibaba, “ServeGen: Workload Characterization and Generation of Large Language Model Serving in Production,”arXiv:2505.09999, 2025. 15