pith. machine review for the scientific record.

arxiv: 2604.06970 · v1 · submitted 2026-04-08 · 💻 cs.DC · cs.OS · cs.PF

Recognition: no theorem link

Scheduling the Unschedulable: Taming Black-Box LLM Inference at Scale

Authors on Pith: no claims yet

Pith reviewed 2026-05-10 17:50 UTC · model grok-4.3

classification 💻 cs.DC · cs.OS · cs.PF
keywords LLM scheduling · black-box API · client-side control · deadline satisfaction · token prediction · congestion control · resource allocation · overload management

The pith

Client-side scheduling with predicted token counts achieves full completion and 100% deadline satisfaction for black-box LLM inference under congestion.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that when output lengths can be predicted ahead of time, a client can schedule requests to an opaque LLM service by breaking the problem into three independent parts: deciding how much of the service to give each class of traffic, deciding the order of requests inside each class, and deciding which requests to accept or drop when the service is overloaded. This decomposition lets the client meet every deadline and complete every request even when the system is busy, while still delivering useful throughput. The same structure works whether the client wants to favor short interactive requests or treat all classes more evenly. Because the service internals stay hidden, the approach gives users a way to add reliability to third-party LLM calls without changing the provider.

Core claim

The authors claim that a three-layer client-side scheduler (adaptive deficit round robin for inter-class allocation, feasible-set scoring for intra-class ordering, and explicit admit/defer/reject on a cost ladder for overload), combined with coarse output-token priors, yields 100% completion, 100% deadline satisfaction, and 4.2 ± 1.6 useful SLO-meeting requests per second under the balanced / high congestion regime, with short-request P95 latencies within tens of milliseconds of quota-tiered isolation. The system degrades gracefully under up to 60% multiplicative prediction error and supports different fairness policies via the allocation layer.

What carries the argument

Three-layer client-side decomposition consisting of adaptive DRR allocation for inter-class shares, feasible-set scoring for intra-class ordering, and cost-ladder overload control.
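To make the decomposition concrete, here is a minimal sketch of how the three layers could compose on the client. The interfaces and thresholds are illustrative assumptions; the paper's implementation, including the adaptive resizing of DRR quanta, is not visible from the abstract.

```python
from dataclasses import dataclass

@dataclass
class Request:
    rid: int
    cls: str          # traffic class: "short" / "long" / "xlong"
    deadline: float   # absolute deadline, in seconds
    pred_tokens: int  # coarse output-token prior (e.g. predicted p50)

# Layer 3 -- overload control: explicit admit / defer / reject on a cost ladder.
def overload_action(req: Request, queue_depth: int, now: float) -> str:
    """Escalate by predicted cost: cheap requests stay admitted longest;
    expensive ones are deferred first and rejected first. The thresholds
    here are invented for illustration."""
    if queue_depth < 50 or req.pred_tokens < 100:
        return "admit"
    if queue_depth < 200 and now + 1.0 < req.deadline:
        return "defer"
    return "reject"

# Layer 2 -- intra-class ordering: feasible-set scoring.
def pick_next(queue: list, now: float, rate_tps: float):
    """Among requests that can still meet their deadline at the observed
    service rate (the feasible set), serve the one with the least slack."""
    feasible = [r for r in queue if now + r.pred_tokens / rate_tps <= r.deadline]
    if not feasible:
        return None
    best = min(feasible, key=lambda r: r.deadline - (now + r.pred_tokens / rate_tps))
    queue.remove(best)
    return best

# Layer 1 -- inter-class allocation: deficit round robin over predicted tokens.
def drr_round(queues: dict, quanta: dict, deficits: dict,
              now: float, rate_tps: float, dispatch) -> None:
    """Each class earns a quantum of predicted tokens per round and keeps
    dispatching while its deficit covers the next request's predicted cost.
    The adaptive part (resizing quanta from observed latency) is omitted."""
    for cls, queue in queues.items():
        deficits[cls] += quanta[cls]
        while queue:
            req = pick_next(queue, now, rate_tps)
            if req is None:
                break
            if req.pred_tokens > deficits[cls]:
                queue.append(req)  # not enough deficit yet; retry next round
                break
            deficits[cls] -= req.pred_tokens
            dispatch(req)
```

Read at runtime: overload_action gates admission as requests arrive, drr_round decides which class spends its deficit next, and pick_next chooses within the class. Biasing quanta toward interactive classes gives the short-priority policy; equal quanta give Fair Queuing, matching the abstract's claim that the allocation layer swaps objectives without changing the rest of the stack.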

If this is right

  • Coarse magnitude priors, not class labels alone, are required; removing them increases short-request P95 by up to 5.8 times and reduces deadline satisfaction.
  • The scheduler continues to function with graceful degradation when token predictions carry up to 60% multiplicative error.
  • Fair queuing allocation improves short-request P90 by 32% over FIFO while adding only 17% overhead to long requests.
  • Short-priority allocation achieves 27% short-request improvement but incurs 116% overhead on long requests.
  • Heavy-dominated traffic regimes expose clear differences among policies on completion rates, tail latency, and interpretable shedding behavior.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same three-layer structure could be packaged inside client libraries so that ordinary applications gain deadline guarantees without writing custom schedulers (a sketch of such a wrapper follows this list).
  • The cost-ladder shedding mechanism may apply to other black-box services whose per-request cost can be estimated in advance.
  • Real-world traces with varying prediction accuracy would test whether the 60% error tolerance holds outside controlled sweeps.
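If that packaging holds, a wrapper could look like the following. Every name here (DeadlineClient, predict_tokens, llm_api_call, scheduler.admit) is an illustrative assumption, not an API from the paper:

```python
import time

class DeadlineClient:
    """Hypothetical client-library wrapper around a three-layer stack."""

    def __init__(self, scheduler, predict_tokens, llm_api_call):
        self.scheduler = scheduler            # allocation + ordering + overload control
        self.predict_tokens = predict_tokens  # coarse output-token prior (e.g. p50)
        self.call = llm_api_call              # the opaque provider endpoint

    def submit(self, prompt: str, cls: str, deadline_s: float):
        req = {
            "prompt": prompt,
            "cls": cls,                            # e.g. "short" / "long"
            "deadline": time.time() + deadline_s,
            "pred_tokens": self.predict_tokens(prompt),
        }
        # The stack rules on admit / defer / reject before the API is touched.
        action = self.scheduler.admit(req)
        while action == "defer":                   # back off, then ask again
            time.sleep(0.05)
            action = self.scheduler.admit(req)
        if action == "reject":
            raise RuntimeError("request shed by overload control")
        return self.call(prompt)                   # dispatch to the black-box API
```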

Load-bearing premise

Output token counts can be predicted at submission time with sufficient accuracy.

What would settle it

An experiment that increases token-count prediction error beyond 60% multiplicative and measures whether deadline satisfaction drops below 100% or useful goodput falls sharply.
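A minimal sketch of such a sweep, under one plausible reading of "multiplicative error" (a uniform factor drawn from [1 − L, 1 + L] applied to the policy-facing priors); `run_benchmark` is a hypothetical driver standing in for the paper's harness:

```python
import random

def perturb_prior(true_tokens: int, level: float, rng: random.Random) -> int:
    """Multiplicative noise in the style of the paper's predictor sweep:
    the policy-facing prior is the true count scaled by U(1 - L, 1 + L).
    The exact noise model used in the paper is not visible from the abstract."""
    factor = rng.uniform(1.0 - level, 1.0 + level)
    return max(1, round(true_tokens * factor))

def noise_sweep(run_benchmark, levels=(0.0, 0.3, 0.6, 0.8, 1.0), seeds=range(5)):
    """Extend the paper's sweep past L = 0.6. `run_benchmark` (hypothetical)
    applies `perturb_prior` to every request's prior and returns
    (deadline_satisfaction, useful_goodput) for one seeded run."""
    results = {}
    for level in levels:
        runs = [run_benchmark(noise_level=level, seed=s) for s in seeds]
        results[level] = (
            sum(r[0] for r in runs) / len(runs),  # mean deadline satisfaction
            sum(r[1] for r in runs) / len(runs),  # mean useful goodput
        )
    return results
```

If deadline satisfaction stays at 100% well past L = 0.6, the load-bearing premise is weaker than it looks; a sharp drop just beyond 0.6 would confirm it.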

Figures

Figures reproduced from arXiv: 2604.06970 by Han Wang, Haochun Liao, Linxi Yu, Renzhong Yuan, Xiaosong Gao, Yijun Zeng.

Figure 1: Data path for the client-side stack: allocation, ordering, and overload control ahead of the mock black-box API; coarse length priors are available before dispatch.
Figure 2: Information ladder with Final (OLC) fixed (five seeds per regime × condition). Top: short-request P95 (mean ± std); no-information blind uses red hatching. Bottom: completion rate and useful goodput. Conditions: no-information blind, class-only, coarse semi-clairvoyant, oracle.
Figure 3: Short-request P95 versus completion rate (four regimes; mean ± std over seeds). Structured policies sit at high completion with moderate short tails; naive dispatch skews toward worse short P95 and lower completion under stress.
Figure 4: Useful goodput versus global P95 for the same main-benchmark runs as the previous figure.
Figure 5: Overload actions summed over Final (OLC) main-benchmark runs (20 runs: four regimes × five seeds): rejections concentrate on xlong; short requests are never rejected.
Figure 6: Overload bucket_policy comparison (Final OLC fixed; balanced/high and heavy-dominated/high). Grouped bars: short P95, useful goodput, and completion rate for cost ladder, uniform mild/harsh, and reverse (stress contrast). Mean ± std, five seeds.
Figure 7: Layerwise progression under high congestion (balanced/high, heavy-dominated/high): short P95, useful goodput, and completion from naive dispatch through quota-tiered isolation, adaptive DRR, and Final (OLC).
Figure 8: Predictor-quality sweep (Final OLC fixed): multiplicative noise on policy-facing p50/p90 priors with L from 0 to 0.6; mock physics unchanged. Mean ± std over five seeds; one line per regime.
read the original abstract

When output token counts can be predicted at submission time (Gan et al., 2026), client-side scheduling against a black-box LLM API becomes semi-clairvoyant: decisions condition on coarse token priors even though the provider's internals remain hidden. We decompose this boundary problem into three separable concerns: allocation (inter-class share via adaptive DRR), ordering (intra-class sequencing with feasible-set scoring), and overload control (explicit admit/defer/reject on a cost ladder). An information ladder experiment shows that coarse magnitude priors -- not class labels alone -- are the practical threshold for useful client control; removing magnitude inflates short-request P95 by up to $5.8\times$ and degrades deadline satisfaction. Under balanced / high congestion the full stack achieves 100% completion, 100% deadline satisfaction, and useful goodput of $4.2 \pm 1.6$ SLO-meeting requests/s with short P95 within tens of milliseconds of quota-tiered isolation. A predictor-noise sweep confirms graceful degradation under up to 60% multiplicative error. Heavy-dominated regimes separate policies on completion, tail, and interpretable shedding. We further compare short-priority allocation (biased toward interactive traffic) with Fair Queuing (round-robin across classes): Fair Queuing achieves +32% short-request P90 improvement over FIFO with only +17% long-request overhead, versus Short-Priority's +27% / +116% trade-off -- demonstrating that the allocation layer accommodates different fairness objectives without changing the remaining stack. We contribute the three-layer client-side decomposition, controlled evaluation of joint metrics across regimes, allocation-policy alternatives, and overload-policy evidence linking cost-ladder shedding to the stated service objective.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper claims that when output token counts can be predicted at submission time (citing Gan et al. 2026), a client-side three-layer scheduler for black-box LLM APIs—consisting of adaptive DRR for inter-class allocation, feasible-set scoring for intra-class ordering, and cost-ladder overload control—achieves 100% request completion and deadline satisfaction under balanced/high congestion, with useful goodput of 4.2 ± 1.6 SLO-meeting requests/s and short-request P95 latencies within tens of milliseconds of quota-tiered isolation. An information-ladder experiment shows magnitude priors (not just class labels) are essential to avoid up to 5.8× P95 inflation, a noise sweep shows graceful degradation up to 60% multiplicative prediction error, and allocation-policy comparisons (short-priority vs. Fair Queuing) demonstrate flexibility without altering the rest of the stack.

Significance. If the results hold under the stated assumptions, the work offers a practical, separable decomposition for client-side SLO management of black-box LLM inference at scale, with controlled cross-regime evaluation and explicit policy trade-offs. The emphasis on coarse priors enabling effective control, plus reproducible policy alternatives, would be a useful contribution to systems for LLM API orchestration.

major comments (3)
  1. [§5] §5 (predictor-noise sweep and information-ladder experiment): the headline metrics (100% completion, 100% deadline satisfaction, 4.2 ± 1.6 goodput) are obtained only under the assumption that the external predictor from Gan et al. 2026 meets the required accuracy on the paper's workloads and SLO definitions. No independent measurement of prediction error rates for the tested request mix is reported; the 60% multiplicative noise sweep therefore tests robustness but does not establish that the baseline error is low enough for the 100% figures to be attainable in practice.
  2. [§4] §4 (experimental setup): the comparison to quota-tiered isolation and the reported P95 closeness are load-bearing for the central claim of near-optimal tail behavior, yet the manuscript provides no details on how the quota-tiered baseline is implemented, what its exact parameters are, or raw latency distributions, making it impossible to verify that the three-layer stack truly matches it within 'tens of milliseconds'.
  3. [Heavy-dominated regime evaluation] Heavy-dominated regime results: the separation of policies on completion, tail latency, and shedding is presented as evidence that the stack accommodates different objectives, but without the exact request-mix parameters, arrival rates, or deadline definitions used in that regime, it is difficult to assess whether the observed differences are robust or specific to the chosen synthetic mix.
minor comments (2)
  1. The term 'useful goodput' is used in the abstract and results but never given an explicit formula or definition in terms of the SLO parameters; adding a short equation or paragraph would improve clarity (one plausible formalization is sketched after these comments).
  2. Figure captions for the information-ladder and noise-sweep plots should include the exact request mix, number of runs, and confidence intervals rather than only the headline numbers.
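On the first minor comment, a plausible formalization consistent with the abstract's "SLO-meeting requests/s"; the paper may define it differently:

```python
def useful_goodput(requests, window_s: float) -> float:
    """One plausible reading of 'useful goodput', not the paper's formula:
    completed requests that also met their deadline, per wall-clock second.
    Each request is a dict with 'completed', 'finish_time', and 'deadline'."""
    met = sum(1 for r in requests
              if r["completed"] and r["finish_time"] <= r["deadline"])
    return met / window_s
```

Under this reading, rejected or late requests contribute zero, which is what makes the metric "useful": raw throughput that misses deadlines does not count.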

Simulated Author's Rebuttal

3 responses · 1 unresolved

We thank the referee for the constructive and detailed feedback. We address each major comment below, indicating where revisions will be made to improve clarity and reproducibility while noting the limits of what we can provide.

read point-by-point responses
  1. Referee: [§5] §5 (predictor-noise sweep and information-ladder experiment): the headline metrics (100% completion, 100% deadline satisfaction, 4.2 ± 1.6 goodput) are obtained only under the assumption that the external predictor from Gan et al. 2026 meets the required accuracy on the paper's workloads and SLO definitions. No independent measurement of prediction error rates for the tested request mix is reported; the 60% multiplicative noise sweep therefore tests robustness but does not establish that the baseline error is low enough for the 100% figures to be attainable in practice.

    Authors: We agree that the headline 100% completion and deadline-satisfaction figures are conditional on the accuracy of the predictor cited from Gan et al. (2026). The noise sweep shows graceful degradation up to 60% multiplicative error, but we did not independently measure the predictor's error rate on our specific request mix and SLO definitions. This is because the manuscript focuses on the client-side scheduling decomposition assuming a usable predictor is available (as stated in the abstract and introduction). In the revision we will add an explicit caveats section discussing this assumption, its implications for real-world use, and the fact that the 100% results hold only when prediction error remains below the demonstrated robustness threshold. We cannot supply an independent error measurement because we did not re-implement or evaluate the Gan et al. predictor ourselves. revision: partial

  2. Referee: [§4] §4 (experimental setup): the comparison to quota-tiered isolation and the reported P95 closeness are load-bearing for the central claim of near-optimal tail behavior, yet the manuscript provides no details on how the quota-tiered baseline is implemented, what its exact parameters are, or raw latency distributions, making it impossible to verify that the three-layer stack truly matches it within 'tens of milliseconds'.

    Authors: We acknowledge that the quota-tiered isolation baseline lacks sufficient implementation details for independent verification. In the revised manuscript we will add a dedicated subsection describing the baseline implementation, the exact quota parameters and tier definitions used, and additional summary statistics (including P95, P99, and inter-quartile ranges) from the latency distributions. This will allow readers to confirm that the three-layer scheduler's short-request P95 remains within tens of milliseconds of the baseline under the reported loads. Raw full distributions are voluminous but we will provide them in supplementary material or a public repository. revision: yes

  3. Referee: [Heavy-dominated regime evaluation] Heavy-dominated regime results: the separation of policies on completion, tail latency, and shedding is presented as evidence that the stack accommodates different objectives, but without the exact request-mix parameters, arrival rates, or deadline definitions used in that regime, it is difficult to assess whether the observed differences are robust or specific to the chosen synthetic mix.

    Authors: We agree that the heavy-dominated regime results require the precise experimental parameters for reproducibility and to evaluate robustness. In the revision we will include a table or appendix entry listing the exact request-mix composition (class proportions), arrival rates, and deadline definitions used in those experiments. This will make clear that the observed policy separations on completion rate, tail latency, and shedding behavior are tied to the stated synthetic workload and can be assessed accordingly. revision: yes

standing simulated objections not resolved
  • Independent measurement of prediction error rates from Gan et al. (2026) on the paper's workloads and SLO definitions, as this would require re-implementing and running their predictor, which lies outside the scope of our client-side scheduling contribution.

Circularity Check

0 steps flagged

No circularity detected; the scheduler design and empirical claims are independent of the cited external predictor.

full rationale

The paper's core contribution is a three-layer client-side decomposition (adaptive DRR allocation, feasible-set ordering, cost-ladder overload control) whose performance is measured under the explicit assumption that coarse token priors are available from an external source (Gan et al., 2026). Experiments include an information-ladder test and a predictor-noise sweep up to 60% multiplicative error, but the design equations, policy comparisons (e.g., Fair Queuing vs. Short-Priority), and reported metrics (100% completion, 4.2 ± 1.6 SLO-meeting req/s) do not reduce by construction to any fitted parameter or self-citation chain inside the paper. The cited predictor is treated as an input rather than derived or renamed within the work, satisfying the criteria for a self-contained derivation against stated assumptions.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The paper's claims depend on the external token predictor and on the experimental setup details not visible in the abstract; no new entities are introduced.

axioms (1)
  • domain assumption: Output token counts can be predicted at submission time.
    This is the key enabler cited from Gan et al., 2026.

pith-pipeline@v0.9.0 · 5632 in / 1253 out tokens · 45062 ms · 2026-05-10T17:50:53.992957+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

10 extracted references · 1 canonical work page

  1. Zhenghao Gan, Yichen Bao, Yifei Liu, Chen Chen, Quan Chen, and Minyi Guo. SageSched: Efficient LLM Scheduling Confronting Demand Uncertainty and Hybridity. arXiv preprint arXiv:2603.07917, March 2026.

  2. G.-I. Yu et al. Orca: A Distributed Serving System for Transformer-Based Generative Models. In OSDI, 2022.

  3. W. Kwon et al. Efficient Memory Management for Large Language Model Serving with PagedAttention. In SOSP, 2023.

  4. Y. Zhong et al. DistServe: Disaggregating Prefill and Decoding for Goodput-Optimized Large Language Model Serving. In OSDI, 2024.

  5. A. Agrawal et al. Taming Throughput–Latency Tradeoff in LLM Inference with Sarathi-Serve. In OSDI, 2024.

  6. P. Patel et al. Splitwise: Efficient Generative LLM Inference Using Phase Splitting. In NSDI, 2024.

  7. A. K. Parekh and R. G. Gallager. A Generalized Processor Sharing Approach to Flow Control in Integrated Services Networks: The Single-Node Case. IEEE/ACM Trans. Networking, 1(3):344–357, 1993.

  8. D. Crankshaw et al. Clipper: A Low-Latency Online Prediction Serving System. In NSDI, 2017.

  9. F. Romero et al. INFaaS: Automated Model-less Inference Serving. In ATC, 2021.

  10. A. Gujarati et al. Serving DNNs like Clockwork: Performance Predictability from the Bottom Up. In OSDI, 2020.