Scheduling the Unschedulable: Taming Black-Box LLM Inference at Scale
Pith reviewed 2026-05-10 17:50 UTC · model grok-4.3
The pith
Client-side scheduling with predicted token counts achieves 100% completion and 100% deadline satisfaction for black-box LLM inference under congestion.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The authors claim that a three-layer client-side scheduler—adaptive deficit round robin for inter-class allocation, feasible-set scoring for intra-class ordering, and explicit admit/defer/reject on a cost ladder for overload—combined with coarse output-token priors, yields 100% completion, 100% deadline satisfaction, and 4.2 ± 1.6 useful SLO-meeting requests per second under balanced or high congestion, with short-request P95 latencies close to those of quota-tiered isolation. The system degrades gracefully under up to 60% multiplicative prediction error and supports different fairness policies via the allocation layer.
What carries the argument
Three-layer client-side decomposition consisting of adaptive DRR allocation for inter-class shares, feasible-set scoring for intra-class ordering, and cost-ladder overload control.
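The three-layer split can be sketched concretely. The Python below is a minimal illustration under stated assumptions, not the authors' implementation: `Request`, `ThreeLayerScheduler`, the per-class quanta, the token service rate, and the cost-ladder thresholds are all hypothetical, and feasible-set scoring is approximated as least-slack-first over deadline-feasible requests.

```python
from dataclasses import dataclass

@dataclass
class Request:
    rid: int
    cls: str               # traffic class, e.g. "short" or "long"
    predicted_tokens: int  # coarse output-token prior
    deadline: float        # absolute deadline, seconds from t=0

class ThreeLayerScheduler:
    """Illustrative composition of the three layers: allocation (DRR),
    ordering (feasible-set scoring), and cost-ladder overload control."""

    def __init__(self, quanta, tokens_per_sec, ladder):
        self.quanta = quanta                      # tokens granted per DRR round
        self.deficit = {c: 0 for c in quanta}
        self.queues = {c: [] for c in quanta}
        self.rate = tokens_per_sec                # assumed service rate
        self.defer_cost, self.reject_cost = ladder

    def submit(self, req, now):
        """Layer 3: overload control -- admit/defer/reject on a cost ladder."""
        cost = req.predicted_tokens / self.rate   # predicted seconds of service
        if cost > self.reject_cost or now + cost > req.deadline:
            return "reject"                       # over the ladder top or infeasible
        if cost > self.defer_cost:
            return "defer"                        # caller may resubmit later
        self.queues[req.cls].append(req)
        return "admit"

    def _pick(self, cls, now):
        """Layer 2: feasible-set scoring -- only requests that can still meet
        their deadline compete; least predicted slack goes first."""
        feasible = [r for r in self.queues[cls]
                    if now + r.predicted_tokens / self.rate <= r.deadline]
        if not feasible:
            return None
        best = min(feasible,
                   key=lambda r: r.deadline - now - r.predicted_tokens / self.rate)
        self.queues[cls].remove(best)
        return best

    def dispatch_round(self, now):
        """Layer 1: deficit round robin across classes."""
        dispatched = []
        for cls, quantum in self.quanta.items():
            self.deficit[cls] += quantum
            while True:
                req = self._pick(cls, now)
                if req is None:
                    break
                if req.predicted_tokens > self.deficit[cls]:
                    self.queues[cls].append(req)  # not enough deficit yet
                    break
                self.deficit[cls] -= req.predicted_tokens
                dispatched.append(req)
            if not self.queues[cls]:
                self.deficit[cls] = 0             # classic DRR: idle class forfeits deficit
        return dispatched
```

In this sketch the three layers stay separable, matching the paper's argument: the allocation policy can be swapped (e.g. short-priority quanta vs. equal quanta) without touching the ordering or overload layers.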
If this is right
- Coarse magnitude priors, not class labels alone, are required; removing them increases short-request P95 by up to 5.8 times and reduces deadline satisfaction.
- The scheduler continues to function with graceful degradation when token predictions carry up to 60% multiplicative error.
- Fair queuing allocation improves short-request P90 by 32% over FIFO while adding only 17% overhead to long requests.
- Short-priority allocation achieves 27% short-request improvement but incurs 116% overhead on long requests.
- Heavy-dominated traffic regimes expose clear differences among policies on completion rates, tail latency, and interpretable shedding behavior.
Where Pith is reading between the lines
- The same three-layer structure could be packaged inside client libraries so that ordinary applications gain deadline guarantees without writing custom schedulers.
- The cost-ladder shedding mechanism may apply to other black-box services whose per-request cost can be estimated in advance.
- Real-world traces with varying prediction accuracy would test whether the 60% error tolerance holds outside controlled sweeps.
Load-bearing premise
Output token counts can be predicted at submission time with sufficient accuracy.
What would settle it
An experiment that increases token-count prediction error beyond 60% multiplicative and measures whether deadline satisfaction drops below 100% or useful goodput falls sharply.
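One way to make that settling experiment concrete is a toy shortest-predicted-first run in which multiplicative prediction noise can reorder long requests ahead of short ones and push the short ones past their deadlines. Everything here (the 16-short/4-long mix, the service rate, the deadline formula) is an illustrative assumption, not the paper's simulator.

```python
import random

def deadline_satisfaction(noise, rate=100.0, seed=1):
    """Fraction of requests finishing by their deadline when the scheduler
    orders by a token prediction perturbed by multiplicative noise drawn
    from [1 - noise, 1 + noise]."""
    rng = random.Random(seed)
    sizes = [30] * 16 + [600] * 4                  # true output-token counts
    jobs = []
    for true_tokens in sizes:
        factor = max(0.05, 1.0 + rng.uniform(-noise, noise))
        pred = true_tokens * factor                # noisy prediction
        # deadline: 4x own service time plus fixed queueing headroom
        deadline = 5.0 + 4.0 * true_tokens / rate
        jobs.append((pred, true_tokens, deadline))
    jobs.sort(key=lambda j: j[0])                  # shortest-predicted-first
    t, met = 0.0, 0
    for _, true_tokens, deadline in jobs:
        t += true_tokens / rate                    # service uses the *true* cost
        if t <= deadline:
            met += 1
    return met / len(sizes)
```

With this mix, any noise level below roughly 90% provably preserves the short-before-long ordering (600 x (1 - noise) stays above 30 x (1 + noise)), so satisfaction stays at 1.0; a sweep well beyond that threshold is what would expose the breaking point the experiment asks about.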
Original abstract
When output token counts can be predicted at submission time (Gan et al., 2026), client-side scheduling against a black-box LLM API becomes semi-clairvoyant: decisions condition on coarse token priors even though the provider's internals remain hidden. We decompose this boundary problem into three separable concerns: allocation (inter-class share via adaptive DRR), ordering (intra-class sequencing with feasible-set scoring), and overload control (explicit admit/defer/reject on a cost ladder). An information ladder experiment shows that coarse magnitude priors -- not class labels alone -- are the practical threshold for useful client control; removing magnitude inflates short-request P95 by up to $5.8\times$ and degrades deadline satisfaction. Under balanced / high congestion the full stack achieves 100% completion, 100% deadline satisfaction, and useful goodput of $4.2 \pm 1.6$ SLO-meeting requests/s with short P95 within tens of milliseconds of quota-tiered isolation. A predictor-noise sweep confirms graceful degradation under up to 60% multiplicative error. Heavy-dominated regimes separate policies on completion, tail, and interpretable shedding. We further compare short-priority allocation (biased toward interactive traffic) with Fair Queuing (round-robin across classes): Fair Queuing achieves +32% short-request P90 improvement over FIFO with only +17% long-request overhead, versus Short-Priority's +27% / +116% trade-off -- demonstrating that the allocation layer accommodates different fairness objectives without changing the remaining stack. We contribute the three-layer client-side decomposition, controlled evaluation of joint metrics across regimes, allocation-policy alternatives, and overload-policy evidence linking cost-ladder shedding to the stated service objective.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper claims that when output token counts can be predicted at submission time (citing Gan et al. 2026), a client-side three-layer scheduler for black-box LLM APIs—consisting of adaptive DRR for inter-class allocation, feasible-set scoring for intra-class ordering, and cost-ladder overload control—achieves 100% request completion and deadline satisfaction under balanced/high congestion, with useful goodput of 4.2 ± 1.6 SLO-meeting requests/s and short-request P95 latencies within tens of milliseconds of quota-tiered isolation. An information-ladder experiment shows magnitude priors (not just class labels) are essential to avoid up to 5.8× P95 inflation, a noise sweep shows graceful degradation up to 60% multiplicative prediction error, and allocation-policy comparisons (short-priority vs. Fair Queuing) demonstrate flexibility without altering the rest of the stack.
Significance. If the results hold under the stated assumptions, the work offers a practical, separable decomposition for client-side SLO management of black-box LLM inference at scale, with controlled cross-regime evaluation and explicit policy trade-offs. The emphasis on coarse priors enabling effective control, plus reproducible policy alternatives, would be a useful contribution to systems for LLM API orchestration.
major comments (3)
- [§5] §5 (predictor-noise sweep and information-ladder experiment): the headline metrics (100% completion, 100% deadline satisfaction, 4.2 ± 1.6 goodput) are obtained only under the assumption that the external predictor from Gan et al. 2026 meets the required accuracy on the paper's workloads and SLO definitions. No independent measurement of prediction error rates for the tested request mix is reported; the 60% multiplicative noise sweep therefore tests robustness but does not establish that the baseline error is low enough for the 100% figures to be attainable in practice.
- [§4] §4 (experimental setup): the comparison to quota-tiered isolation and the reported P95 closeness are load-bearing for the central claim of near-optimal tail behavior, yet the manuscript provides no details on how the quota-tiered baseline is implemented, what its exact parameters are, or raw latency distributions, making it impossible to verify that the three-layer stack truly matches it within 'tens of milliseconds'.
- [Heavy-dominated regime evaluation] Heavy-dominated regime results: the separation of policies on completion, tail latency, and shedding is presented as evidence that the stack accommodates different objectives, but without the exact request-mix parameters, arrival rates, or deadline definitions used in that regime, it is difficult to assess whether the observed differences are robust or specific to the chosen synthetic mix.
minor comments (2)
- The term 'useful goodput' is used in the abstract and results but never given an explicit formula or definition in terms of the SLO parameters; adding a short equation or paragraph would improve clarity.
- Figure captions for the information-ladder and noise-sweep plots should include the exact request mix, number of runs, and confidence intervals rather than only the headline numbers.
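On the first minor comment, one plausible formalization of "useful goodput" consistent with the abstract's usage is the count of completed, SLO-meeting requests divided by wall-clock duration. The function name and tuple layout below are illustrative, not taken from the paper.

```python
def useful_goodput(results, duration_s):
    """Useful goodput = |{requests completed with latency <= SLO}| / duration.
    `results` is a list of (completed: bool, latency_s: float, slo_s: float)."""
    met = sum(1 for completed, latency, slo in results
              if completed and latency <= slo)
    return met / duration_s
```

For example, two SLO-meeting completions over a 2-second window yield 1.0 SLO-meeting requests/s under this definition.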
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed feedback. We address each major comment below, indicating where revisions will be made to improve clarity and reproducibility while noting the limits of what we can provide.
Point-by-point responses
-
Referee: [§5] §5 (predictor-noise sweep and information-ladder experiment): the headline metrics (100% completion, 100% deadline satisfaction, 4.2 ± 1.6 goodput) are obtained only under the assumption that the external predictor from Gan et al. 2026 meets the required accuracy on the paper's workloads and SLO definitions. No independent measurement of prediction error rates for the tested request mix is reported; the 60% multiplicative noise sweep therefore tests robustness but does not establish that the baseline error is low enough for the 100% figures to be attainable in practice.
Authors: We agree that the headline 100% completion and deadline-satisfaction figures are conditional on the accuracy of the predictor cited from Gan et al. (2026). The noise sweep shows graceful degradation up to 60% multiplicative error, but we did not independently measure the predictor's error rate on our specific request mix and SLO definitions. This is because the manuscript focuses on the client-side scheduling decomposition assuming a usable predictor is available (as stated in the abstract and introduction). In the revision we will add an explicit caveats section discussing this assumption, its implications for real-world use, and the fact that the 100% results hold only when prediction error remains below the demonstrated robustness threshold. We cannot supply an independent error measurement because we did not re-implement or evaluate the Gan et al. predictor ourselves. revision: partial
-
Referee: [§4] §4 (experimental setup): the comparison to quota-tiered isolation and the reported P95 closeness are load-bearing for the central claim of near-optimal tail behavior, yet the manuscript provides no details on how the quota-tiered baseline is implemented, what its exact parameters are, or raw latency distributions, making it impossible to verify that the three-layer stack truly matches it within 'tens of milliseconds'.
Authors: We acknowledge that the quota-tiered isolation baseline lacks sufficient implementation details for independent verification. In the revised manuscript we will add a dedicated subsection describing the baseline implementation, the exact quota parameters and tier definitions used, and additional summary statistics (including P95, P99, and inter-quartile ranges) from the latency distributions. This will allow readers to confirm that the three-layer scheduler's short-request P95 remains within tens of milliseconds of the baseline under the reported loads. Raw full distributions are voluminous but we will provide them in supplementary material or a public repository. revision: yes
-
Referee: [Heavy-dominated regime evaluation] Heavy-dominated regime results: the separation of policies on completion, tail latency, and shedding is presented as evidence that the stack accommodates different objectives, but without the exact request-mix parameters, arrival rates, or deadline definitions used in that regime, it is difficult to assess whether the observed differences are robust or specific to the chosen synthetic mix.
Authors: We agree that the heavy-dominated regime results require the precise experimental parameters for reproducibility and to evaluate robustness. In the revision we will include a table or appendix entry listing the exact request-mix composition (class proportions), arrival rates, and deadline definitions used in those experiments. This will make clear that the observed policy separations on completion rate, tail latency, and shedding behavior are tied to the stated synthetic workload and can be assessed accordingly. revision: yes
- Declined: independent measurement of prediction error rates for the Gan et al. (2026) predictor on the paper's workloads and SLO definitions, as this would require re-implementing and running their predictor, which lies outside the scope of the client-side scheduling contribution.
Circularity Check
No circularity detected; the scheduler design and empirical claims are independent of the cited external predictor.
Full rationale
The paper's core contribution is a three-layer client-side decomposition (adaptive DRR allocation, feasible-set ordering, cost-ladder overload control) whose performance is measured under the explicit assumption that coarse token priors are available from an external source (Gan et al., 2026). Experiments include an information-ladder test and a predictor-noise sweep up to 60% multiplicative error, but the design equations, policy comparisons (e.g., Fair Queuing vs. Short-Priority), and reported metrics (100% completion, 4.2 ± 1.6 SLO-meeting req/s) do not reduce by construction to any fitted parameter or self-citation chain inside the paper. The cited predictor is treated as an input rather than derived or renamed within the work, satisfying the criteria for a self-contained derivation against stated assumptions.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption Output token counts can be predicted at submission time
Reference graph
Works this paper leans on
[1] Zhenghao Gan, Yichen Bao, Yifei Liu, Chen Chen, Quan Chen, and Minyi Guo. SageSched: Efficient LLM Scheduling Confronting Demand Uncertainty and Hybridity. arXiv preprint arXiv:2603.07917, March 2026.
[2] G.-I. Yu et al. Orca: A Distributed Serving System for Transformer-Based Generative Models. In OSDI, 2022.
[3] W. Kwon et al. Efficient Memory Management for Large Language Model Serving with PagedAttention. In SOSP, 2023.
[4] Y. Zhong et al. DistServe: Disaggregating Prefill and Decoding for Goodput-Optimized Large Language Model Serving. In OSDI, 2024.
[5] A. Agrawal et al. Taming Throughput–Latency Tradeoff in LLM Inference with Sarathi-Serve. In OSDI, 2024.
[6] P. Patel et al. Splitwise: Efficient Generative LLM Inference Using Phase Splitting. In NSDI, 2024.
[7] A. K. Parekh and R. G. Gallager. A Generalized Processor Sharing Approach to Flow Control in Integrated Services Networks: The Single-Node Case. IEEE/ACM Trans. Networking, 1(3):344–357, 1993.
[8] D. Crankshaw et al. Clipper: A Low-Latency Online Prediction Serving System. In NSDI, 2017.
[9] F. Romero et al. INFaaS: Automated Model-less Inference Serving. In ATC, 2021.
[10] A. Gujarati et al. Serving DNNs like Clockwork: Performance Predictability from the Bottom Up. In OSDI, 2020.