pith. sign in

arxiv: 2504.11320 · v3 · pith:OSEMS3TZnew · submitted 2025-04-15 · 💻 cs.LG · cs.AI· cs.DC· math.OC· stat.ML

Optimizing LLM Inference: Fluid-Guided Online Scheduling with Memory Constraints

Pith reviewed 2026-05-22 19:39 UTC · model grok-4.3

classification 💻 cs.LG cs.AIcs.DCmath.OCstat.ML
keywords LLM inferencefluid modelKV cacheonline schedulingmemory constraintsstability regionadmission controlscheduling policy
0
0 comments X

The pith

Fluid model of KV-cache growth yields admission rules that approximate optimal LLM inference scheduling.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper treats token-by-token LLM inference as an online scheduling problem where each request's KV-cache grows endogenously, imposing hard memory limits on the GPU. It builds a fluid model to identify equilibrium batch compositions, required memory, and the region of stable operation. Using this model, the authors derive WAIT, a simple threshold rule for admitting requests when output lengths are known, and Nested WAIT, which segments the decode phase and adds a safety buffer when lengths are unknown. Simulations with Llama-2-7B on A100 hardware show these policies support a wider range of loads without instability and deliver lower latency than standard baselines, especially when the system approaches or exceeds capacity. Readers care because daily service of millions of users makes even modest gains in stability and speed translate to large cost and experience improvements.

Core claim

The authors formulate inference as a multi-stage online scheduling problem with endogenous memory growth, linear iteration times, and GPU-resident KV-cache constraints. They introduce a fluid model that characterizes equilibrium batch composition, memory requirement, and stability region. Guided by the fluid model, they design WAIT and Nested WAIT policies that approximate the fluid benchmark asymptotically under the stated memory conditions, with Nested WAIT adding a safety buffer for unknown output lengths. In Vidur simulations for Llama-2-7B on an A100 GPU, the policies enlarge the stable operating range and reduce latency relative to baselines, particularly in near-overloaded and loaded.

What carries the argument

Fluid model characterizing equilibrium batch composition, memory requirement, and stability region, which is used to derive threshold-based admission rules for requests.

Load-bearing premise

The fluid model accurately captures equilibrium batch composition, memory requirement, and stability region for the real multi-stage inference process with endogenous KV-cache growth.

What would settle it

If Vidur simulations configured for Llama-2-7B on an A100 GPU show that the policies do not enlarge the stable operating range or reduce latency compared to baseline algorithms, the approximation and performance claims would be falsified.

Figures

Figures reproduced from arXiv: 2504.11320 by David Simchi-Levi, Gan Luo, Ruicheng Ao, Xinshang Wang.

Figure 1
Figure 1. Figure 1: An example of LLM inference. Patterson et al. 2021). Efficient inference scheduling can reduce these operational costs and energy consump￾tion (Strubell et al. 2019, Desislavov et al. 2021) by balancing system throughput and response latency (Wu et al. 2022). To design effective scheduling, we must first understand the LLM inference process [PITH_FULL_IMAGE:figures/full_fig_p002_1.png] view at source ↗
Figure 2
Figure 2. Figure 2: Example of batching and scheduling. Experiments in Agrawal et al. (2023), Kwon et al. (2023), Zhong et al. (2024) show that iteration time scales linearly with total KV cache size. We model this as: τ = d0 + d1 · Xn1 i=1 lj(i) + Xn2 i ′=1 (lj(i ′) + si ′ ) ! | {z } Total KV cache size in batch , (1) where d0 denotes the fixed overhead time associated with processing a batch, and d1 represents the time cost… view at source ↗
Figure 3
Figure 3. Figure 3: Inference time of Llama-7B and Llama-13B with different number of tokens in the batch on a [PITH_FULL_IMAGE:figures/full_fig_p008_3.png] view at source ↗
Figure 4
Figure 4. Figure 4: Pipeline of Algorithm 1 4.3 Asymptotic Analysis We evaluate the performance of WAIT in an asymptotic regime that captures long-run system behav￾ior. Specifically, we scale the arrival rates to λ (ζ) j = ζλj and processing speeds accordingly ((d (ζ) 0 , d(ζ) 1 ) = (ζ −1d0, ζ−1d1)) while keeping memory capacity C fixed. This can equivalently be viewed as scaling the time horizon from T to ζT while keeping ar… view at source ↗
Figure 5
Figure 5. Figure 5: Pipeline of the Nested WAIT Algorithm Consider Example 3 with two prompt types: “Hello” → “Hi” (1 token, type 1) and “Hello” → “Hi there!” (2 tokens, type 2). At arrival, we cannot distinguish them. After one decode iteration, however, the prompts reveal themselves: type-1 prompts have completed their output (“Hi”) and exit; type-2 prompts have generated only their first token (“Hi”) and continue. Nested W… view at source ↗
Figure 6
Figure 6. Figure 6: Example of the Nested WAIT Algorithm finish first within segment 1, freeing memory without waiting for long prompts. Crucially, no prediction is needed—prompts simply reveal their output by completing or continuing. This achieves high memory utilization for all types simultaneously, not just the shortest or longest. We now formalize this intuition. 5.2 Main Theorem We now analyze Nested WAIT in the asympto… view at source ↗
Figure 7
Figure 7. Figure 7: Average throughput and latency across algorithms on the synthetic dataset with low demand. [PITH_FULL_IMAGE:figures/full_fig_p023_7.png] view at source ↗
Figure 8
Figure 8. Figure 8: Average throughput and latency across algorithms on the synthetic dataset with high demand. [PITH_FULL_IMAGE:figures/full_fig_p023_8.png] view at source ↗
Figure 9
Figure 9. Figure 9: Distribution of prefill and decode lengths in the real dataset. [PITH_FULL_IMAGE:figures/full_fig_p024_9.png] view at source ↗
Figure 10
Figure 10. Figure 10: Mean prefill lengths and prompt counts per decode length bin in the real dataset. [PITH_FULL_IMAGE:figures/full_fig_p024_10.png] view at source ↗
Figure 11
Figure 11. Figure 11: Average throughput and latency across algorithms on the synthetic dataset. [PITH_FULL_IMAGE:figures/full_fig_p025_11.png] view at source ↗
Figure 12
Figure 12. Figure 12: Average throughput and latency across algorithms on the real dataset with clustered output [PITH_FULL_IMAGE:figures/full_fig_p025_12.png] view at source ↗
Figure 13
Figure 13. Figure 13: Average throughput and latency across algorithms on the real dataset with clustered output [PITH_FULL_IMAGE:figures/full_fig_p025_13.png] view at source ↗
Figure 14
Figure 14. Figure 14: Average throughput and latency across algorithms on the real dataset with actual output lengths. [PITH_FULL_IMAGE:figures/full_fig_p026_14.png] view at source ↗
Figure 15
Figure 15. Figure 15: Arrival rate construction from real data, with rates proportional to normalized frequencies. [PITH_FULL_IMAGE:figures/full_fig_p029_15.png] view at source ↗
Figure 16
Figure 16. Figure 16: Memory usage proportions under low arrival rate (total rate = 50, [PITH_FULL_IMAGE:figures/full_fig_p029_16.png] view at source ↗
Figure 17
Figure 17. Figure 17: Memory usage proportions under high arrival rate (total rate = 500, [PITH_FULL_IMAGE:figures/full_fig_p030_17.png] view at source ↗
Figure 18
Figure 18. Figure 18: Memory usage proportions with varying time horizons (total rate = 50, [PITH_FULL_IMAGE:figures/full_fig_p030_18.png] view at source ↗
read the original abstract

Large language models now serve millions of users daily, with providers incurring costs exceeding $700,000 per day. Each request requires token-by-token inference, making GPU scheduling central to latency, capacity, and cost. The difficulty is endogenous memory growth: generated tokens expand the Key-Value (KV) cache, and overflow can evict in-progress requests and waste prior computation. We formulate inference as a multi-stage online scheduling problem with endogenous memory growth, linear iteration times, and GPU-resident KV-cache constraints. We introduce a fluid model that characterizes equilibrium batch composition, memory requirement, and stability region. Guided by the fluid model, we design WAIT (Waiting for Accumulated Inference Threshold), a threshold-based admission rule for known output lengths, and Nested WAIT, which extends the rule to unknown output lengths by regulating how requests advance across decode-stage segments. Both algorithms approximate the fluid benchmark asymptotically under the stated memory conditions. Nested WAIT uses an additional safety buffer of moderate scale to hedge against memory-overflow-induced evictions under unknown output lengths. In Vidur simulations configured for Llama-2-7B on an A100 GPU, with supplemental real-GPU validation reported in the appendix, the policies enlarge the empirically observed stable operating range relative to widely used baseline algorithms and reduce latency especially in near-overloaded and overloaded regimes.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper formulates LLM inference as a multi-stage online scheduling problem with endogenous KV-cache memory growth and linear iteration times. It introduces a fluid model characterizing equilibrium batch composition, memory usage, and stability region, then derives WAIT (for known output lengths) and Nested WAIT (for unknown lengths with a safety buffer) as threshold-based admission policies. Both are claimed to asymptotically approximate the fluid benchmark under stated memory conditions. Vidur simulations for Llama-2-7B on A100 (plus appendix real-GPU runs) show enlarged stable operating range and reduced latency versus baselines, especially near and beyond overload.

Significance. If the fluid-to-discrete translation holds, the work supplies an analytically grounded method for memory-aware scheduling that could improve utilization and tail latency in production LLM serving systems. The fluid equilibrium analysis offers a reusable lens for stability regions in systems with growing per-request state, which is a timely contribution given rising inference costs.

major comments (2)
  1. [§3] §3 (Fluid Model): The derivation assumes continuous-time equilibrium batching and linear per-iteration costs with known/segmented output lengths, yet the central claim that WAIT/Nested WAIT remain near-optimal for the discrete multi-stage process with stochastic output lengths and endogenous KV-cache growth is not accompanied by a quantitative bound on the approximation gap or a proof of stability preservation under bursty arrivals.
  2. [§5.3] §5.3 (Simulation Results): The reported enlargement of the stable operating range and latency reductions in near-overloaded regimes are presented without confidence intervals, variance estimates, or statistical significance tests across the Vidur runs; this weakens the empirical support for the claim that the fluid-guided thresholds outperform widely used baselines under realistic conditions.
minor comments (2)
  1. [§4.2] The definition of the safety buffer scale in Nested WAIT (around Eq. (12) or equivalent) should include a brief sensitivity analysis showing how performance varies with buffer size.
  2. [§5] Figure 3 (or equivalent stability-region plot) would benefit from explicit annotation of the fluid-derived thresholds used in the simulated policies.

Simulated Author's Rebuttal

2 responses · 1 unresolved

We thank the referee for their insightful comments and the opportunity to improve our manuscript. We address each major comment below, indicating planned revisions where appropriate.

read point-by-point responses
  1. Referee: [§3] §3 (Fluid Model): The derivation assumes continuous-time equilibrium batching and linear per-iteration costs with known/segmented output lengths, yet the central claim that WAIT/Nested WAIT remain near-optimal for the discrete multi-stage process with stochastic output lengths and endogenous KV-cache growth is not accompanied by a quantitative bound on the approximation gap or a proof of stability preservation under bursty arrivals.

    Authors: We agree that our fluid model provides an asymptotic characterization rather than a finite-time bound. The policies are derived to approximate the fluid benchmark asymptotically under the memory conditions specified in the paper. While we do not provide a quantitative error bound for the discrete stochastic case or a stability proof for arbitrary bursty arrivals, the Vidur simulations validate the practical performance, including under the arrival processes tested. In the revision, we will clarify the scope of the asymptotic claim and add a discussion of the limitations, including the challenges in deriving such bounds for systems with endogenous memory growth. revision: partial

  2. Referee: [§5.3] §5.3 (Simulation Results): The reported enlargement of the stable operating range and latency reductions in near-overloaded regimes are presented without confidence intervals, variance estimates, or statistical significance tests across the Vidur runs; this weakens the empirical support for the claim that the fluid-guided thresholds outperform widely used baselines under realistic conditions.

    Authors: We concur that statistical rigor would enhance the presentation of the results. We will revise Section 5.3 to include confidence intervals and variance estimates computed over multiple independent simulation runs. Additionally, we will report the results of statistical significance tests comparing the performance of WAIT and Nested WAIT against the baselines. revision: yes

standing simulated objections not resolved
  • A quantitative bound on the approximation gap or a formal proof of stability preservation under bursty arrivals in the discrete stochastic setting

Circularity Check

0 steps flagged

No significant circularity in fluid model derivation or policy design

full rationale

The paper constructs a fluid model from first principles to characterize equilibrium batch composition, memory requirements, and stability region under linear iteration times and KV-cache constraints. Threshold rules for WAIT and Nested WAIT are then derived from this model, with the asymptotic approximation claim being a direct mathematical consequence within the fluid limit rather than a restatement of fitted data. Vidur simulations and real-GPU runs constitute independent empirical tests on discrete request streams, not tautological outputs from the same parameters. No self-definitional reductions, fitted inputs renamed as predictions, or load-bearing self-citations appear in the provided derivation chain.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Based on the abstract alone, the central claim rests on the existence of a fluid limit that characterizes stable batch composition and memory use; no explicit free parameters, axioms, or invented entities are named in the provided text.

pith-pipeline@v0.9.0 · 5781 in / 1326 out tokens · 82252 ms · 2026-05-22T19:39:16.054795+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 6 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. SuperInfer: SLO-Aware Rotary Scheduling and Memory Management for LLM Inference on Superchips

    cs.DC 2026-01 conditional novelty 7.0

    SuperInfer improves TTFT SLO attainment by up to 74.7% on GH200 Superchips via SLO-aware rotary scheduling (RotaSched) and full-duplex KV cache rotation (DuplexKV) over NVLink-C2C while preserving TBT and throughput.

  2. Tackling the Data-Parallel Load Balancing Bottleneck in LLM Serving: Practical Online Routing at Scale

    cs.DC 2026-05 unverdicted novelty 6.0

    BalanceRoute reduces data-parallel imbalance in LLM inference via F-score routing and lookahead, yielding higher end-to-end throughput on 144-NPU clusters versus vLLM baselines.

  3. Tackling the Data-Parallel Load Balancing Bottleneck in LLM Serving: Practical Online Routing at Scale

    cs.DC 2026-05 unverdicted novelty 6.0

    BalanceRoute uses a piecewise-linear F-score (with optional short lookahead) for sticky request routing in LLM serving, reducing DP imbalance and raising end-to-end throughput versus vLLM baselines on production and A...

  4. A Queueing-Theoretic Framework for Stability Analysis of LLM Inference with KV Cache Memory Constraints

    cs.LG 2026-05 unverdicted novelty 6.0

    A queueing model derives stability conditions for LLM inference services under combined compute and KV cache memory limits, with experimental validation showing typical deviations under 10%.

  5. Flow-Controlled Scheduling for LLM Inference with Provable Stability Guarantees

    cs.LG 2026-04 unverdicted novelty 6.0

    A flow-control framework for LLM inference derives necessary and sufficient stability conditions and experimentally improves throughput, latency, and KV cache stability over common baselines.

  6. Position: LLM Serving Needs Mathematical Optimization and Algorithmic Foundations, Not Just Heuristics

    cs.DC 2026-05 accept novelty 4.0

    LLM serving requires mathematical optimization and algorithms with provable guarantees rather than generic heuristics that fail unpredictably on LLM workloads.

Reference graph

Works this paper leans on

12 extracted references · 12 canonical work pages · cited by 5 Pith papers · 4 internal anchors

  1. [1]

    Agrawal, A., Kedia, N., Mohan, J., Panwar, A., Kwatra, N., Gulavani, B., Ramjee, R., and Tumanov, A. (2024a). Vidur: A large-scale simulation framework for llm inference.Proceedings of Machine Learning and Systems, 6:351–366. Agrawal, A., Kedia, N., Panwar, A., Mohan, J., Kwatra, N., Gulavani, B., Tumanov, A., and Ramjee, R. (2024b). Taming{Throughput-Lat...

  2. [2]

    Qwen Technical Report

    Springer. Bai, J., Bai, S., Chu, Y., Cui, Z., Dang, K., Deng, X., Fan, Y., Ge, W., Han, Y., Huang, F., et al. (2023). Qwen technical report.arXiv preprint arXiv:2309.16609. Balkanski, E., Rubinstein, A., and Singer, Y. (2016). The power of optimization from samples.Advances in Neural Information Processing Systems,

  3. [3]

    B¨ auerle, N. (2002). Optimal control of queueing networks: An approach via fluid models.Advances in Applied Probability, 34(2):313–328. Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J. D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al. (2020). Language models are few-shot learners.Advances in neural information processing...

  4. [4]

    Scaling Laws for Neural Language Models

    Cambridge university press. Fu, Y., Zhu, S., Su, R., Qiao, A., Stoica, I., and Zhang, H. (2025). Efficient llm scheduling by learning to rank. Advances in Neural Information Processing Systems, 37:59006–59029. Garc´ ıa-Mart´ ın, E., Rodrigues, C. F., Riley, G., and Grahn, H. (2019). Estimation of energy consumption in machine learning.Journal of Parallel ...

  5. [5]

    Li, W., Wang, L., Chai, X., and Yuan, H. (2020). Online batch scheduling of simple linear deteriorating jobs with incompatible families.Mathematics, 8(2):170. Li, Y., Dai, J., and Peng, T. (2025b). Throughput-optimal scheduling algorithms for llm inference and ai agents. Accessed: 2025-04-11. Liu, P. and Lu, X. (2015). Online unbounded batch scheduling on...

  6. [6]

    Wang, M., Ye, Y., and Zhou, Z

    ”In vLLM V1, the default preemption mode is RECOMPUTE rather than SWAP, as recomputation has lower overhead in the V1 architecture”. Wang, M., Ye, Y., and Zhou, Z. (2025). Llm serving optimization with variable prefill and decode lengths.arXiv preprint arXiv:2508.06133. Wei, J., Tay, Y., Bommasani, R., Raffel, C., Zoph, B., Borgeaud, S., Yogatama, D., Bos...

  7. [7]

    At stage 0 and stage 1, both types appear identical to the scheduler

    The key constraint is that prompt types cannot be distinguished until a type-2 prompt enters stage 2 (generates its second output token). At stage 0 and stage 1, both types appear identical to the scheduler. Consider any admissible batching policy. For each possible batch configuration, one of the following must hold: 1.Memory underutilization:The batch l...

  8. [8]

    Define{ ˜W b}by: ˜W 0 = 2λ, ˜W b+1 = max{2λ, ˜W b +X b −λ}

    and maintains dominance throughout. Define{ ˜W b}by: ˜W 0 = 2λ, ˜W b+1 = max{2λ, ˜W b +X b −λ}. Then ˜W b ≥W b +λfor allb∈N. The proof is in Appendix E.2. 36 Lindley recursion.To analyze the coupled process, we apply the Lindley recursion representation (As- mussen 2003, Proposition 6.3): Lemma 3 (Lindley Representation)LetS k =Pk r=1(X r −λ)be the partia...

  9. [9]

    (19) These constraints ensure that for anyπ∈Π: when allmtypes are batched together, arrivals do not exceed completions for each type (i.e., Arrival j ≤Completion j)

    lj + l′ j 2 ≥M ∗. (19) These constraints ensure that for anyπ∈Π: when allmtypes are batched together, arrivals do not exceed completions for each type (i.e., Arrival j ≤Completion j). The equilibrium policy corresponds to π∗ = [ n∗ 1 l′ 1+1 ,· · ·, n∗ m l′m+1] whenM π =M ∗. Notation.Letbdenote the batch index,J b the set of types in batchb,t s(b) the star...

  10. [10]

    Negative drift ensures queue stability, preventing unbounded growth that would lead to eviction

    Expected Queue Length.SinceE h X b (k) i =n k−1pk −n k <0 (negative drift), by Kingman’s inequality (Kingman 1962): E h W b (k) i ≤E h ˜W b (k) i ≤n k + nk−1 ·p k(1−p k) 2 (nk −n k−1 ·p k) . Negative drift ensures queue stability, preventing unbounded growth that would lead to eviction. This bound ensures finite expected queue length for each segmentk≥2, ...

  11. [11]

    By Doob’s martingale inequality (Durrett 2019), we have that P( max 1≤i≤b−1 {eθkSi (k) } ≥a)≤ E M i a = E M0 a = 1 a

    This zero-drift property in log space enables Doob’s inequality to bound tail probabilities. By Doob’s martingale inequality (Durrett 2019), we have that P( max 1≤i≤b−1 {eθkSi (k) } ≥a)≤ E M i a = E M0 a = 1 a . Exponentiating both sides of the inequality event, we obtain: P( max 1≤i≤b−1 {eθS i (k) } ≥e θc)≤ 1 eθc ⇒P( max 1≤i≤b−1 {Si (k)} ≥c)≤e −θc. Using...

  12. [12]

    Thus we have the dominance chain: W b (1) ≤ ¯W b (1) ≤ ˜W b (1) ≤ W b (1),∀b∈N

    The process{ ˜W b (1)} is dominated by a Lindley process: W 0 (1) = 2n1, W b+1 (1) = max{2n1, W b (1) +X b (1)},∀b∈N. Thus we have the dominance chain: W b (1) ≤ ¯W b (1) ≤ ˜W b (1) ≤ W b (1),∀b∈N. Sincen 1 >Λ π, the process{ W b (1)}has negative drift. By Kingman’s inequality (Kingman 1962), with Var(Xb (1)) =Λ π, E h X b (1) i =n 1 −Λ π, we obtain E h W...