pith. sign in

arxiv: 2606.22327 · v2 · pith:F7AJY4QInew · submitted 2026-06-21 · 💻 cs.AI

Geometry-Aware Online Scheduling for LLM Serving: From Theoretical Bound to System Practice

Pith reviewed 2026-06-26 11:15 UTC · model grok-4.3

classification 💻 cs.AI
keywords LLM servingonline schedulingKV cachecompetitive ratiogeometry-aware algorithmsvLLMmemory management
0
0 comments X

The pith

Scheduling LLM requests by smallest KV-cache volume growth improves the worst-case competitive ratio from 48 to 3.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper sets out to replace time-centric heuristics such as Shortest Job First with a scheduling rule that accounts for the two-dimensional growth of memory in the key-value cache during LLM inference. It introduces the Smallest Volume First algorithm and proves, via a volume-certificate argument, that this rule guarantees a competitive ratio of 3 against the optimal offline schedule in high-concurrency settings. The same framework yields an efficient one-bit variant and a full classification of performance across traffic patterns and information levels. When inserted into vLLM, the method produces measurable drops in both average and tail latency on Llama-3.1 workloads. A reader would care because interactive LLM services are bottlenecked by memory pressure, and a tighter theoretical bound directly informs how much headroom can be reclaimed in production engines.

Core claim

The central claim is that modeling KV-cache expansion as a two-dimensional spatio-temporal geometric volume makes the Smallest Volume First ordering optimal in the sense that its worst-case competitive ratio is at most 3 in the high-concurrency regime, a sharpening of the previous best bound of 48; the proof is supplied by a novel volume-certificate technique, and the result extends to a complete taxonomy of the algorithm family under varying traffic and information assumptions.

What carries the argument

The Smallest Volume First (SVF) algorithm, which at each decision point selects the pending request whose projected KV-cache memory volume is smallest.

If this is right

  • SVF produces lower average and tail latency than time-centric baselines on Llama-3.1 models.
  • The one-bit SVF variant matches the performance of full SVF while requiring only a single bit of per-request information.
  • The algorithm integrates directly as a plug-and-play scheduler inside vLLM without altering the rest of the engine.
  • A complete theoretical taxonomy classifies the competitive ratios of SVF and its variants across different traffic scenarios and degrees of information availability.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • If the volume metric remains predictive under request mixing, the same scheduler could be applied without modification to serving multiple model sizes or architectures on one GPU.
  • A direct measurement of realized competitive ratio on production traces would indicate whether the bound of 3 is tight or can be tightened further by refined volume estimators.
  • Extending the geometric model to account for cache eviction or quantization effects would test whether the same proof technique continues to deliver constant-factor guarantees.

Load-bearing premise

That the memory footprint of an LLM request grows in a way that is accurately summarized by a single scalar volume metric derived from a two-dimensional geometric model.

What would settle it

A high-concurrency request trace for which the total latency or makespan produced by SVF is more than three times larger than the latency of the optimal offline schedule computed from the same trace.

Figures

Figures reproduced from arXiv: 2606.22327 by Li Kong, Qi Qi, Yinyu Ye, Zijie Zhou.

Figure 1
Figure 1. Figure 1: LLM Inference Process and the Calculation of Volume. [PITH_FULL_IMAGE:figures/full_fig_p002_1.png] view at source ↗
Figure 2
Figure 2. Figure 2: Practical Performance under Poisson Arrivals [PITH_FULL_IMAGE:figures/full_fig_p008_2.png] view at source ↗
Figure 3
Figure 3. Figure 3: Validation of Theoretical Claims under Poisson Arrivals [PITH_FULL_IMAGE:figures/full_fig_p009_3.png] view at source ↗
read the original abstract

The explosive demand for interactive Large Language Model serving has highlighted the management of the Key-Value cache's dynamic memory footprint as a critical area for performance optimization in inference engines. Modern inference systems overwhelmingly rely on time-centric scheduling heuristics, such as Shortest Job First. However, their theoretical optimality is rooted in traditional schedule modeling, failing to capture the highly dynamic, 2D spatio-temporal geometric growth specific to LLM inference mechanisms. To resolve this, we propose the geometry-aware online scheduling by introducing the Smallest Volume First (SVF) algorithm and its highly efficient variant, 1-bit SVF. Theoretically, we provide a rigorous mathematical foundation for our approach. Via a novel volume-certificate proof, we sharpen SVF's worst-case competitive ratio from the prior best of 48 towards \textbf{3} in the high-concurrency regime of LLM serving. Building upon this core breakthrough, we complete a comprehensive theoretical taxonomy analyzing our algorithms across different traffic scenarios and information availability. Practically, we seamlessly integrate our approach as a plug-and-play layer in vLLM. Extensive evaluations on Llama-3.1 models demonstrate comprehensive performance gains: SVF delivers strong reductions in both average and tail latency, while 1-bit SVF, with merely a single bit information, achieves competitive throughput and latency. This work establishes a theoretically sound and empirically proven approach for resolving memory-constrained scheduling in modern LLM deployments. To facilitate future research, our code is available at https://github.com/Aurora-Kl/Geometry-Aware-Online-Scheduling.git.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The manuscript proposes Smallest Volume First (SVF) and 1-bit SVF as geometry-aware online schedulers for LLM inference, motivated by a 2D spatio-temporal geometric model of KV-cache growth. It claims a novel volume-certificate proof that tightens the worst-case competitive ratio to 3 (from prior 48) in high-concurrency regimes, supplies a theoretical taxonomy across traffic and information settings, and reports latency/throughput gains when plugged into vLLM on Llama-3.1 models.

Significance. If the geometric model is faithful and the proof is correct, the work would supply the first non-trivial competitive-ratio improvement for memory-constrained LLM scheduling and a practical drop-in replacement for time-centric heuristics such as SJF. The open-source integration and code release would further strengthen its impact.

major comments (2)
  1. [Abstract] Abstract and introduction: the central claim of a volume-certificate proof sharpening the competitive ratio to 3 is asserted without any model equations, derivation steps, or statement of the 2D geometric growth assumptions; this prevents verification of whether the ratio is independent of input parameters or reduces by construction.
  2. [Abstract] No quantitative experimental numbers, tables, or figures are referenced in the abstract or summary; the claimed reductions in average and tail latency therefore cannot be assessed for effect size or statistical significance.
minor comments (1)
  1. The abstract states that code is available at a GitHub link; confirm that the repository contains the exact experimental configurations and the proof scripts used to obtain the ratio-3 bound.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments. We address each major point below and will make targeted revisions to the abstract and introduction to improve verifiability while preserving the manuscript's focus.

read point-by-point responses
  1. Referee: [Abstract] Abstract and introduction: the central claim of a volume-certificate proof sharpening the competitive ratio to 3 is asserted without any model equations, derivation steps, or statement of the 2D geometric growth assumptions; this prevents verification of whether the ratio is independent of input parameters or reduces by construction.

    Authors: The full manuscript presents the 2D spatio-temporal geometric model of KV-cache growth in Section 3 and the volume-certificate proof in Section 4, establishing that the competitive ratio of 3 holds in the high-concurrency regime and is independent of specific input parameters under the model's assumptions (rather than reducing by construction). To address the concern about accessibility in the abstract and introduction, we will revise both sections to include a concise statement of the key geometric assumptions and a high-level outline of the proof technique. revision: yes

  2. Referee: [Abstract] No quantitative experimental numbers, tables, or figures are referenced in the abstract or summary; the claimed reductions in average and tail latency therefore cannot be assessed for effect size or statistical significance.

    Authors: We agree that referencing specific quantitative results would allow better assessment of effect sizes. In the revised version, we will update the abstract to include key experimental outcomes from the vLLM integration on Llama-3.1 models (e.g., observed reductions in average and tail latency) and explicitly reference the relevant tables and figures that report the full statistical details. revision: yes

Circularity Check

0 steps flagged

No significant circularity identified

full rationale

The paper's central claim is a novel volume-certificate proof deriving an improved worst-case competitive ratio of 3 for the SVF algorithm under an explicit 2D spatio-temporal KV-cache model. No equations, fitted parameters, or self-citations are shown that would reduce this ratio to a quantity defined by the input data or prior results by construction. The modeling assumption is stated upfront as the reason time-centric bounds fail, and the proof technique is presented as independent mathematical content. The derivation chain remains self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The approach rests on a single domain modeling assumption about KV-cache geometry; no free parameters or new entities are introduced in the abstract.

axioms (1)
  • domain assumption The dynamic memory footprint of LLM KV cache exhibits highly dynamic 2D spatio-temporal geometric growth.
    This modeling premise is invoked to explain why time-centric heuristics are suboptimal and to motivate the volume-based metric.

pith-pipeline@v0.9.1-grok · 5817 in / 1179 out tokens · 37560 ms · 2026-06-26T11:15:32.074017+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

38 extracted references · 7 linked inside Pith

  1. [1]

    Llm inference serving: Survey of recent advances and opportunities

    Baolin Li, Yankai Jiang, Vijay Gadepally, and Devesh Tiwari. Llm inference serving: Survey of recent advances and opportunities. In2024 IEEE High Performance Extreme Computing Conference (HPEC), pages 1–8. IEEE, 2024

  2. [2]

    Taming the titans: A survey of efficient llm inference serving

    Ranran Zhen, Juntao Li, Yixin Ji, Zhenlin Yang, Tong Liu, Qingrong Xia, Xinyu Duan, Zhefeng Wang, Baoxing Huai, and Min Zhang. Taming the titans: A survey of efficient llm inference serving. InProceedings of the 18th International Natural Language Generation Conference, pages 522–541, 2025

  3. [3]

    Gpt-4 technical report.arXiv preprint arXiv:2303.08774, 2023

    Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. Gpt-4 technical report.arXiv preprint arXiv:2303.08774, 2023

  4. [4]

    When search engine services meet large language models: visions and challenges.IEEE Transactions on Services Computing, 17(6):4558–4577, 2024

    Haoyi Xiong, Jiang Bian, Yuchen Li, Xuhong Li, Mengnan Du, Shuaiqiang Wang, Dawei Yin, and Sumi Helal. When search engine services meet large language models: visions and challenges.IEEE Transactions on Services Computing, 17(6):4558–4577, 2024

  5. [5]

    Deepseek-v3 technical report.arXiv preprint arXiv:2412.19437, 2024

    Aixin Liu, Bei Feng, Bing Xue, Bingxuan Wang, Bochao Wu, Chengda Lu, Chenggang Zhao, Chengqi Deng, Chenyu Zhang, Chong Ruan, et al. Deepseek-v3 technical report.arXiv preprint arXiv:2412.19437, 2024

  6. [6]

    Ai-assisted code authoring at scale: Fine-tuning, deploying, and mixed methods evaluation.Proceedings of the ACM on Software Engineering, 1(FSE):1066–1085, 2024

    Vijayaraghavan Murali, Chandra Maddila, Imad Ahmad, Michael Bolin, Daniel Cheng, Negar Ghorbani, Renuka Fernandez, Nachiappan Nagappan, and Peter C Rigby. Ai-assisted code authoring at scale: Fine-tuning, deploying, and mixed methods evaluation.Proceedings of the ACM on Software Engineering, 1(FSE):1066–1085, 2024

  7. [7]

    Orca: A distributed serving system for {Transformer-Based} generative models

    Gyeong-In Yu, Joo Seong Jeong, Geon-Woo Kim, Soojeong Kim, and Byung-Gon Chun. Orca: A distributed serving system for {Transformer-Based} generative models. In16th USENIX symposium on operating systems design and implementation (OSDI 22), pages 521–538, 2022

  8. [8]

    Efficient memory management for large language model serving with pagedattention

    Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, Ying Sheng, Lianmin Zheng, Cody Hao Yu, Joseph Gonzalez, Hao Zhang, and Ion Stoica. Efficient memory management for large language model serving with pagedattention. In Proceedings of the 29th symposium on operating systems principles, pages 611–626, 2023

  9. [9]

    Online scheduling for llm inference with kv cache constraints.arXiv preprint arXiv:2502.07115, 2025

    Patrick Jaillet, Jiashuo Jiang, Konstantina Mellou, Marco Molinaro, Chara Podimata, and Zijie Zhou. Online scheduling for llm inference with kv cache constraints.arXiv preprint arXiv:2502.07115, 2025

  10. [10]

    Llm serving optimization with variable prefill and decode lengths

    Meixuan Wang, Yinyu Ye, and Zijie Zhou. Llm serving optimization with variable prefill and decode lengths. arXiv preprint arXiv:2508.06133, 2025

  11. [11]

    Position: Llm serving needs mathematical optimization and algorithmic foundations, not just heuristics, 2026

    Zijie Zhou. Position: Llm serving needs mathematical optimization and algorithmic foundations, not just heuristics, 2026

  12. [12]

    Servegen: Workload characterization and generation of large language model serving in production.arXiv preprint arXiv:2505.09999, 2025

    Yuxing Xiang, Xue Li, Kun Qian, Wenyuan Yu, Ennan Zhai, and Xin Jin. Servegen: Workload characterization and generation of large language model serving in production.arXiv preprint arXiv:2505.09999, 2025

  13. [13]

    The llama 3 herd of models.arXiv preprint arXiv:2407.21783, 2024

    Aaron Grattafiori, Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Alex Vaughan, et al. The llama 3 herd of models.arXiv preprint arXiv:2407.21783, 2024

  14. [14]

    Megatron-lm: Training multi-billion parameter language models using model parallelism.arXiv preprint arXiv:1909.08053, 2019

    Mohammad Shoeybi, Mostofa Patwary, Raul Puri, Patrick LeGresley, Jared Casper, and Bryan Catanzaro. Megatron-lm: Training multi-billion parameter language models using model parallelism.arXiv preprint arXiv:1909.08053, 2019

  15. [15]

    P Xing, Joseph E

    Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Tianle Li, Siyuan Zhuang, Zhanghao Wu, Yonghao Zhuang, Zhuohan Li, Zi Lin, Eric. P Xing, Joseph E. Gonzalez, Ion Stoica, and Hao Zhang. Lmsys-chat-1m: A large-scale real-world llm conversation dataset, 2023

  16. [16]

    Longbench: A bilingual, multitask benchmark for long context understanding, 2023

    Yushi Bai, Xin Lv, Jiajie Zhang, Hongchang Lyu, Jiankai Tang, Zhidian Huang, Zhengxiao Du, Xiao Liu, Aohan Zeng, Lei Hou, Yuxiao Dong, Jie Tang, and Juanzi Li. Longbench: A bilingual, multitask benchmark for long context understanding, 2023

  17. [17]

    Efficient interactive llm serving with proxy model-based sequence length prediction.arXiv preprint arXiv:2404.08509, 2024

    Haoran Qiu, Weichao Mao, Archit Patke, Shengkun Cui, Saurabh Jha, Chen Wang, Hubertus Franke, Zbigniew T Kalbarczyk, Tamer Ba¸ sar, and Ravishankar K Iyer. Efficient interactive llm serving with proxy model-based sequence length prediction.arXiv preprint arXiv:2404.08509, 2024

  18. [18]

    Bert: Pre-training of deep bidirectional transformers for language understanding

    Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. Bert: Pre-training of deep bidirectional transformers for language understanding. InProceedings of the 2019 conference of the North American chapter of the association for computational linguistics: human language technologies, volume 1 (long and short papers), pages 4171–4186, 2019. 11 Geo...

  19. [19]

    Well-read students learn better: The impact of student initialization on knowledge distillation.CoRR, abs/1908.08962, 2019

    Iulia Turc, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. Well-read students learn better: The impact of student initialization on knowledge distillation.CoRR, abs/1908.08962, 2019

  20. [20]

    Prompt-aware scheduling for low-latency llm serving.arXiv preprint arXiv:2510.03243, 2025

    Yiheng Tao, Yihe Zhang, Matthew T Dearing, Xin Wang, Yuping Fan, and Zhiling Lan. Prompt-aware scheduling for low-latency llm serving.arXiv preprint arXiv:2510.03243, 2025

  21. [21]

    Efficient llm scheduling by learning to rank.Advances in Neural Information Processing Systems, 37:59006–59029, 2024

    Yichao Fu, Siqi Zhu, Runlong Su, Aurick Qiao, Ion Stoica, and Hao Zhang. Efficient llm scheduling by learning to rank.Advances in Neural Information Processing Systems, 37:59006–59029, 2024

  22. [22]

    Shinjuku: Preemptive scheduling for {µsecond-scale} tail latency

    Kostis Kaffes, Timothy Chong, Jack Tigar Humphries, Adam Belay, David Mazières, and Christos Kozyrakis. Shinjuku: Preemptive scheduling for {µsecond-scale} tail latency. In16th USENIX Symposium on Networked Systems Design and Implementation (NSDI 19), pages 345–360, 2019

  23. [23]

    Fast distributed inference serving for large language models.arXiv preprint arXiv:2305.05920, 2023

    Bingyang Wu, Yinmin Zhong, Zili Zhang, Shengyu Liu, Fangyue Liu, Yuanhang Sun, Gang Huang, Xuanzhe Liu, and Xin Jin. Fast distributed inference serving for large language models.arXiv preprint arXiv:2305.05920, 2023

  24. [24]

    Yunho Jin, Chun-Feng Wu, David Brooks, and Gu-Yeon Wei.s3: Increasing gpu utilization during generative inference for higher throughput.Advances in Neural Information Processing Systems, 36:18015–18027, 2023

  25. [25]

    Dynamollm: Designing llm inference clusters for performance and energy efficiency

    Jovan Stojkovic, Chaojie Zhang, Íñigo Goiri, Josep Torrellas, and Esha Choukse. Dynamollm: Designing llm inference clusters for performance and energy efficiency. In2025 IEEE International Symposium on High Performance Computer Architecture (HPCA), pages 1348–1362. IEEE, 2025

  26. [26]

    Response length perception and sequence scheduling: An llm-empowered llm inference pipeline.Advances in Neural Information Processing Systems, 36:65517–65530, 2023

    Zangwei Zheng, Xiaozhe Ren, Fuzhao Xue, Yang Luo, Xin Jiang, and Yang You. Response length perception and sequence scheduling: An llm-empowered llm inference pipeline.Advances in Neural Information Processing Systems, 36:65517–65530, 2023

  27. [27]

    Inference without interference: Disaggregate llm inference for mixed downstream workloads.arXiv preprint arXiv:2401.11181, 2024

    Cunchen Hu, Heyang Huang, Liangliang Xu, Xusheng Chen, Jiang Xu, Shuang Chen, Hao Feng, Chenxi Wang, Sa Wang, Yungang Bao, et al. Inference without interference: Disaggregate llm inference for mixed downstream workloads.arXiv preprint arXiv:2401.11181, 2024

  28. [28]

    Power-aware deep learning model serving with{µ-Serve}

    Haoran Qiu, Weichao Mao, Archit Patke, Shengkun Cui, Saurabh Jha, Chen Wang, Hubertus Franke, Zbigniew Kalbarczyk, Tamer Ba¸ sar, and Ravishankar K Iyer. Power-aware deep learning model serving with{µ-Serve}. In 2024 USENIX Annual Technical Conference (USENIX ATC 24), pages 75–93, 2024

  29. [29]

    Don’t stop me now: Embedding based scheduling for llms.arXiv preprint arXiv:2410.01035, 2024

    Rana Shahout, Eran Malach, Chunwei Liu, Weifan Jiang, Minlan Yu, and Michael Mitzenmacher. Don’t stop me now: Embedding based scheduling for llms.arXiv preprint arXiv:2410.01035, 2024

  30. [30]

    Enabling efficient batch serving for lmaas via generation length prediction

    Ke Cheng, Wen Hu, Zhi Wang, Peng Du, Jianguo Li, and Sheng Zhang. Enabling efficient batch serving for lmaas via generation length prediction. In2024 IEEE International Conference on Web Services (ICWS), pages 853–864. IEEE, 2024

  31. [31]

    Predicting llm output length via entropy-guided representations.arXiv preprint arXiv:2602.11812, 2026

    Huanyi Xie, Yubin Chen, Liangyu Wang, Lijie Hu, and Di Wang. Predicting llm output length via entropy-guided representations.arXiv preprint arXiv:2602.11812, 2026

  32. [32]

    Various optimizers for single-stage production

    Wayne E Smith. Various optimizers for single-stage production. Technical report, 1955

  33. [33]

    A proof of the optimality of the shortest remaining processing time discipline.Operations Research, 16(3):687–690, 1968

    Linus Schrage. A proof of the optimality of the shortest remaining processing time discipline.Operations Research, 16(3):687–690, 1968

  34. [34]

    Adaptively robust llm inference optimization under prediction uncertainty

    Zixi Chen, Yinyu Ye, and Zijie Zhou. Adaptively robust llm inference optimization under prediction uncertainty. arXiv preprint arXiv:2508.14544, 2025

  35. [35]

    Optimizing llm inference: Fluid-guided online scheduling with memory constraints.arXiv preprint arXiv:2504.11320, 2025

    Ruicheng Ao, Gan Luo, David Simchi-Levi, and Xinshang Wang. Optimizing llm inference: Fluid-guided online scheduling with memory constraints.arXiv preprint arXiv:2504.11320, 2025

  36. [36]

    Greenllm: Slo-aware dynamic frequency scaling for energy-efficient llm serving.arXiv preprint arXiv:2508.16449, 2025

    Qunyou Liu, Darong Huang, Marina Zapater, and David Atienza. Greenllm: Slo-aware dynamic frequency scaling for energy-efficient llm serving.arXiv preprint arXiv:2508.16449, 2025

  37. [37]

    memory shadow

    Bowen Pang, Kai Li, and Feifan Wang. Optimizing llm inference throughput via memory-aware and sla-constrained dynamic batching.arXiv preprint arXiv:2503.05248, 2025. 12 Geometry-Aware Online Scheduling for LLM Serving: From Theoretical Bound to System Practice A Related Works LLM inference scheduling. Efficient serving of LLMs is pivotal for interactive a...

  38. [38]

    From the law of total expectation,p 1O1 +p 2O2 = 1 µ =⇒p 1O1 = 1 µ −p 2O2. Substituting this constraint: p1O2 1 +p 2O2 2 = ( 1 µ −p 2O2)2 p1 +p 2O2 2 = 1 µ2 − 2p2O2 µ +p 2 2O2 2 +p 1p2O2 2 p1 = 1 µ2 −p 2 2O2 µ −O 2 2 p1 (89) Because the long proxy is O2 =θ+ 1 µ, its internal term simplifies beautifully: 2O2 µ −O 2 2 = 1 µ2 −θ 2. Substituting this back: p1...