Position: LLM Serving Needs Mathematical Optimization and Algorithmic Foundations, Not Just Heuristics
Pith reviewed 2026-05-09 18:50 UTC · model grok-4.3
The pith
LLM serving systems must replace generic heuristics with mathematical models that capture their distinctive traits to achieve reliable performance.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
LLM inference serving has outgrown generic heuristics from classical distributed computing and now requires mathematical optimization and algorithmic foundations. Systems such as vLLM and SGLang continue to use request routing via join-shortest-queue or round-robin, FIFO scheduling, and LRU cache eviction, all of which ignore the distinctive structure of LLM inference: dynamically growing KV cache memory, prefill-decode phase asymmetry, unknown output lengths, and continuous batching constraints. Mathematical models that capture these characteristics enable the design of algorithms with provable performance guarantees across diverse workloads, and emerging work at the intersection of operations research and ML systems demonstrates that principled methods can match or exceed heuristic performance while providing theoretical guarantees.
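The contrast between these policies is easy to state in code. A minimal sketch (the memory-aware variant and all names are illustrative assumptions, not drawn from vLLM, SGLang, or the paper itself) of join-shortest-queue routing next to a KV-cache-aware alternative:

```python
def join_shortest_queue(queues):
    """Classical JSQ: route to the replica with the fewest queued requests,
    ignoring how much KV-cache memory each queued request will grow to use."""
    return min(range(len(queues)), key=lambda i: len(queues[i]))

def kv_aware_route(queues, kv_used, kv_capacity, expected_tokens):
    """Hypothetical LLM-aware variant: route to a replica that can still fit
    the request's expected KV-cache growth, breaking ties by queue length.
    expected_tokens is an estimate of prompt length plus output length."""
    feasible = [i for i in range(len(queues))
                if kv_used[i] + expected_tokens <= kv_capacity[i]]
    candidates = feasible or list(range(len(queues)))
    return min(candidates, key=lambda i: len(queues[i]))

# Two replicas: replica 0 has a shorter queue but is nearly out of KV memory.
queues = [["req_a"], ["req_b", "req_c"]]
kv_used, kv_capacity = [950, 200], [1000, 1000]
print(join_shortest_queue(queues))                                        # 0
print(kv_aware_route(queues, kv_used, kv_capacity, expected_tokens=300))  # 1
```

JSQ picks replica 0 here because its queue is shorter, even though the incoming request's KV cache would not fit there; the memory-aware rule routes around the near-full replica. The paper's claim is that such LLM-specific state belongs in the routing objective itself, ideally with provable bounds rather than a hand-tuned rule.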
What carries the argument
Mathematical models of LLM-specific traits such as dynamically growing KV cache and prefill-decode asymmetry that support algorithms with provable performance guarantees instead of general-purpose heuristics.
If this is right
- Request routing and scheduling policies can be replaced by algorithms that guarantee performance bounds across varying model sizes and request patterns.
- KV cache management can move beyond LRU to eviction rules derived from optimization objectives that account for token generation dynamics.
- Continuous batching decisions can incorporate provable analysis of prefill versus decode costs rather than ad-hoc rules.
- The community can treat LLM serving as an area for rigorous algorithmic research rather than solely systems engineering.
- Early intersections of operations research and ML systems already demonstrate that such principled methods are feasible.
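The KV cache bullet above can be made concrete. A small sketch (the scoring rule is a hypothetical illustration, not the paper's proposal) of eviction that weighs recomputation cost against recency, rather than recency alone as LRU does:

```python
def lru_victim(cache):
    """Classical LRU: evict the least recently used entry."""
    return min(cache, key=lambda k: cache[k]["last_used"])

def cost_aware_victim(cache, now):
    """Hypothetical objective-derived rule: evict the entry with the lowest
    recompute_cost / idle_time ratio, so cheap-to-rebuild, long-idle prefixes
    go first. recompute_cost could be the cached prefix length, since prefill
    cost scales with the number of tokens to re-encode."""
    def score(k):
        idle = max(now - cache[k]["last_used"], 1e-9)
        return cache[k]["recompute_cost"] / idle
    return min(cache, key=score)

cache = {
    "short_prefix": {"last_used": 90, "recompute_cost": 10},    # recent, cheap
    "long_prefix":  {"last_used": 10, "recompute_cost": 5000},  # old, costly
}
print(lru_victim(cache))              # "long_prefix" (least recent)
print(cost_aware_victim(cache, 100))  # "short_prefix" (cheap to rebuild)
```

LRU throws away the old, expensive-to-recompute prefix; the cost-aware rule keeps it and evicts the cheap one. Whether such a rule carries a competitive-ratio guarantee under token-generation dynamics is exactly the kind of question the paper wants formalized.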
Where Pith is reading between the lines
- These models could enable better resource provisioning in multi-tenant clusters by quantifying trade-offs between batch size and memory growth.
- Integration with existing queueing theory might yield new bounds on tail latency for variable-length outputs.
- Empirical validation on public traces from different model families would help test whether the guarantees translate outside controlled settings.
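The queueing speculation above is testable in miniature. A sketch with a synthetic heavy-tailed workload (not real traces) showing why service order matters when output lengths vary: with all jobs present at time 0, serving shorter jobs first lowers intermediate completion-time quantiles.

```python
import random

def simulate_fifo(service_times):
    """Single-server queue with all jobs present at time 0: each job's
    completion time is the running prefix sum of the service times ahead."""
    completions, elapsed = [], 0.0
    for s in service_times:
        elapsed += s
        completions.append(elapsed)
    return completions

def percentile(xs, q):
    """Nearest-rank percentile of a list."""
    xs = sorted(xs)
    return xs[min(int(q * len(xs)), len(xs) - 1)]

random.seed(0)
# Decode time proportional to output length; Pareto-distributed lengths mimic
# the unknown, heavy-tailed output lengths the paper highlights.
lengths = [int(random.paretovariate(1.5) * 50) + 1 for _ in range(1000)]
arrival_order = [n * 0.01 for n in lengths]   # FIFO in arrival order
shortest_first = sorted(arrival_order)        # shortest-job-first

fifo_median = percentile(simulate_fifo(arrival_order), 0.5)
sjf_median = percentile(simulate_fifo(shortest_first), 0.5)
print(fifo_median, sjf_median)  # SJF median completion time is lower
```

Shortest-job-first is the textbook optimum for mean flow time, but it needs the very output lengths that are unknown at admission; that gap is where length prediction combined with queueing analysis, the direction speculated above, would have to supply actual tail-latency bounds.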
Load-bearing premise
That mathematical models capturing LLM characteristics like growing KV cache and prefill-decode asymmetry can be developed to produce algorithms with guarantees that match or exceed current heuristics in practice.
What would settle it
A set of production workloads on which, without exception, every algorithm derived from such mathematical models underperforms standard heuristics on both latency and throughput.
Original abstract
This position paper argues that LLM inference serving has outgrown generic heuristics and now demands mathematical optimization and algorithmic foundations. Despite rapid advances in serving systems such as vLLM and SGLang, their algorithmic cores remain largely unchanged from classical distributed computing: request routing uses join-shortest-queue or round-robin, scheduling defaults to FIFO, and KV cache eviction follows LRU. These general-purpose policies ignore the distinctive structure of LLM inference--dynamically growing KV cache memory, prefill-decode phase asymmetry, unknown output lengths, and continuous batching constraints. We contend that the field must develop mathematical models capturing these characteristics, enabling the design of algorithms with provable performance guarantees across diverse workloads, rather than heuristics that may succeed in some scenarios but fail unpredictably in others. Emerging work at the intersection of operations research and ML systems demonstrates that principled methods can match or exceed heuristic performance while providing theoretical guarantees. We call on the community to recognize algorithmic design for LLM serving as a research frontier.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper is a position piece asserting that current LLM inference serving systems rely on classical heuristics (e.g., join-shortest-queue routing, FIFO scheduling, LRU KV cache eviction) that do not account for LLM-specific features such as dynamically growing KV caches, prefill-decode asymmetry, unknown output lengths, and continuous batching. It advocates for the creation of mathematical models and algorithms offering provable performance guarantees, referencing promising work at the operations research and ML systems intersection, and positions algorithmic foundations for LLM serving as an important research direction.
Significance. Should the community adopt this perspective, it would encourage the development of more reliable and theoretically grounded serving algorithms, potentially leading to improved efficiency and predictability in LLM deployments. The paper's identification of the mismatch between generic policies and LLM traits provides a clear motivation for shifting from heuristics to principled methods, which could yield algorithms with guarantees that generalize better across workloads.
Simulated Author's Rebuttal
We thank the referee for their positive review and recommendation to accept the manuscript. The referee's summary accurately reflects our core argument that LLM serving systems require mathematical optimization and algorithms with provable guarantees, rather than generic heuristics that overlook LLM-specific characteristics such as dynamic KV cache growth and prefill-decode asymmetry.
Circularity Check
No significant circularity; position paper with no derivations
full rationale
This position paper advocates for mathematical optimization in LLM serving by contrasting classical heuristics (JSQ, FIFO, LRU) with LLM-specific traits (growing KV cache, prefill-decode asymmetry, unknown lengths, continuous batching) and calling for future models with provable guarantees. It presents no equations, algorithms, fitted parameters, or derivations of its own; the sole empirical reference to emerging OR/ML work is offered as motivation rather than a load-bearing result. No self-citations, self-definitional steps, or reductions to inputs exist, so the argument remains self-contained external observation.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: LLM inference exhibits distinctive structure including dynamically growing KV cache memory, prefill-decode phase asymmetry, unknown output lengths, and continuous batching constraints.