pith. machine review for the scientific record.

arxiv: 2605.01280 · v1 · submitted 2026-05-02 · 💻 cs.DC · cs.AI

Recognition: unknown

Position: LLM Serving Needs Mathematical Optimization and Algorithmic Foundations, Not Just Heuristics

Authors on Pith: no claims yet

Pith reviewed 2026-05-09 18:50 UTC · model grok-4.3

classification 💻 cs.DC cs.AI
keywords LLM serving · mathematical optimization · algorithmic foundations · KV cache management · prefill-decode asymmetry · provable guarantees · inference scheduling · continuous batching

The pith

LLM serving systems must replace generic heuristics with mathematical models that capture their distinctive traits to achieve reliable performance.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper contends that current LLM inference systems rely on classical policies like round-robin routing and LRU eviction that overlook key LLM features. It argues for building mathematical models of dynamically growing KV caches, prefill-decode asymmetry, unknown output lengths, and continuous batching so that algorithms with provable guarantees can be designed. A sympathetic reader would care because unpredictable heuristic failures limit scalable deployment of large models, while principled methods from operations research have already matched or beaten existing approaches in early cases. The position calls for treating algorithmic design in this domain as a distinct research area rather than an engineering afterthought.

Core claim

LLM inference serving has outgrown generic heuristics from classical distributed computing and now requires mathematical optimization and algorithmic foundations. Systems such as vLLM and SGLang continue to use request routing via join-shortest-queue or round-robin, FIFO scheduling, and LRU cache eviction, all of which ignore the distinctive structure of LLM inference, including dynamically growing KV cache memory, prefill-decode phase asymmetry, unknown output lengths, and continuous batching constraints. Mathematical models that capture these characteristics enable the design of algorithms with provable performance guarantees across diverse workloads, and emerging work at the intersection of operations research and ML systems shows that principled methods can match or exceed heuristic performance while providing theoretical guarantees.
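
To make the contrast concrete, here is a minimal routing sketch, not taken from the paper: the replica fields, the `kv_aware` policy, and the output-length estimate are all illustrative assumptions. It shows what round-robin and join-shortest-queue ignore, namely per-replica KV-cache occupancy and the memory a new request will grow into.

```python
from dataclasses import dataclass
from itertools import cycle

@dataclass
class Replica:
    name: str
    queue_len: int = 0           # requests waiting (what join-shortest-queue sees)
    kv_tokens: int = 0           # resident KV-cache tokens (what it ignores)
    kv_capacity: int = 200_000   # illustrative per-GPU token budget

_rr = cycle([0, 1, 2])           # fixed to the three replicas below

def round_robin(replicas, _req):
    # Classical policy: ignores both queue depth and memory state.
    return replicas[next(_rr)]

def join_shortest_queue(replicas, _req):
    # Classical policy: balances request counts, not KV memory.
    return min(replicas, key=lambda r: r.queue_len)

def kv_aware(replicas, req):
    # Illustrative LLM-aware rule: route to the replica with the most projected
    # KV headroom, using prompt length plus a (necessarily uncertain) output estimate.
    projected = req["prompt_tokens"] + req["expected_output_tokens"]
    feasible = [r for r in replicas if r.kv_tokens + projected <= r.kv_capacity]
    pool = feasible or replicas
    return max(pool, key=lambda r: r.kv_capacity - r.kv_tokens - projected)

replicas = [Replica("gpu0", 2, 180_000), Replica("gpu1", 5, 40_000), Replica("gpu2", 3, 90_000)]
req = {"prompt_tokens": 4_000, "expected_output_tokens": 1_000}
print(join_shortest_queue(replicas, req).name)  # gpu0: shortest queue, but almost out of KV memory
print(kv_aware(replicas, req).name)             # gpu1: most projected headroom
```

The point is not this particular rule but that the routing decision consumes a model of memory dynamics rather than a queue length alone.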

What carries the argument

Mathematical models of LLM-specific traits, such as the dynamically growing KV cache and prefill-decode asymmetry, that support algorithms with provable performance guarantees in place of general-purpose heuristics.
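
As a toy rendering of what such a model has to track (my sketch, not the paper's formalism; the token budget and request tuples are made up): KV memory grows by one token per decode step on top of the prompt, so a batch that fits at admission time can overflow later, which is exactly where admission-time heuristics misbehave.

```python
def kv_tokens(prompt_len: int, decode_steps: int) -> int:
    # The KV cache stores one entry per prompt or generated token,
    # so per-request memory grows by one token every decode step.
    return prompt_len + decode_steps

def batch_fits(requests, step: int, budget_tokens: int) -> bool:
    # A model of continuous batching has to keep this constraint at every
    # future step, not just at admission, because output lengths are unknown.
    return sum(kv_tokens(p, min(step, out)) for p, out in requests) <= budget_tokens

# (prompt_len, eventual_output_len); the scheduler does not know the second value.
reqs = [(2_000, 800), (6_000, 100), (500, 3_000)]
print(batch_fits(reqs, step=0,     budget_tokens=10_000))  # True: fits at admission
print(batch_fits(reqs, step=2_000, budget_tokens=10_000))  # False: decoding has grown the caches
```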

If this is right

  • Request routing and scheduling policies can be replaced by algorithms that guarantee performance bounds across varying model sizes and request patterns.
  • KV cache management can move beyond LRU to eviction rules derived from optimization objectives that account for token generation dynamics (see the eviction sketch after this list).
  • Continuous batching decisions can incorporate provable analysis of prefill versus decode costs rather than ad-hoc rules.
  • The community can treat LLM serving as an area for rigorous algorithmic research rather than solely systems engineering.
  • Early intersections of operations research and ML systems already demonstrate that such principled methods are feasible.
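
A minimal sketch of the eviction point above, assuming a hypothetical block descriptor with reuse-probability and recompute-cost estimates; none of this is from the paper. LRU ranks blocks by recency, while an objective-derived rule ranks them by expected prefill compute saved per byte retained.

```python
def lru_victim(blocks):
    # Classical policy: evict the least recently used KV block.
    return min(blocks, key=lambda b: b["last_used"])

def value_density_victim(blocks):
    # Objective-derived rule (sketch): evict the block with the lowest expected
    # recompute savings per byte retained. The reuse-probability and prefill-cost
    # fields are hypothetical estimates a real system would have to supply.
    return min(blocks, key=lambda b: b["reuse_prob"] * b["prefill_cost_ms"] / b["bytes"])

blocks = [
    {"id": "shared-system-prompt", "last_used": 100.0, "reuse_prob": 0.90, "prefill_cost_ms": 400.0, "bytes": 4e6},
    {"id": "one-off-long-doc",     "last_used": 200.0, "reuse_prob": 0.05, "prefill_cost_ms": 900.0, "bytes": 64e6},
]
print(lru_victim(blocks)["id"])            # shared-system-prompt: oldest, despite being hot and cheap to keep
print(value_density_victim(blocks)["id"])  # one-off-long-doc: little expected value per byte
```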

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • These models could enable better resource provisioning in multi-tenant clusters by quantifying trade-offs between batch size and memory growth.
  • Integration with existing queueing theory might yield new bounds on tail latency for variable-length outputs (a toy queueing simulation follows this list).
  • Empirical validation on public traces from different model families would help test whether the guarantees translate outside controlled settings.
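
A toy, non-authoritative queueing experiment in that spirit: a non-preemptive single server with Poisson arrivals, heavy-tailed service times standing in for unknown decode lengths, and a noisy length predictor. The policies and parameters are invented for illustration; the harness only shows how one would probe tail behavior empirically, not what the bounds are.

```python
import random

def simulate(jobs, pick):
    """Non-preemptive single server. jobs: (arrival, service, predicted_service).
    pick(waiting) returns an index into the waiting list. Returns sojourn times."""
    jobs = sorted(jobs)
    t, i, waiting, sojourn = 0.0, 0, [], []
    while i < len(jobs) or waiting:
        if not waiting:
            t = max(t, jobs[i][0])                     # idle until the next arrival
        while i < len(jobs) and jobs[i][0] <= t:
            waiting.append(jobs[i]); i += 1
        arrival, service, _ = waiting.pop(pick(waiting))
        t += service
        sojourn.append(t - arrival)
    return sojourn

random.seed(0)
jobs, t = [], 0.0
for _ in range(5_000):
    t += random.expovariate(1.0)                       # Poisson arrivals
    service = 0.4 * random.paretovariate(2.0)          # heavy-tailed "decode length"
    jobs.append((t, service, service * random.uniform(0.5, 1.5)))  # noisy predictor

fcfs = lambda w: 0                                              # waiting list is kept in arrival order
spjf = lambda w: min(range(len(w)), key=lambda k: w[k][2])      # shortest predicted job first

for name, pick in [("FCFS", fcfs), ("SPJF", spjf)]:
    s = sorted(simulate(jobs, pick))
    print(name, "mean", round(sum(s) / len(s), 2), "p99", round(s[int(0.99 * len(s))], 2))
```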

Load-bearing premise

That mathematical models capturing LLM characteristics like growing KV cache and prefill-decode asymmetry can be developed to produce algorithms with guarantees that match or exceed current heuristics in practice.

What would settle it

A set of production workloads where every algorithm derived from such mathematical models underperforms standard heuristics on both latency and throughput metrics without exception.

Figures

Figures reproduced from arXiv: 2605.01280 by Zijie Zhou.

Figure 1
Figure 1. Under expert parallelism, imbalanced token routing causes straggler GPUs that delay synchronization. Lightly loaded GPUs must idle while waiting for heavily loaded GPUs to complete before all-to-all communication. Current approaches to MoE load balancing rely primarily on heuristics developed during training. The most common strategy, introduced in GShard (Lepikhin et al., 2020) and refined in Switch Trans… view at source ↗
Figure 2
Figure 2. Under DP with internal EP, each worker computes attention locally before synchronizing for EP all-to-all communication. Workers with larger KV caches (longer sequences) take longer, forcing lightly loaded workers to idle at the sync barrier. …computing systems. However, they do not account for the specific structure of LLM decode workloads. First, the workload per request, determined by decode length, is unk… view at source ↗
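
The straggler effect in both figures reduces to a simple accounting identity: a synchronized step costs the maximum per-GPU work, so everything below that maximum is idle capacity. A small sketch with made-up numbers (not from the paper):

```python
def step_idle_fraction(per_gpu_work_ms):
    # Before the all-to-all, every GPU waits for the slowest one,
    # so the step takes max(work) and the remaining capacity idles.
    slowest = max(per_gpu_work_ms)
    busy = sum(per_gpu_work_ms)
    return 1.0 - busy / (slowest * len(per_gpu_work_ms))

# Illustrative per-step attention/expert work on 4 GPUs (ms).
skewed   = [12.0, 3.0, 4.0, 5.0]   # one straggler with long sequences or hot experts
balanced = [6.0, 6.0, 6.0, 6.0]

print(round(step_idle_fraction(skewed), 2))    # 0.5: half the GPU-time is wasted at the barrier
print(round(step_idle_fraction(balanced), 2))  # 0.0
```
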
read the original abstract

This position paper argues that LLM inference serving has outgrown generic heuristics and now demands mathematical optimization and algorithmic foundations. Despite rapid advances in serving systems such as vLLM and SGLang, their algorithmic cores remain largely unchanged from classical distributed computing: request routing uses join-shortest-queue or round-robin, scheduling defaults to FIFO, and KV cache eviction follows LRU. These general-purpose policies ignore the distinctive structure of LLM inference--dynamically growing KV cache memory, prefill-decode phase asymmetry, unknown output lengths, and continuous batching constraints. We contend that the field must develop mathematical models capturing these characteristics, enabling the design of algorithms with provable performance guarantees across diverse workloads, rather than heuristics that may succeed in some scenarios but fail unpredictably in others. Emerging work at the intersection of operations research and ML systems demonstrates that principled methods can match or exceed heuristic performance while providing theoretical guarantees. We call on the community to recognize algorithmic design for LLM serving as a research frontier.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

0 major / 0 minor

Summary. The paper is a position piece asserting that current LLM inference serving systems rely on classical heuristics (e.g., join-shortest-queue routing, FIFO scheduling, LRU KV cache eviction) that do not account for LLM-specific features such as dynamically growing KV caches, prefill-decode asymmetry, unknown output lengths, and continuous batching. It advocates for the creation of mathematical models and algorithms offering provable performance guarantees, referencing promising work at the operations research and ML systems intersection, and positions algorithmic foundations for LLM serving as an important research direction.

Significance. Should the community adopt this perspective, it would encourage the development of more reliable and theoretically grounded serving algorithms, potentially leading to improved efficiency and predictability in LLM deployments. The paper's identification of the mismatch between generic policies and LLM traits provides a clear motivation for shifting from heuristics to principled methods, which could yield algorithms with guarantees that generalize better across workloads.

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for their positive review and recommendation to accept the manuscript. The referee's summary accurately reflects our core argument that LLM serving systems require mathematical optimization and algorithms with provable guarantees, rather than generic heuristics that overlook LLM-specific characteristics such as dynamic KV cache growth and prefill-decode asymmetry.

Circularity Check

0 steps flagged

No significant circularity; position paper with no derivations

full rationale

This position paper advocates for mathematical optimization in LLM serving by contrasting classical heuristics (JSQ, FIFO, LRU) with LLM-specific traits (growing KV cache, prefill-decode asymmetry, unknown output lengths, continuous batching) and by calling for future models with provable guarantees. It presents no equations, algorithms, fitted parameters, or derivations of its own; the sole empirical reference to emerging OR/ML-systems work is offered as motivation rather than as a load-bearing result. There are no self-citations, self-definitional steps, or reductions to its own inputs, so the argument rests on external observation and is not circular.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The position rests on the domain assumption that classical heuristics are mismatched to LLM inference structure and that mathematical models can deliver superior guaranteed performance.

axioms (1)
  • domain assumption LLM inference exhibits distinctive structure including dynamically growing KV cache memory, prefill-decode phase asymmetry, unknown output lengths, and continuous batching constraints.
    Invoked directly in the abstract as the reason general-purpose policies are inadequate.

pith-pipeline@v0.9.0 · 5465 in / 1072 out tokens · 45534 ms · 2026-05-09T18:50:10.352337+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

70 extracted references · 19 canonical work pages · 6 internal anchors

  1. [1] P. Langley. In Proceedings of the 17th International Conference on Machine Learning (ICML 2000), 2000.

  2. [2] T. M. Mitchell. The Need for Biases in Learning Generalizations. 1980.

  3. [3] M. J. Kearns.

  4. [4] Machine Learning: An Artificial Intelligence Approach, Vol. I. 1983.

  5. [5] R. O. Duda, P. E. Hart, and D. G. Stork. Pattern Classification. 2000.

  6. [6] Suppressed for anonymity.

  7. [7] A. Newell and P. S. Rosenbloom. Mechanisms of Skill Acquisition and the Law of Practice. In Cognitive Skills and Their Acquisition, 1981.

  8. [8] A. L. Samuel. Some Studies in Machine Learning Using the Game of Checkers. IBM Journal of Research and Development, 1959.

  9. [9] DistServe: Disaggregating prefill and decoding for goodput-optimized large language model serving. In 18th USENIX Symposium on Operating Systems Design and Implementation (OSDI 24).

  10. [10] Splitwise: Efficient generative LLM inference using phase splitting. In 2024 ACM/IEEE 51st Annual International Symposium on Computer Architecture (ISCA), 2024.

  11. [11] Step-3 is large yet affordable: Model-system co-design for cost-effective decoding. arXiv preprint arXiv:2507.19427, 2025.

  12. [12] MegaScale-Infer: Serving Mixture-of-Experts at Scale with Disaggregated Expert Parallelism. arXiv preprint arXiv:2504.02263.

  13. [13] Serving Large Language Models on Huawei CloudMatrix384. arXiv preprint arXiv:2506.12708.

  14. [14] Janus: Disaggregating Attention and Experts for Scalable MoE Inference. arXiv preprint arXiv:2512.13525.

  15. [15] Language models are few-shot learners. In Advances in Neural Information Processing Systems.

  16. [16] PaLM: Scaling language modeling with pathways. Journal of Machine Learning Research.

  17. [17] Scaling laws for neural language models. arXiv preprint arXiv:2001.08361.

  18. [18] J. Achiam, S. Adler, S. Agarwal, L. Ahmad, I. Akkaya, F. L. Aleman, D. Almeida, J. Altenschmidt, S. Altman, S. Anadkat, et al.

  19. [19] W. Kwon, Z. Li, S. Zhuang, Y. Sheng, L. Zheng, C. H. Yu, J. Gonzalez, H. Zhang, and I. Stoica. Efficient memory management for large language model serving with PagedAttention.

  20. [20] Orca: A distributed serving system for Transformer-based generative models. In 16th USENIX Symposium on Operating Systems Design and Implementation (OSDI 22).

  21. [21] DeepSeek-V3 Technical Report. arXiv preprint arXiv:2412.19437.

  22. [22] LPLB: Linear-Programming-Based Load Balancer for Mixture-of-Experts Models. 2025.

  23. [23] Expert Parallelism Load Balancer (EPLB). 2025.

  24. [24] How much energy does ChatGPT use? https://epoch.ai/gradient-updates/how-much-energy-does-chatgpt-use

  25. [25] OpenAI says ChatGPT users send over 2.5 billion prompts every day. https://www.theverge.com/news/710867/openai-chatgpt-daily-prompts-2-billionutmsource=chatgpt.com

  26. [26] AI could account for nearly half of datacentre power usage by end of year. https://www.theguardian.com/environment/2025/may/22/ai-data-centre-power-consumption

  27. [27] An Inquiry into Datacenter TCO for LLM Inference with FP8. arXiv preprint arXiv:2502.01070.

  28. [28] Interior-point methods. In Linear and Nonlinear Programming, 2021.

  29. [29] Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer. arXiv preprint arXiv:1701.06538.

  30. [30] Switch Transformers: Scaling to trillion parameter models with simple and efficient sparsity. Journal of Machine Learning Research.

  31. [31] GShard: Scaling Giant Models with Conditional Computation and Automatic Sharding. arXiv preprint arXiv:2006.16668.

  32. [32] Auxiliary-loss-free load balancing strategy for mixture-of-experts. arXiv preprint arXiv:2408.15664.

  33. [33] Mixture-of-experts with expert choice routing. In Advances in Neural Information Processing Systems.

  34. [34] vLLM Router: A high-performance and light-weight router for vLLM large scale deployment. 2025.

  35. [35] SGLang: Efficient execution of structured language model programs. In Advances in Neural Information Processing Systems.

  36. [36] The power of two choices in randomized load balancing. IEEE Transactions on Parallel and Distributed Systems, 2002.

  37. [37] Consistent hashing and random trees: Distributed caching protocols for relieving hot spots on the World Wide Web. In Proceedings of the Twenty-Ninth Annual ACM Symposium on Theory of Computing.

  38. [38] Anonymous. A Queueing-Theoretic Framework for Stability Analysis of … 2025.

  39. [39] DynamoLLM: Designing LLM inference clusters for performance and energy efficiency. In 2025 IEEE International Symposium on High Performance Computer Architecture (HPCA), 2025.

  40. [40] SpotServe: Serving generative large language models on preemptible instances. In Proceedings of the 29th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 2.

  41. [41] Llumnix: Dynamic scheduling for large language model serving. In 18th USENIX Symposium on Operating Systems Design and Implementation (OSDI 24).

  42. [42] ServerlessLLM: Low-latency serverless inference for large language models. In 18th USENIX Symposium on Operating Systems Design and Implementation (OSDI 24).

  43. [43] vLLM-Omni: A Framework for Omni-Modality Model Inference and Serving. 2025.

  44. [44] ModServe: Modality- and Stage-Aware Resource Disaggregation for Scalable Multimodal Model Serving. arXiv preprint arXiv:2502.00937, 2025.

  45. [45] ElasticMM: Efficient multimodal LLMs serving with elastic multimodal parallelism. arXiv preprint arXiv:2507.10069.

  46. [46] Cornserve: Efficiently Serving Any-to-Any Multimodal Models. arXiv preprint arXiv:2512.14098.

  47. [47] vLLM V1: Accelerating Multimodal Inference for Large Language Models. 2025.

  48. [48] Towards optimal caching and model selection for large model inference. In Advances in Neural Information Processing Systems.

  49. [49] A Universal Load Balancing Principle and Its Application to Large Language Model Serving. arXiv preprint arXiv:2601.17855.

  50. [50] Optimizing LLM Inference: Fluid-Guided Online Scheduling with Memory Constraints. arXiv preprint arXiv:2504.11320.

  51. [51] Don't Stop Me Now: Embedding Based Scheduling for LLMs. arXiv preprint arXiv:2410.01035.

  52. [52] Online Scheduling for LLM Inference with KV Cache Constraints. arXiv preprint arXiv:2502.07115.

  53. [53] LLM serving optimization with variable prefill and decode lengths. arXiv preprint arXiv:2508.06133.

  54. [54] Adaptively robust LLM inference optimization under prediction uncertainty. arXiv preprint arXiv:2508.14544.

  55. [55] Queueing, predictions, and large language models: Challenges and open problems. Stochastic Systems, 2025.

  56. [56] Scheduling a batching machine. Journal of Scheduling, 1998.

  57. [57] A review of machine scheduling: Complexity, algorithms and approximability. In Handbook of Combinatorial Optimization, Volumes 1-3, 1998.

  58. [58] Scheduling. 2012.

  59. [59] Simple and fast algorithm for binary integer and online linear programming. In Advances in Neural Information Processing Systems.

  60. [60] A dynamic near-optimal algorithm for online linear programming. Operations Research, 2014.

  61. [61] Amortized efficiency of list update and paging rules. Communications of the ACM, 1985.

  62. [62] Competitive paging algorithms. Journal of Algorithms, 1991.

  63. [63] Competitive caching with machine learned advice. Journal of the ACM (JACM), 2021.

  64. [64] Load balancing on Cloud Analyst using first come first serve scheduling algorithm. In International Conference on Intelligent Networking and Collaborative Systems, 2018.

  65. [65] Design and implementation of First-In-First-Out (FIFO) buffer for distributed load balancing systems. In International Conference on Smart Computing and Communication, 2024.

  66. [66] Load balancing in cloud through task scheduling. In Recent Trends in Communication and Intelligent Systems: Proceedings of ICRTCIS 2019, 2020.

  67. [67] Yield management at American Airlines. Interfaces, 1992.

  68. [68] Airline network seat inventory control: Methodologies and revenue impacts. 1992.

  69. [69] New algorithms for bin packing. Journal of the ACM (JACM), 1980.

  70. [70] Approximation algorithms for bin-packing: An updated survey. In Algorithm Design for Computer System Design, 1984.