A Queueing-Theoretic Framework for Stability Analysis of LLM Inference with KV Cache Memory Constraints
Pith reviewed 2026-05-08 17:23 UTC · model grok-4.3
The pith
A queueing model with KV cache memory states yields exact stability conditions for LLM inference services to avoid unbounded queue growth.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
We introduce the first queueing-theoretic framework that explicitly incorporates both computation and GPU memory constraints into the analysis of LLM inference. Based on this framework, we derive rigorous stability and instability conditions that determine whether an LLM inference service can sustain incoming demand without unbounded queue growth. This result offers a powerful tool for system deployment, potentially addressing the core challenge of GPU provisioning. By combining an estimated request arrival rate with our derived stable service rate, operators can calculate the necessary cluster size to avoid both costly over-purchasing and performance-violating under-provisioning.
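The provisioning rule described in this claim can be sketched numerically. The function below is an illustrative reading, not the paper's formula; the function name and the headroom parameter are assumptions:

```python
import math

def required_gpus(arrival_rate, stable_rate_per_gpu, headroom=0.9):
    """Smallest cluster size n with arrival_rate < n * stable_rate_per_gpu.

    arrival_rate: estimated request arrival rate (requests/sec).
    stable_rate_per_gpu: the derived stable service rate of one GPU.
    headroom: keeps utilization strictly below the critical point.
    """
    if arrival_rate <= 0:
        return 0
    # Stability requires lambda < n * mu; headroom keeps rho = lambda/(n*mu) < 1.
    return math.ceil(arrival_rate / (stable_rate_per_gpu * headroom))
```

For example, 120 req/s against a stable rate of 8 req/s per GPU with 10% headroom yields ceil(120 / 7.2) = 17 GPUs.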
What carries the argument
A queueing process whose state tracks the number of active requests together with their cumulative KV cache memory occupancy, producing stability criteria via comparison of arrival rate against an effective service rate that shrinks under memory pressure.
Load-bearing premise
The queueing model assumptions about arrival processes, service time distributions, and memory consumption dynamics accurately represent real LLM inference workloads on production GPUs.
What would settle it
Run an LLM inference server at the exact critical arrival rate predicted by the model and measure whether average queue length stays bounded over long time horizons or grows without limit.
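That experiment can be run in miniature with a state-dependent queue simulation. The sketch below is not the paper's model: the linear memory-degradation law is an illustrative assumption standing in for the memory-adjusted effective service rate.

```python
import random

def simulate_queue(arrival_rate, base_service_rate, mem_capacity, mem_per_req,
                   horizon=200_000, seed=0):
    """Time-averaged queue length of a single-server Markovian queue whose
    service rate shrinks as KV-cache occupancy nears capacity.

    The linear degradation law is an illustrative assumption, not the
    paper's exact effective-rate function.
    """
    rng = random.Random(seed)
    t, q, area = 0.0, 0, 0.0
    while t < horizon:
        occupied = min(q * mem_per_req, mem_capacity)
        mu = base_service_rate * (1.0 - 0.5 * occupied / mem_capacity)
        total_rate = arrival_rate + (mu if q > 0 else 0.0)
        dt = rng.expovariate(total_rate)
        area += q * dt
        t += dt
        if rng.random() < arrival_rate / total_rate:
            q += 1  # arrival
        else:
            q -= 1  # service completion (only reachable when q > 0)
    return area / t
```

Run just below the model's critical rate, the average queue length stays small over long horizons; run above it, the average grows roughly linearly with the horizon, which is exactly the bounded-vs-unbounded dichotomy the stability condition predicts.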
Original abstract
The rapid adoption of large language models (LLMs) has created significant challenges for efficient inference at scale. Unlike traditional workloads, LLM inference is constrained by both computation and the memory overhead of key-value (KV) caching, which accelerates decoding but quickly exhausts GPU memory. In this paper, we introduce the first queueing-theoretic framework that explicitly incorporates both computation and GPU memory constraints into the analysis of LLM inference. Based on this framework, we derive rigorous stability and instability conditions that determine whether an LLM inference service can sustain incoming demand without unbounded queue growth. This result offers a powerful tool for system deployment, potentially addressing the core challenge of GPU provisioning. By combining an estimated request arrival rate with our derived stable service rate, operators can calculate the necessary cluster size to avoid both costly over-purchasing and performance-violating under-provisioning. We further validate our theoretical predictions through extensive experiments in real GPU production environments. Our results show that the predicted stability conditions are highly accurate, with deviations typically within 10%.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces the first queueing-theoretic framework for LLM inference that jointly models computational and KV-cache memory constraints. It derives rigorous stability and instability conditions determining whether an inference service can sustain a given arrival rate without unbounded queue growth, and validates the predictions via experiments on real GPU production hardware showing typical deviations within 10%.
Significance. If the derived conditions are predictive, the work supplies a practical provisioning rule: combine an observed arrival rate with the paper's stable service rate to compute the minimum cluster size that avoids both over-provisioning and SLA violations. The explicit experimental validation in production environments is a clear strength and distinguishes the contribution from purely theoretical queueing analyses.
major comments (2)
- §3 (Queueing Model): The stability conditions rest on the assumption that effective service rate remains stationary once memory occupancy is accounted for. In continuous-batching systems with paged KV cache, prefix sharing, and eviction, service time and memory footprint become history-dependent; the manuscript does not demonstrate that the Markovian ergodicity conditions used in the derivation continue to hold under these dynamics.
- §5 (Experimental Validation): The reported deviations are typically within 10%, yet the text provides no breakdown of how arrival processes, sequence-length distributions, or KV-cache eviction policies were matched between the model and the production traces. Without this mapping, it is unclear whether the agreement validates the core stability criterion or merely confirms a post-hoc fit.
minor comments (2)
- Notation for the memory-constrained service rate is introduced without an explicit equation number in the main text; cross-referencing would improve readability.
- The abstract claims the framework is 'the first'; a brief related-work paragraph citing prior queueing models for inference (even if they omit memory) would strengthen the novelty statement.
Simulated Author's Rebuttal
We thank the referee for the constructive comments on our paper. We address each major comment below and describe the revisions we plan to make to strengthen the manuscript.
Point-by-point responses
Referee: §3 (Queueing Model): The stability conditions rest on the assumption that effective service rate remains stationary once memory occupancy is accounted for. In continuous-batching systems with paged KV cache, prefix sharing, and eviction, service time and memory footprint become history-dependent; the manuscript does not demonstrate that the Markovian ergodicity conditions used in the derivation continue to hold under these dynamics.
Authors: The referee correctly identifies a key modeling assumption. Our framework models the system as a Markov chain whose state includes current memory occupancy, with the service rate a function of that state. While prefix sharing and eviction introduce history dependence, for stability analysis we use the long-run average effective rate, which preserves ergodicity under standard conditions for state-dependent queues. We did not prove ergodicity for every possible eviction policy, as the paper focuses on the stability conditions derived from the model. In the revision, we will add a subsection to §3 discussing the validity of the Markovian assumption under common KV-cache management policies and noting its limitations; the core derivation remains unchanged. Revision: partial.
Referee: §5 (Experimental Validation): The reported deviations are typically within 10%, yet the text provides no breakdown of how arrival processes, sequence-length distributions, or KV-cache eviction policies were matched between the model and the production traces. Without this mapping, it is unclear whether the agreement validates the core stability criterion or merely confirms a post-hoc fit.
Authors: We concur that a more explicit mapping is required to interpret the validation results. In the revised manuscript, we will augment §5 with a comprehensive description of the experimental setup, including how arrival processes (trace-driven or synthetic Poisson), sequence-length distributions (empirical, drawn from production data), and KV-cache eviction policies (modeled after paged attention with LRU eviction) were aligned with the production traces. This addition will clarify that the observed agreement supports the predictive power of the stability conditions rather than a post-hoc fit. Revision: yes.
Circularity Check
No circularity: standard queueing derivation applied to KV-cache constraints
full rationale
The paper introduces a queueing model that incorporates both compute and KV-cache memory constraints, then derives stability conditions (arrival rate below effective service rate under memory limits) using standard Markovian analysis. No step reduces a prediction to a fitted input by construction, nor does any load-bearing claim rest on self-citation of an unverified uniqueness theorem. The stability result is obtained from the model equations rather than being presupposed; experimental validation is presented as separate confirmation rather than part of the derivation. The central claim therefore remains independent of its inputs.
Axiom & Free-Parameter Ledger
axioms (2)
- domain assumption: Requests arrive according to a Poisson process, and service times follow a distribution compatible with queueing stability analysis.
- domain assumption: KV cache memory usage can be modeled as an additional capacity constraint that interacts with the compute service rate.
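As a concrete reading of these two assumptions, one can estimate the Poisson rate from an arrival trace and compare it against a memory-adjusted service rate. All helper names below are illustrative, and the linear memory penalty is a stand-in for the paper's effective-rate function:

```python
def estimate_arrival_rate(timestamps):
    """MLE of a Poisson rate: arrivals per unit time over the observed window."""
    if len(timestamps) < 2:
        raise ValueError("need at least two arrival timestamps")
    return (len(timestamps) - 1) / (timestamps[-1] - timestamps[0])

def effective_service_rate(base_rate, kv_bytes_per_req, mem_capacity_bytes,
                           concurrency):
    """Compute-bound rate scaled down as aggregate KV cache nears capacity.

    The linear penalty is an illustrative stand-in for the paper's
    memory-adjusted rate.
    """
    occupancy = min(concurrency * kv_bytes_per_req / mem_capacity_bytes, 1.0)
    return base_rate * (1.0 - 0.5 * occupancy)

def is_stable(arrival_rate, service_rate):
    # Stability criterion of the form lambda < mu_eff.
    return arrival_rate < service_rate
```

Under these assumptions, the provisioning question reduces to checking `is_stable(estimate_arrival_rate(trace), effective_service_rate(...))` per GPU.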