WarmServe: Enabling One-for-Many GPU Prewarming for Multi-LLM Serving

Chen Sun; Chiheng Lou; Pengcheng Wang; Rui Kang; Sheng Qi; Xin Jin; Xuanzhe Liu; Yong Zhang

arxiv: 2512.09472 · v2 · pith:VHHD5NGWnew · submitted 2025-12-10 · 💻 cs.DC · cs.LG

Pith reviewed 2026-05-22 12:18 UTC · model grok-4.3

keywords: multi-LLM serving, GPU prewarming, workload forecasting, time-to-first-token, KV cache management, model placement, inference performance

The pith

WarmServe preloads multiple LLM model weights on shared GPUs using workload forecasts to enable fast instance startup during bursts.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Current multi-LLM serving systems share GPUs to raise utilization but incur high tail time-to-first-token (TTFT) delays because they react only after requests arrive. The paper notes that real workloads display strong periodicity and long-term predictability, which can be used to forecast demand. WarmServe therefore performs one-for-many prewarming: it proactively loads parameters from several models onto GPUs ahead of time. Three supporting mechanisms place models to limit cross-model interference, repurpose idle KV cache space on running GPUs, and switch GPU memory efficiently. If these steps work, clusters can run more models at once while keeping first-token latency low and throughput high.

Core claim

One-for-many GPU prewarming proactively loads parameters from multiple models onto GPUs based on workload forecasts; these prewarmed weights let the system instantiate serving instances promptly when request bursts occur. WarmServe realizes this idea through a model placement algorithm that minimizes cross-model interference, a KV cache reservation strategy that uses idle space on active GPUs, and an efficient GPU memory switching mechanism for tensor management.

What carries the argument

one-for-many GPU prewarming: proactively loading parameters from multiple models onto GPUs based on workload forecasts to support quick instance creation during bursts.
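
As a rough illustration of the idea, the sketch below picks which models to prewarm onto one GPU from a per-model load forecast. It is not WarmServe's placement algorithm: the greedy "forecast shortfall per GiB" policy, the `ModelForecast` fields, and the model names and sizes are all hypothetical, chosen only to show how a forecast can drive a prewarming decision under a memory budget.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelForecast:
    name: str
    weight_gib: float          # GPU memory needed to hold the weights (assumed > 0)
    predicted_peak_rps: float  # forecast peak load for the next window
    served_rps: float          # load that running instances already absorb


def pick_models_to_prewarm(forecasts, free_gib):
    """Greedy, hypothetical policy: largest forecast shortfall per GiB first."""
    # Only models whose forecast exceeds current capacity are worth prewarming.
    candidates = [f for f in forecasts if f.predicted_peak_rps > f.served_rps]
    candidates.sort(
        key=lambda f: (f.predicted_peak_rps - f.served_rps) / f.weight_gib,
        reverse=True,
    )
    chosen, remaining = [], free_gib
    for f in candidates:
        if f.weight_gib <= remaining:  # weights must fit in the free memory
            chosen.append(f.name)
            remaining -= f.weight_gib
    return chosen


# Illustrative numbers only; none of these come from the paper.
forecasts = [
    ModelForecast("model-a", weight_gib=16.0, predicted_peak_rps=30.0, served_rps=10.0),
    ModelForecast("model-b", weight_gib=40.0, predicted_peak_rps=25.0, served_rps=20.0),
    ModelForecast("model-c", weight_gib=14.0, predicted_peak_rps=5.0, served_rps=8.0),
    ModelForecast("model-d", weight_gib=26.0, predicted_peak_rps=12.0, served_rps=2.0),
]
print(pick_models_to_prewarm(forecasts, free_gib=48.0))  # ['model-a', 'model-d']
```

Several models end up resident on one GPU, which is the "one-for-many" part; the paper's actual placement additionally accounts for cross-model interference, which this sketch ignores.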

If this is right

Reduces tail TTFT by up to 50.8× compared to the state-of-the-art autoscaling-based system.
Supports up to 2.5× higher request throughput than the GPU-sharing system.
Minimizes cross-model prewarming interference through an optimized model placement algorithm.
Repurposes idle KV cache space on running GPUs for prewarming new models without extra hardware.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same forecast-driven prewarming pattern could be tested on other bursty multi-tenant workloads such as video encoding or database query serving.
Accurate long-term prediction becomes the new bottleneck; systems may need lightweight rollback when forecasts prove wrong.
Cloud operators could use this method to increase model density per GPU cluster while still meeting strict latency targets.

Load-bearing premise

Real-world LLM serving workloads exhibit strong periodicity and long-term predictability that can be leveraged for effective proactive prewarming decisions.

What would settle it

A production multi-LLM workload trace that shows no periodicity or predictability, causing WarmServe's forecast-driven prewarming to miss actual bursts and produce no improvement in tail TTFT.
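
One simple way to probe a trace for the periodicity this test depends on is to measure its autocorrelation at a candidate period. The sketch below does that on a synthetic trace; the daily period, the synthetic sine-shaped load, and the function name are assumptions for illustration (the 5-minute window size matches the Figure 2 caption), and this is not the paper's predictor.

```python
import math


def autocorrelation(series, lag):
    """Sample autocorrelation of `series` at `lag` (0 < lag < len(series))."""
    n = len(series)
    if not 0 < lag < n:
        raise ValueError("lag must be between 1 and len(series) - 1")
    mean = sum(series) / n
    var = sum((x - mean) ** 2 for x in series)
    if var == 0:
        return 0.0  # a constant series carries no periodic signal
    cov = sum((series[i] - mean) * (series[i + lag] - mean) for i in range(n - lag))
    return cov / var


# Synthetic stand-in for a request-rate trace: 7 "days" of 288 five-minute windows.
WINDOWS_PER_DAY = 288
trace = [
    100 + 50 * math.sin(2 * math.pi * t / WINDOWS_PER_DAY)
    for t in range(7 * WINDOWS_PER_DAY)
]

daily = autocorrelation(trace, WINDOWS_PER_DAY)
half_day = autocorrelation(trace, WINDOWS_PER_DAY // 2)
print(f"lag = 1 day: {daily:.2f}, lag = half a day: {half_day:.2f}")
```

A strongly periodic trace shows a high positive value at the true period and a low or negative value at off-period lags; a trace with values near zero at every plausible period would be the kind of counterexample described above.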

Figures

Figures reproduced from arXiv: 2512.09472 by Chen Sun, Chiheng Lou, Pengcheng Wang, Rui Kang, Sheng Qi, Xin Jin, Xuanzhe Liu, Yong Zhang.

**Figure 2.** Real and predicted peak loads under 5-minute windows.

**Figure 3.** Example of cluster-wide prewarming interference.

**Figure 4.** WarmServe system overview.

… manager to control a pool of GPUs (i.e., GPU workers), categorized as idle, universal, or dedicated. The global manager dynamically loads or evicts model weights on these GPUs, adapting their roles in different scenarios. Specifically, idle GPU workers transition into universal workers after prewarming several LLMs. Universal workers, in turn, become dedicated GPU workers when…

**Figure 5.** Overview of GPU worker lifecycle in WarmServe.
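
The worker roles named in the Figure 4 excerpt (idle, universal, dedicated) can be read as a small state machine. The sketch below encodes only the two transitions the excerpt states. The condition that triggers the universal-to-dedicated step is cut off in the excerpt, so here it is simply an explicit call made by the caller; the class and method names are hypothetical.

```python
from enum import Enum


class Role(Enum):
    IDLE = "idle"
    UNIVERSAL = "universal"
    DEDICATED = "dedicated"


class GpuWorker:
    """Hypothetical worker record tracking its role and resident models."""

    def __init__(self):
        self.role = Role.IDLE
        self.prewarmed = []  # models whose weights are resident on this GPU
        self.serving = None  # model this worker is dedicated to, if any

    def prewarm(self, models):
        # idle -> universal: happens after prewarming several LLMs.
        if self.role is not Role.IDLE:
            raise RuntimeError(f"cannot prewarm from role {self.role.value}")
        if not models:
            raise ValueError("need at least one model to prewarm")
        self.prewarmed = list(models)
        self.role = Role.UNIVERSAL

    def dedicate(self, model):
        # universal -> dedicated: the trigger is elided in the source excerpt,
        # so the caller decides when to invoke this.
        if self.role is not Role.UNIVERSAL:
            raise RuntimeError(f"cannot dedicate from role {self.role.value}")
        if model not in self.prewarmed:
            raise ValueError(f"{model} is not prewarmed on this worker")
        self.serving = model
        self.role = Role.DEDICATED


worker = GpuWorker()
worker.prewarm(["model-a", "model-b"])
worker.dedicate("model-a")
print(worker.role.value, worker.serving)  # dedicated model-a
```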

**Figure 6.** Overview of zero-overhead memory switching in Warm

**Figure 7.** Placement guideline of WarmServe.

5.2 Evict-Aware Model Placement

As shown in §2.3, prewarming LLMs across multiple GPUs can lead to interference between colocated models. To mitigate this, WarmServe employs a placement strategy governed by two primary guidelines. The first guideline strictly prohibits partial GPU sharing. Specifically, for any two models, the set of GPUs allocated to them must either be…

**Figure 8.** Prewarming performance breakdown.

… latency for generating the first token (§7.2). In end-to-end experiments, WarmServe significantly reduces the P95 and P99 TTFT by up to 50.79× compared to the autoscaling-based system, while being capable of serving up to 2.5× more requests than the GPU-sharing system (§7.3). Our analysis of the workload predictor reveals an average relative error ranging from 5.25%–11.…

**Figure 9.** TTFT of systems in different settings. A logarithmic scale is used for the y-axis.

**Figure 10.** TTFT for models under RPS=25. A logarithmic scale is used for the y-axis.

**Figure 11.** Prewarming hit ratios in end-to-end experiments.

[Plot residue after the caption: axes "TTFT (s)" and "Percentage"; legend: No Place., No Proac., W=3, W=5, W=10, W=40]

**Figure 13.** TPOT CDF of systems for α=0.5.

[Plot residue after the caption: axes "RPS" and "P99 Latency (s)" (log scale); legend: SLLM-GPU, MuxServe, WarmServe w/o Proac., WarmServe]

**Figure 15.** TPOT CDF on AzureCode for α=0.5 and RPS=25.

… tail TTFT of systems under various scenarios. To validate the effectiveness of proactive prewarming, we also conduct experiments on WarmServe with proactive prewarming disabled. WarmServe consistently delivers low TTFT across all settings, significantly outperforming other systems that exhibit high latency, particularly under heavy loads. Compared to SLLM-GPU…

**Figure 16.** Average relative error of CSP along with the average

**Figure 17.** TPOT CDF of systems under α=2.0.

Original abstract

Deploying multiple models within shared GPU clusters is a key strategy to improve resource efficiency in large language model (LLM) serving. Existing multi-LLM serving systems improve GPU utilization at the cost of degraded inference performance, particularly time-to-first-token (TTFT). We attribute this degradation to the lack of awareness regarding future workload characteristics. In contrast, recent analyses have shown the strong periodicity and long-term predictability of real-world LLM serving workloads. In this paper, we propose one-for-many GPU prewarming, which proactively loads parameters from multiple models onto GPUs based on workload forecasts. These prewarmed weights enable the system to promptly instantiate serving instances upon encountering request bursts. We design and implement WarmServe, a multi-LLM serving system incorporating three key techniques: (1) a model placement algorithm that optimizes prewarming decisions to minimize cross-model prewarming interference, (2) a KV cache reservation strategy that repurposes idle KV cache space on running GPUs for prewarming new models, and (3) an efficient GPU memory switching mechanism for tensor management. Evaluation on real-world datasets shows that WarmServe reduces tail TTFT by up to 50.8× compared to the state-of-the-art autoscaling-based system, while supporting up to 2.5× higher request throughput than the GPU-sharing system.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

WarmServe is a working system for prewarming multiple LLMs on shared GPUs via workload forecasts. The engineering wins are clear, but the headline speedups still need tighter checks on prediction accuracy.

The main point for you is that this paper builds and measures a multi-LLM serving system that preloads weights for several models onto the same GPUs ahead of time, using the fact that real traces show periodic bursts. They report up to 50.8x lower tail TTFT versus autoscaling baselines and 2.5x higher throughput than plain GPU sharing, which would matter in production clusters where you want to keep utilization high without killing latency.

Referee Report

2 major / 2 minor

Summary. The manuscript presents WarmServe, a multi-LLM serving system that enables one-for-many GPU prewarming based on predicted workload patterns. It introduces a model placement algorithm to minimize interference, a KV cache reservation strategy, and an efficient GPU memory switching mechanism. The evaluation on real-world datasets claims significant improvements: up to 50.8× reduction in tail time-to-first-token (TTFT) compared to autoscaling-based systems and up to 2.5× higher request throughput than GPU-sharing systems.

Significance. If the performance claims hold under robust conditions, this work could advance the state of efficient multi-model inference serving by leveraging the periodicity of real-world workloads for proactive resource management. The techniques address a practical gap in current systems that lack future workload awareness, potentially leading to better GPU utilization without sacrificing latency in production environments.

major comments (2)

§5 (Evaluation) and abstract: The reported 50.8× tail TTFT reduction and 2.5× throughput gains are attributed to proactive prewarming that relies on long-term workload forecasts, yet the section provides no quantification of forecast accuracy (e.g., error rates, precision/recall of burst predictions), sensitivity analysis to mispredictions, or the fraction of prewarming decisions that were correct versus wasted. This is load-bearing for the central claim, as the abstract explicitly grounds the approach in “strong periodicity and long-term predictability.”
§4.2 (Baselines) and Table 2: The comparison to the “state-of-the-art autoscaling-based system” does not specify whether the baseline incorporates any form of workload prediction or uses purely reactive scaling; without this, it is unclear whether the measured gains arise from the proposed one-for-many prewarming or from differences in prediction assumptions between WarmServe and the baseline.

minor comments (2)

§3.1: The notation for prewarming interference could be clarified with a small example or pseudocode to show how the placement algorithm accounts for cross-model KV-cache contention.
Figure 7 (throughput vs. latency curves): The figure would benefit from error bars or multiple runs to indicate statistical variability across the real-world traces.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed and constructive comments. We address each major point below and will incorporate clarifications and additional analysis in the revised manuscript.

Referee: §5 (Evaluation) and abstract: The reported 50.8× tail TTFT reduction and 2.5× throughput gains are attributed to proactive prewarming that relies on long-term workload forecasts, yet the section provides no quantification of forecast accuracy (e.g., error rates, precision/recall of burst predictions), sensitivity analysis to mispredictions, or the fraction of prewarming decisions that were correct versus wasted. This is load-bearing for the central claim, as the abstract explicitly grounds the approach in “strong periodicity and long-term predictability.”

Authors: We agree that explicit quantification of forecast accuracy is necessary to substantiate the central claims. In the revised version we will add a dedicated subsection (new §5.4) reporting prediction accuracy on the real-world traces, including mean absolute percentage error for request-rate forecasts, precision/recall for burst detection, the fraction of prewarming decisions that proved correct versus wasted, and a sensitivity study that replays the same traces under injected forecast errors of 10–30%. These additions will directly support the “strong periodicity and long-term predictability” premise stated in the abstract. revision: yes
Referee: §4.2 (Baselines) and Table 2: The comparison to the “state-of-the-art autoscaling-based system” does not specify whether the baseline incorporates any form of workload prediction or uses purely reactive scaling; without this, it is unclear whether the measured gains arise from the proposed one-for-many prewarming or from differences in prediction assumptions between WarmServe and the baseline.

Authors: The autoscaling baseline is a purely reactive system that adjusts GPU allocation solely on the basis of instantaneous load (modeled after standard Kubernetes HPA and production LLM autoscalers) and contains no workload forecasting component. WarmServe’s measured improvements therefore derive from its proactive one-for-many prewarming rather than from any predictive advantage granted to the baseline. We will revise the first paragraph of §4.2 and add a clarifying footnote to Table 2 to state explicitly that the baseline is reactive and prediction-free. revision: yes
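
For concreteness, two of the forecast-accuracy metrics named in the first response (mean absolute percentage error and precision/recall for burst detection) could be computed along the following lines. This is a minimal sketch under stated assumptions: a burst is taken to be any window whose load meets a fixed threshold, which is one possible definition and not the paper's, and the function names and sample numbers are invented for illustration.

```python
def mape(actual, predicted):
    """Mean absolute percentage error, skipping windows with zero actual load."""
    pairs = [(a, p) for a, p in zip(actual, predicted) if a != 0]
    if not pairs:
        raise ValueError("no windows with non-zero actual load")
    return 100.0 * sum(abs(a - p) / abs(a) for a, p in pairs) / len(pairs)


def burst_precision_recall(actual, predicted, threshold):
    """Precision and recall of predicted bursts (assumed: load >= threshold)."""
    tp = fp = fn = 0
    for a, p in zip(actual, predicted):
        real_burst, forecast_burst = a >= threshold, p >= threshold
        if real_burst and forecast_burst:
            tp += 1  # burst correctly anticipated
        elif forecast_burst:
            fp += 1  # prewarming would have been wasted
        elif real_burst:
            fn += 1  # burst missed by the forecast
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Illustrative per-window request rates; not data from the paper.
actual = [10, 40, 12, 50, 8, 45]
predicted = [12, 36, 30, 55, 8, 20]

print(f"MAPE: {mape(actual, predicted):.1f}%")
precision, recall = burst_precision_recall(actual, predicted, threshold=30)
print(f"precision: {precision:.2f}, recall: {recall:.2f}")
```

In this framing, false positives correspond to wasted prewarming and false negatives to bursts that still hit a cold start, which is the split the referee asks the authors to report.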

Circularity Check

0 steps flagged

No significant circularity; empirical evaluation on external workloads

The paper describes an implemented system (WarmServe) with model placement, KV cache reservation, and GPU memory switching techniques. Its central claims are measured performance gains (tail TTFT reduction and throughput) obtained by running the system on real-world datasets and comparing against autoscaling and GPU-sharing baselines. Workload periodicity and predictability are explicitly attributed to 'recent analyses' rather than defined or fitted inside this paper. No equations, fitted parameters renamed as predictions, self-definitional steps, or load-bearing self-citations appear in the derivation chain; the results rest on external benchmarks and implementation rather than reducing to the paper's own inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The approach rests on one domain assumption about workload predictability and introduces no new mathematical free parameters or postulated physical entities; the contribution is in the engineering techniques and their integration.

axioms (1)

Domain assumption: Real-world LLM serving workloads exhibit strong periodicity and long-term predictability.
Invoked in the abstract to justify moving from reactive to proactive prewarming.

pith-pipeline@v0.9.0 · 5788 in / 1233 out tokens · 41343 ms · 2026-05-22T12:18:45.641102+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/reality_from_one_distinction · reality_from_one_distinction (8-tick period) · echoes

ECHOES: this paper passage has the same mathematical shape or conceptual pattern as the Recognition theorem, but is not a direct formal dependency.

Paper passage: "recent analysis on real-world traces has shown the high periodicity and long-term predictability of LLM serving workloads... our predictor achieves an average accuracy of 92.7%"

IndisputableMonolith/Cost/FunctionalEquation · washburn_uniqueness_aczel (J-cost uniqueness) · unclear

Relation between the paper passage and the cited Recognition theorem.

Paper passage: "one-for-many GPU prewarming... evict-aware model placement strategy... zero-overhead memory switching"

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 2 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Foundry: Template-Based CUDA Graph Context Materialization for Fast LLM Serving Cold Start
cs.DC · 2026-04 · unverdicted · novelty 6.0

Foundry uses template-based CUDA graph context materialization to reduce LLM serving cold-start latency by up to 99% while preserving CUDA graph throughput gains.

The Workload-Router-Pool Architecture for LLM Inference Optimization: A Vision Paper from the vLLM Semantic Router Project
cs.LG · 2026-03 · unverdicted · novelty 5.0

The Workload-Router-Pool architecture is a 3D framework for LLM inference optimization that synthesizes prior vLLM work into a 3x3 interaction matrix and proposes 21 research directions at the intersections.

Reference graph

Works this paper leans on

61 extracted references · 61 canonical work pages · cited by 2 Pith papers · 6 internal anchors

[1]

GPT-4 Technical Report

OpenAI, “Gpt-4 technical report,”arXiv preprint arXiv:2303.08774, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[2]

GPT-4o System Card,

“GPT-4o System Card,” 2025. https://openai.com /index/gpt-4o-system-card/

work page 2025
[3]

DeepSeek-V3 Technical Report

DeepSeek-AI, “Deepseek-v3 technical report,”arXiv preprint arXiv:2412.19437, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[4]

The Llama 3 Herd of Models

A. Grattafiori, A. Dubey, A. Jauhri, A. Pandey, A. Ka- dian, A. Al-Dahle,et al., “The llama 3 herd of models,” arXiv preprint arXiv:2407.21783, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[5]

Anthropic Claude,

“Anthropic Claude,” 2025. https://www.anthropic. com/claude

work page 2025
[6]

Qwen2.5-Coder Series: Powerful, Diverse, Practical,

“Qwen2.5-Coder Series: Powerful, Diverse, Practical,”

work page
[7]

5-coder-family/

https://qwenlm.github.io/blog/qwen2. 5-coder-family/

work page
[8]

DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning

DeepSeek-AI, “Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning,”arXiv preprint arXiv:2501.12948, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[9]

Introducing OpenAI o3 and o4-mini,

“Introducing OpenAI o3 and o4-mini,” 2023. https: //openai.com/index/introducing-o3-and-o4-m ini/

work page 2023
[10]

https://deepmind.google/tech nologies/gemini/

“Gemini,” 2025. https://deepmind.google/tech nologies/gemini/

work page 2025
[11]

Qwen3: Think Deeper, Act Faster,

“Qwen3: Think Deeper, Act Faster,” 2025. https: //qwenlm.github.io/blog/qwen3/

work page 2025
[12]

Models - Openai Platform,

“Models - Openai Platform,” 2025. https://platfo rm.openai.com/docs/models

work page 2025
[13]

Fast and live model auto scaling with o(1) host caching,

D. Zhang, H. Wang, Y . Liu, X. Wei, Y . Shan, R. Chen, and H. Chen, “Fast and live model auto scaling with o(1) host caching,” inUSENIX OSDI, 2025

work page 2025
[14]

Muxserve: Flexible spatial- temporal multiplexing for multiple llm serving,

J. Duan, R. Lu, H. Duanmu, X. Li, X. Zhang, D. Lin, I. Stoica, and H. Zhang, “Muxserve: Flexible spatial- temporal multiplexing for multiple llm serving,” in ICML, 2024

work page 2024
[15]

Prism: Unleashing gpu sharing for cost-efficient multi- llm serving,

S. Yu, J. Xing, Y . Qiao, M. Ma, Y . Li, Y . Wang, S. Yang, Z. Xie, S. Cao, K. Bao, I. Stoica, H. Xu, and Y . Sheng, “Prism: Unleashing gpu sharing for cost-efficient multi- llm serving,”arXiv preprint arXiv:2505.04021, 2025

work page arXiv 2025
[16]

Serverlessllm: Low-latency server- less inference for large language models,

Y . Fu, L. Xue, Y . Huang, A.-O. Brabete, D. Ustiugov, Y . Patel, and L. Mai, “Serverlessllm: Low-latency server- less inference for large language models,” inUSENIX OSDI, 2024

work page 2024
[17]

Lambdas- cale: Enabling fast scaling for serverless large language model inference,

M. Yu, R. Yang, C. Jia, Z. Su, S. Yao, T. Lan, Y . Yang, Y . Cheng, W. Wang, A. Wang, and R. Chen, “Lambdas- cale: Enabling fast scaling for serverless large language model inference,”arXiv preprint arXiv:2502.09922, 2025

work page arXiv 2025
[18]

Towards swift serverless llm cold starts with paraserve,

C. Lou, S. Qi, C. Jin, D. Nie, H. Yang, X. Liu, and X. Jin, “Towards swift serverless llm cold starts with paraserve,” arXiv preprint arXiv:2502.15524, 2025

work page arXiv 2025
[19]

Tor- por: Gpu-enabled serverless computing for low-latency, resource-efficient inference,

M. Yu, A. Wang, D. Chen, H. Yu, X. Luo, Z. Li, W. Wang, R. Chen, D. Nie, H. Yang, and Y . Ding, “Tor- por: Gpu-enabled serverless computing for low-latency, resource-efficient inference,” inUSENIX ATC, 2025

work page 2025
[20]

Deepserve: Serverless large lan- guage model serving at scale,

J. Hu, J. Xu, Z. Liu, Y . He, Y . Chen, H. Xu, J. Liu, J. Meng, B. Zhang, S. Wan, G. Dan, Z. Dong, Z. Ren, C. Liu, T. Xie, D. Lin, Q. Zhang, Y . Yu, H. Feng, X. Chen, and Y . Shan, “Deepserve: Serverless large lan- guage model serving at scale,” inUSENIX ATC, 2025

work page 2025
[21]

Al- paServe: Statistical multiplexing with model parallelism for deep learning serving,

Z. Li, L. Zheng, Y . Zhong, V . Liu, Y . Sheng, X. Jin, Y . Huang, Z. Chen, H. Zhang, J. E. Gonzalez,et al., “Al- paServe: Statistical multiplexing with model parallelism for deep learning serving,” inUSENIX OSDI, 2023

work page 2023
[22]

Queue management for slo-oriented large language model serv- ing,

A. Patke, D. Reddy, S. Jha, H. Qiu, C. Pinto, C. Narayanaswami, Z. Kalbarczyk, and R. Iyer, “Queue management for slo-oriented large language model serv- ing,” inACM Symposium on Cloud Computing, 2024. 13

work page 2024
[23]

Burstgpt: A real-world workload dataset to optimize llm serving systems,

Y . Wang, Y . Chen, Z. Li, X. Kang, Z. Tang, X. He, R. Guo, X. Wang, Q. Wang, A. C. Zhou, and X. Chu, “Burstgpt: A real-world workload dataset to optimize llm serving systems,”arXiv preprint arXiv:2401.17644, 2024

work page arXiv 2024
[24]

Dynamollm: Designing llm inference clus- ters for performance and energy efficiency,

J. Stojkovic, C. Zhang, I. Goiri, J. Torrellas, and E. Choukse, “Dynamollm: Designing llm inference clus- ters for performance and energy efficiency,” inIEEE HPCA, March 2025

work page 2025
[25]

ServeGen: Workload Characterization and Generation of Large Language Model Serving in Production

Y . Xiang, X. Li, K. Qian, W. Yu, E. Zhai, and X. Jin, “Servegen: Workload characterization and generation of large language model serving in production,”arXiv preprint arXiv:2505.09999, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[26]

Loongserve: Efficiently serving long-context large lan- guage models with elastic sequence parallelism,

B. Wu, S. Liu, Y . Zhong, P. Sun, X. Liu, and X. Jin, “Loongserve: Efficiently serving long-context large lan- guage models with elastic sequence parallelism,” in ACM SOSP, 2024

work page 2024
[27]

Shuffleinfer: Disaggregate llm inference for mixed downstream workloads,

C. Hu, H. Huang, L. Xu, X. Chen, C. Wang, J. Xu, S. Chen, H. Feng, S. Wang, Y . Bao, N. Sun, and Y . Shan, “Shuffleinfer: Disaggregate llm inference for mixed downstream workloads,”ACM Transactions on Archi- tecture and Code Optimization, July 2025

work page 2025
[28]

Kunserve: Efficient parameter-centric memory manage- ment for llm serving,

R. Cheng, Y . Lai, X. Wei, R. Chen, and H. Chen, “Kunserve: Efficient parameter-centric memory manage- ment for llm serving,”arXiv preprint arXiv:2412.18169, 2025

work page arXiv 2025
[29]

Deepspeed-fastgen: High-throughput text generation for llms via MII and deepspeed-inference

C. Holmes, M. Tanaka, M. Wyatt, A. A. Awan, J. Rasley, S. Rajbhandari, R. Y . Aminabadi, H. Qin, A. Bakhtiari, L. Kurilenko, and Y . He, “Deepspeed-fastgen: High- throughput text generation for llms via mii and deepspeed-inference,”arXiv preprint arXiv:2401.08671, 2025

work page arXiv 2025
[30]

Kvcache cache in the wild: Characterizing and optimizing kvcache cache at a large cloud provider,

J. Wang, J. Han, X. Wei, S. Shen, D. Zhang, C. Fang, R. Chen, W. Yu, and H. Chen, “Kvcache cache in the wild: Characterizing and optimizing kvcache cache at a large cloud provider,” inUSENIX ATC, July 2025

work page 2025
[31]

Mooncake: Trading more storage for less computation — a KVCache-centric architecture for serving LLM chatbot,

R. Qin, Z. Li, W. He, J. Cui, F. Ren, M. Zhang, Y . Wu, W. Zheng, and X. Xu, “Mooncake: Trading more storage for less computation — a KVCache-centric architecture for serving LLM chatbot,” inUSENIX Conference on File and Storage Technologies, 2025

work page 2025
[32]

Tokenlake: A unified segment-level pre- fix cache pool for fine-grained elastic long-context llm serving,

B. Wu, Z. Zhang, Y . Zhong, G. Huang, Y . Zhu, X. Liu, and X. Jin, “Tokenlake: A unified segment-level pre- fix cache pool for fine-grained elastic long-context llm serving,”arXiv preprint arXiv:2508.17219, 2025

work page arXiv 2025
[33]

Xfaas: Hyperscale and low cost serverless functions at meta,

A. Sahraei, S. Demetriou, A. Sobhgol, H. Zhang, A. Nagaraja, N. Pathak, G. Joshi, C. Souza, B. Huang, W. Cook, A. Golovei, P. Venkat, A. Mcfague, D. Skar- latos, V . Patel, R. Thind, E. Gonzalez, Y . Jin, and C. Tang, “Xfaas: Hyperscale and low cost serverless functions at meta,” inACM SOSP, 2023

work page 2023
[34]

Rainbowcake: Miti- gating cold-starts in serverless with layer-wise container caching and sharing,

H. Yu, R. Basu Roy, C. Fontenot, D. Tiwari, J. Li, H. Zhang, H. Wang, and S.-J. Park, “Rainbowcake: Miti- gating cold-starts in serverless with layer-wise container caching and sharing,” inACM ASPLOS, 2024

work page 2024
[35]

Catalyzer: Sub-millisecond startup for serverless computing with initialization-less booting,

D. Du, T. Yu, Y . Xia, B. Zang, G. Yan, C. Qin, Q. Wu, and H. Chen, “Catalyzer: Sub-millisecond startup for serverless computing with initialization-less booting,” inACM ASPLOS, 2020

work page 2020
[36]

No provisioned concurrency: Fast RDMA- codesigned remote fork for serverless computing,

X. Wei, F. Lu, T. Wang, J. Gu, Y . Yang, R. Chen, and H. Chen, “No provisioned concurrency: Fast RDMA- codesigned remote fork for serverless computing,” in USENIX OSDI, 2023

work page 2023
[37]

Faascache: keeping serverless computing alive with greedy-dual caching,

A. Fuerst and P. Sharma, “Faascache: keeping serverless computing alive with greedy-dual caching,” inACM ASPLOS, 2021

work page 2021
[38]

Anthropic,

“Anthropic,” 2025.https://www.anthropic.com/

work page 2025
[39]

{MLaaS} in the wild: Workload analysis and scheduling in {Large- Scale} heterogeneous {GPU} clusters,

Q. Weng, W. Xiao, Y . Yu, W. Wang, C. Wang, J. He, Y . Li, L. Zhang, W. Lin, and Y . Ding, “{MLaaS} in the wild: Workload analysis and scheduling in {Large- Scale} heterogeneous {GPU} clusters,” inUSENIX NSDI, 2022

work page 2022
[40]

Pipeswitch: Fast pipelined context switching for deep learning applica- tions,

Z. Bai, Z. Zhang, Y . Zhu, and X. Jin, “Pipeswitch: Fast pipelined context switching for deep learning applica- tions,” inUSENIX OSDI, 2020

work page 2020
[41]

Fast Distributed Inference Serving for Large Language Models

B. Wu, Y . Zhong, Z. Zhang, S. Liu, F. Liu, Y . Sun, G. Huang, X. Liu, and X. Jin, “Fast distributed infer- ence serving for large language models,”arXiv preprint arXiv:2305.05920, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[42]

SuperServe: Fine-Grained inference serv- ing for unpredictable workloads,

A. Khare, D. Garg, S. Kalra, S. Grandhi, I. Stoica, and A. Tumanov, “SuperServe: Fine-Grained inference serv- ing for unpredictable workloads,” inUSENIX NSDI, 2025

work page 2025
[43]

Introducing low-level gpu virtual memory manage- ment,

“Introducing low-level gpu virtual memory manage- ment,” 2025. https://developer.nvidia.com /blog/introducing-low-level-gpu-virtual-m emory-management/

work page 2025
[44]

vattention: Dynamic memory management for serving llms without pagedattention,

R. Prabhu, A. Nayak, J. Mohan, R. Ramjee, and A. Pan- war, “vattention: Dynamic memory management for serving llms without pagedattention,” inACM ASPLOS, 2025

work page 2025
[45]

Forecasting seasonals and trends by ex- ponentially weighted moving averages,

C. C. Holt, “Forecasting seasonals and trends by ex- ponentially weighted moving averages,”International Journal of Forecasting, 2004. 14

work page 2004
[46]

P. R. Winters,Forecasting Sales by Exponentially Weighted Moving Averages. Springer Berlin Heidelberg, 1976

work page 1976
[47]

Multivariate short-term traffic flow forecasting using time-series anal- ysis,

B. Ghosh, B. Basu, and M. O’Mahony, “Multivariate short-term traffic flow forecasting using time-series anal- ysis,”Intelligent Transportation Systems, IEEE Trans- actions on, 2009

work page 2009
[48]

G. E. P. Box and G. M. Jenkins,Time Series Analysis: Forecasting and Control. Prentice Hall PTR, 5th ed., 2015

work page 2015
[49]

Short-term load fore- casting using a long short-term memory network,

C. Liu, Z. Jin, J. Gu, and C. Qiu, “Short-term load fore- casting using a long short-term memory network,” in ISGT-Europe, 2017

work page 2017
[50]

Efficient mem- ory management for large language model serving with pagedattention,

W. Kwon, Z. Li, S. Zhuang, Y . Sheng, L. Zheng, C. H. Yu, J. Gonzalez, H. Zhang, and I. Stoica, “Efficient mem- ory management for large language model serving with pagedattention,” inACM SOSP, 2023

work page 2023
[51]

“Ray: The AI Compute Engine,” 2025. https://www.ray.io

[52]

M. Lee, A. Jajoo, and R. R. Kompella, “Enabling elastic model serving with multiworld,” arXiv preprint arXiv:2407.08980, 2024.

[53]

“Llama 2: Open Foundation and Fine-Tuned Chat Models | Research - AI at Meta,” 2023. https://ai.meta.com/research/publications/llama-2-open-foundation-and-fine-tuned-chat-models

[54]

“CUDA Multi-Process Service.” https://docs.nvidia.com/deploy/pdf/CUDA_Multi_Process_Service_Overview.pdf

[55]

B. Lin, C. Zhang, T. Peng, H. Zhao, W. Xiao, M. Sun, A. Liu, Z. Zhang, L. Li, X. Qiu, S. Li, Z. Ji, T. Xie, Y. Li, and W. Lin, “Infinite-llm: Efficient llm service for long context with distattention and distributed kvcache,” arXiv preprint arXiv:2401.02669, 2024.

[56]

W. Lee, J. Lee, J. Seo, and J. Sim, “Infinigen: Efficient generative inference of large language models with dynamic KV cache management,” in USENIX OSDI, 2024.

[57]

V. M. Bhasi, J. R. Gunasekaran, P. Thinakaran, C. S. Mishra, M. T. Kandemir, and C. Das, “Kraken: Adaptive container provisioning for deploying dynamic dags in serverless platforms,” in ACM Symposium on Cloud Computing, 2021.

[58]

J. R. Gunasekaran, P. Thinakaran, N. C. Nachiappan, M. T. Kandemir, and C. R. Das, “Fifer: Tackling resource underutilization in the serverless era,” in Middleware, 2020.

[59]

X. Cai, Q. Sang, C. Hu, Y. Gong, K. Suo, X. Zhou, and D. Cheng, “Incendio: Priority-based scheduling for alleviating cold start in serverless computing,” IEEE Transactions on Computers, 2024.

[60]

M. Shahrad, R. Fonseca, I. Goiri, G. Chaudhry, P. Batum, J. Cooke, E. Laureano, C. Tresness, M. Russinovich, and R. Bianchini, “Serverless in the wild: Characterizing and optimizing the serverless workload at a large cloud provider,” in USENIX ATC, 2020.

[61]

J. Stojkovic, T. Xu, H. Franke, and J. Torrellas, “Specfaas: Accelerating serverless applications with speculative function execution,” in IEEE HPCA, 2023.

A TPOT CDF

We provide the TPOT CDF of systems under α = 2.0 in Figure 17. MuxServe incurs higher TPOT since it enlarges the parallelism degree of models and limits the computational power of an inst...

[1]

OpenAI, “Gpt-4 technical report,” arXiv preprint arXiv:2303.08774, 2023.

[2]

“GPT-4o System Card,” 2025. https://openai.com/index/gpt-4o-system-card/

[3]

DeepSeek-AI, “Deepseek-v3 technical report,” arXiv preprint arXiv:2412.19437, 2025.

[4]

A. Grattafiori, A. Dubey, A. Jauhri, A. Pandey, A. Kadian, A. Al-Dahle, et al., “The llama 3 herd of models,” arXiv preprint arXiv:2407.21783, 2024.

[5]

“Anthropic Claude,” 2025. https://www.anthropic.com/claude

[6, 7]

“Qwen2.5-Coder Series: Powerful, Diverse, Practical.” https://qwenlm.github.io/blog/qwen2.5-coder-family/

[8]

DeepSeek-AI, “Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning,” arXiv preprint arXiv:2501.12948, 2025.

[9]

“Introducing OpenAI o3 and o4-mini,” 2023. https://openai.com/index/introducing-o3-and-o4-mini/

[10]

“Gemini,” 2025. https://deepmind.google/technologies/gemini/

[11]

“Qwen3: Think Deeper, Act Faster,” 2025. https://qwenlm.github.io/blog/qwen3/

[12]

“Models - Openai Platform,” 2025. https://platform.openai.com/docs/models

[13]

D. Zhang, H. Wang, Y. Liu, X. Wei, Y. Shan, R. Chen, and H. Chen, “Fast and live model auto scaling with o(1) host caching,” in USENIX OSDI, 2025.

[14]

J. Duan, R. Lu, H. Duanmu, X. Li, X. Zhang, D. Lin, I. Stoica, and H. Zhang, “Muxserve: Flexible spatial-temporal multiplexing for multiple llm serving,” in ICML, 2024.

[15]

S. Yu, J. Xing, Y. Qiao, M. Ma, Y. Li, Y. Wang, S. Yang, Z. Xie, S. Cao, K. Bao, I. Stoica, H. Xu, and Y. Sheng, “Prism: Unleashing gpu sharing for cost-efficient multi-llm serving,” arXiv preprint arXiv:2505.04021, 2025.

[16]

Y. Fu, L. Xue, Y. Huang, A.-O. Brabete, D. Ustiugov, Y. Patel, and L. Mai, “Serverlessllm: Low-latency serverless inference for large language models,” in USENIX OSDI, 2024.

[17]

M. Yu, R. Yang, C. Jia, Z. Su, S. Yao, T. Lan, Y. Yang, Y. Cheng, W. Wang, A. Wang, and R. Chen, “Lambdascale: Enabling fast scaling for serverless large language model inference,” arXiv preprint arXiv:2502.09922, 2025.

[18]

C. Lou, S. Qi, C. Jin, D. Nie, H. Yang, X. Liu, and X. Jin, “Towards swift serverless llm cold starts with paraserve,” arXiv preprint arXiv:2502.15524, 2025.

[19]

M. Yu, A. Wang, D. Chen, H. Yu, X. Luo, Z. Li, W. Wang, R. Chen, D. Nie, H. Yang, and Y. Ding, “Torpor: Gpu-enabled serverless computing for low-latency, resource-efficient inference,” in USENIX ATC, 2025.

[20]

J. Hu, J. Xu, Z. Liu, Y. He, Y. Chen, H. Xu, J. Liu, J. Meng, B. Zhang, S. Wan, G. Dan, Z. Dong, Z. Ren, C. Liu, T. Xie, D. Lin, Q. Zhang, Y. Yu, H. Feng, X. Chen, and Y. Shan, “Deepserve: Serverless large language model serving at scale,” in USENIX ATC, 2025.

[21]

Z. Li, L. Zheng, Y. Zhong, V. Liu, Y. Sheng, X. Jin, Y. Huang, Z. Chen, H. Zhang, J. E. Gonzalez, et al., “AlpaServe: Statistical multiplexing with model parallelism for deep learning serving,” in USENIX OSDI, 2023.

[22]

A. Patke, D. Reddy, S. Jha, H. Qiu, C. Pinto, C. Narayanaswami, Z. Kalbarczyk, and R. Iyer, “Queue management for slo-oriented large language model serving,” in ACM Symposium on Cloud Computing, 2024.

[23]

Y. Wang, Y. Chen, Z. Li, X. Kang, Z. Tang, X. He, R. Guo, X. Wang, Q. Wang, A. C. Zhou, and X. Chu, “Burstgpt: A real-world workload dataset to optimize llm serving systems,” arXiv preprint arXiv:2401.17644, 2024.

[24]

J. Stojkovic, C. Zhang, I. Goiri, J. Torrellas, and E. Choukse, “Dynamollm: Designing llm inference clusters for performance and energy efficiency,” in IEEE HPCA, March 2025.

[25]

Y. Xiang, X. Li, K. Qian, W. Yu, E. Zhai, and X. Jin, “Servegen: Workload characterization and generation of large language model serving in production,” arXiv preprint arXiv:2505.09999, 2025.

[26]

B. Wu, S. Liu, Y. Zhong, P. Sun, X. Liu, and X. Jin, “Loongserve: Efficiently serving long-context large language models with elastic sequence parallelism,” in ACM SOSP, 2024.

[27]

C. Hu, H. Huang, L. Xu, X. Chen, C. Wang, J. Xu, S. Chen, H. Feng, S. Wang, Y. Bao, N. Sun, and Y. Shan, “Shuffleinfer: Disaggregate llm inference for mixed downstream workloads,” ACM Transactions on Architecture and Code Optimization, July 2025.

[28]

R. Cheng, Y. Lai, X. Wei, R. Chen, and H. Chen, “Kunserve: Efficient parameter-centric memory management for llm serving,” arXiv preprint arXiv:2412.18169, 2025.

[29]

C. Holmes, M. Tanaka, M. Wyatt, A. A. Awan, J. Rasley, S. Rajbhandari, R. Y. Aminabadi, H. Qin, A. Bakhtiari, L. Kurilenko, and Y. He, “Deepspeed-fastgen: High-throughput text generation for llms via mii and deepspeed-inference,” arXiv preprint arXiv:2401.08671, 2025.

[30]

J. Wang, J. Han, X. Wei, S. Shen, D. Zhang, C. Fang, R. Chen, W. Yu, and H. Chen, “Kvcache cache in the wild: Characterizing and optimizing kvcache cache at a large cloud provider,” in USENIX ATC, July 2025.

[31]

R. Qin, Z. Li, W. He, J. Cui, F. Ren, M. Zhang, Y. Wu, W. Zheng, and X. Xu, “Mooncake: Trading more storage for less computation — a KVCache-centric architecture for serving LLM chatbot,” in USENIX Conference on File and Storage Technologies, 2025.

[32]

B. Wu, Z. Zhang, Y. Zhong, G. Huang, Y. Zhu, X. Liu, and X. Jin, “Tokenlake: A unified segment-level prefix cache pool for fine-grained elastic long-context llm serving,” arXiv preprint arXiv:2508.17219, 2025.

[33]

A. Sahraei, S. Demetriou, A. Sobhgol, H. Zhang, A. Nagaraja, N. Pathak, G. Joshi, C. Souza, B. Huang, W. Cook, A. Golovei, P. Venkat, A. Mcfague, D. Skarlatos, V. Patel, R. Thind, E. Gonzalez, Y. Jin, and C. Tang, “Xfaas: Hyperscale and low cost serverless functions at meta,” in ACM SOSP, 2023.

[34]

H. Yu, R. Basu Roy, C. Fontenot, D. Tiwari, J. Li, H. Zhang, H. Wang, and S.-J. Park, “Rainbowcake: Mitigating cold-starts in serverless with layer-wise container caching and sharing,” in ACM ASPLOS, 2024.

[35]

D. Du, T. Yu, Y. Xia, B. Zang, G. Yan, C. Qin, Q. Wu, and H. Chen, “Catalyzer: Sub-millisecond startup for serverless computing with initialization-less booting,” in ACM ASPLOS, 2020.

[36]

X. Wei, F. Lu, T. Wang, J. Gu, Y. Yang, R. Chen, and H. Chen, “No provisioned concurrency: Fast RDMA-codesigned remote fork for serverless computing,” in USENIX OSDI, 2023.

[37]

A. Fuerst and P. Sharma, “Faascache: keeping serverless computing alive with greedy-dual caching,” in ACM ASPLOS, 2021.

[38]

“Anthropic,” 2025. https://www.anthropic.com/

[39]

Q. Weng, W. Xiao, Y. Yu, W. Wang, C. Wang, J. He, Y. Li, L. Zhang, W. Lin, and Y. Ding, “MLaaS in the wild: Workload analysis and scheduling in Large-Scale heterogeneous GPU clusters,” in USENIX NSDI, 2022.

[40]

Z. Bai, Z. Zhang, Y. Zhu, and X. Jin, “Pipeswitch: Fast pipelined context switching for deep learning applications,” in USENIX OSDI, 2020.

[41]

B. Wu, Y. Zhong, Z. Zhang, S. Liu, F. Liu, Y. Sun, G. Huang, X. Liu, and X. Jin, “Fast distributed inference serving for large language models,” arXiv preprint arXiv:2305.05920, 2023.

[42]

A. Khare, D. Garg, S. Kalra, S. Grandhi, I. Stoica, and A. Tumanov, “SuperServe: Fine-Grained inference serving for unpredictable workloads,” in USENIX NSDI, 2025.

[43]

“Introducing low-level gpu virtual memory management,” 2025. https://developer.nvidia.com/blog/introducing-low-level-gpu-virtual-memory-management/

[44]

R. Prabhu, A. Nayak, J. Mohan, R. Ramjee, and A. Panwar, “vattention: Dynamic memory management for serving llms without pagedattention,” in ACM ASPLOS, 2025.

[45]

C. C. Holt, “Forecasting seasonals and trends by exponentially weighted moving averages,” International Journal of Forecasting, 2004.

[46]

P. R. Winters, Forecasting Sales by Exponentially Weighted Moving Averages. Springer Berlin Heidelberg, 1976.
