arxiv: 2602.18755 · v3 · submitted 2026-02-21 · 💻 cs.DC

DualScale: Energy-Efficient Disaggregated LLM Serving via Phase-Aware Placement and DVFS

Omar Basit, Yunzhao Liu, Z. Jonny Kong, Y. Charlie Hu

Pith reviewed 2026-05-15 20:51 UTC · model grok-4.3

classification 💻 cs.DC

keywords LLM serving · energy efficiency · DVFS · disaggregation · prefill decode · model predictive control · GPU power management · service level objectives

The pith

A two-tier framework for disaggregated LLM serving cuts energy use by up to 48 percent during decode while still meeting TTFT and TPOT latency targets.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces DualScale, a system that optimizes energy for large language model inference split into separate prefill and decode phases on GPU clusters. It claims that coarse-scale decisions on where to place work, combined with fine-scale adjustments to GPU frequency, track rapid changes in demand more effectively than autoscaling or uniform DVFS alone. A sympathetic reader would care because current LLM serving consumes substantial power, and coordinated controls across phases could allow the same hardware to handle more requests without raising electricity costs or heat output. The approach relies on predictive models to set baseline placements and frequencies at longer intervals and then applies different adaptation rules per phase at each iteration. Evaluation on a 16-GPU H100 cluster serving Llama 3.3 70B with production-style traces shows the TTFT and TPOT targets are met, with reported energy reductions of up to 39 percent in prefill and 48 percent in decode relative to DistServe.

Core claim

DualScale is a two-tier energy optimization framework for disaggregated LLM serving. It jointly optimizes placement and DVFS across prefill and decode using predictive latency and power models. At coarse timescales, DualScale computes phase-aware placement and baseline frequencies that minimize energy while satisfying SLO constraints. At fine timescales, DualScale dynamically adapts GPU frequency per iteration using stage-specific control: model predictive control (MPC) for prefill to account for queue evolution and future TTFT impact, and lightweight slack-aware adaptation for decode to exploit its smoother, memory-bound dynamics. This hierarchical design enables coordinated control across timescales while preserving strict serving SLOs.
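The coarse-timescale step described above can be read as a constrained search: among candidate placements and baseline frequencies, keep those predicted to meet both SLOs and pick the one with the lowest predicted power. The sketch below illustrates that reading only. The names `Config`, `Prediction`, `choose_baseline`, and `toy_predict`, the candidate grid, and every constant are assumptions made for illustration; the paper's actual predictors are trained offline and its search procedure is not shown on this page.

```python
from dataclasses import dataclass
from itertools import product
from typing import Callable, Iterable, Optional


@dataclass(frozen=True)
class Config:
    prefill_gpus: int
    decode_gpus: int
    prefill_mhz: int
    decode_mhz: int


@dataclass(frozen=True)
class Prediction:
    ttft_s: float   # predicted tail time-to-first-token
    tpot_s: float   # predicted tail time-per-output-token
    power_w: float  # predicted average cluster power


def choose_baseline(
    candidates: Iterable[Config],
    predict: Callable[[Config], Prediction],
    ttft_slo_s: float,
    tpot_slo_s: float,
) -> Optional[Config]:
    """Return the lowest-power candidate that meets both SLOs, or None."""
    best, best_power = None, float("inf")
    for cfg in candidates:
        p = predict(cfg)
        if p.ttft_s > ttft_slo_s or p.tpot_s > tpot_slo_s:
            continue  # predicted to violate an SLO, so not eligible
        if p.power_w < best_power:
            best, best_power = cfg, p.power_w
    return best


def toy_predict(cfg: Config) -> Prediction:
    """Hypothetical analytic stand-in for the offline-trained models."""
    ttft = 4.0 / (cfg.prefill_gpus * cfg.prefill_mhz / 1000.0)
    tpot = 0.02 + 0.3 / (cfg.decode_gpus * cfg.decode_mhz / 1000.0)
    power = sum(
        gpus * (150.0 + 300.0 * (mhz / 1800.0) ** 3)
        for gpus, mhz in ((cfg.prefill_gpus, cfg.prefill_mhz),
                          (cfg.decode_gpus, cfg.decode_mhz))
    )
    return Prediction(ttft, tpot, power)


TOTAL_GPUS = 16                  # matches the cluster size quoted on this page
FREQS_MHZ = (1000, 1400, 1800)   # illustrative frequency levels
CANDIDATES = [
    Config(p, TOTAL_GPUS - p, fp, fd)
    for p, fp, fd in product((4, 8, 12), FREQS_MHZ, FREQS_MHZ)
]

baseline = choose_baseline(CANDIDATES, toy_predict, ttft_slo_s=0.5, tpot_slo_s=0.05)
print(baseline)
```

Exhaustive enumeration is used here only because the toy grid is tiny; a real search space would likely need pruning or a solver.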

What carries the argument

DualScale's two-tier hierarchical control that separates coarse phase-aware placement and baseline frequency selection from fine per-iteration DVFS using MPC for prefill and slack-aware adaptation for decode.
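To make the decode side of the fine tier concrete, here is a minimal sketch of one possible slack-aware rule: at each iteration, pick the lowest frequency whose predicted iteration latency still fits inside the TPOT target minus a safety margin. This is an assumed rule, not the paper's algorithm; `pick_decode_frequency`, `toy_decode_latency_s`, the frequency levels, and the margin are all hypothetical.

```python
from typing import Callable, Sequence


def pick_decode_frequency(
    levels_mhz: Sequence[int],
    predict_iter_latency_s: Callable[[int], float],
    tpot_slo_s: float,
    safety_margin: float = 0.1,
) -> int:
    """Lowest frequency predicted to keep iteration latency within budget.

    Falls back to the highest level when no level fits the budget.
    """
    budget_s = tpot_slo_s * (1.0 - safety_margin)
    for mhz in sorted(levels_mhz):
        if predict_iter_latency_s(mhz) <= budget_s:
            return mhz
    return max(levels_mhz)


def toy_decode_latency_s(mhz: int) -> float:
    """Hypothetical model: a fixed memory-bound term plus a clock-dependent term."""
    return 0.020 + 6.0 / mhz


LEVELS_MHZ = (600, 900, 1200, 1500, 1800)  # illustrative levels

chosen = pick_decode_frequency(LEVELS_MHZ, toy_decode_latency_s, tpot_slo_s=0.030)
fallback = pick_decode_frequency(LEVELS_MHZ, toy_decode_latency_s, tpot_slo_s=0.020)
print(chosen, fallback)
```

Because the toy latency has a large frequency-independent term, lowering the clock costs little latency, which is the kind of slack a memory-bound phase would offer.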

If this is right

Energy use drops by as much as 39 percent in the prefill phase and 48 percent in the decode phase relative to DistServe, the disaggregated baseline.
Strict TTFT and TPOT service level objectives continue to be met under production-style workload traces.
The system tracks fast workload changes more closely than autoscaling or single-tier DVFS because placement and frequency decisions are coordinated across two time scales.
Separate control rules for prefill and decode preserve the latency benefits of disaggregation while adding energy savings.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same separation of coarse placement from fine frequency control could be tested on other workloads that show distinct compute-bound and memory-bound phases, such as certain database queries or video processing pipelines.
If the predictive models are replaced with online learning versions, the framework might adapt to new model architectures without manual retuning of parameters.
Extending the approach to clusters with heterogeneous GPUs would require only updating the power and latency predictors rather than redesigning the placement logic.

Load-bearing premise

The predictive latency and power models must accurately capture the different dynamics of the prefill and decode phases and the interactions between placement choices and frequency settings.

What would settle it

Measurements on the same 16x H100 cluster with the production traces that show energy savings below 20 percent or more than occasional SLO violations would show the models or controls do not deliver the claimed benefits.

Figures

Figures reproduced from arXiv: 2602.18755 by Omar Basit, Y. Charlie Hu, Yunzhao Liu, Z. Jonny Kong.

**Figure 1.** The RPS timelines of the Azure LLM inference trace [7] over 10 hours, 10 minutes, and 1 minute.

Surrounding text: … smoother and predominantly memory-bound behavior, DualScale applies lightweight per-iteration frequency adaptation to safely harvest slack. Both tiers rely on data-driven iteration-level latency and power models trained offline, enabling accurate prediction of performance and energy across configurations. Toge…

**Figure 2.** Variance-time plot of request-per-second (RPS) in the Azure LLM inference trace [7]. The trace exhibits notable fluctuation across both short and long timescales, with slightly greater variance observed at shorter timescales.

[Plot residue from the same page: panels RPS, Batch Size Prefill, and Batch Size Decode over Time (minutes).]

**Figure 3.** Number of running requests in prefill and decode instances plotted with workload in RPS.

Surrounding text: … that the workload exhibits burstiness consistently across different timescales – from hours to minutes to seconds. To quantify this workload fluctuation, we show the normalized variance-time plot [13, 19] of the full trace, which spans 7 days. To calculate normalized variance-time, we first divide the trace into non-o…
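The description of the normalized variance-time calculation is cut off on this page, so the following is a sketch of the standard construction from the self-similar traffic literature rather than the authors' exact procedure. It assumes non-overlapping blocks of size m, averages each block, and divides the variance of the block means by the variance of the unaggregated series. The function name `normalized_variance_time` and the sample series are hypothetical.

```python
from statistics import fmean, pvariance
from typing import Dict, Sequence


def normalized_variance_time(
    series: Sequence[float], block_sizes: Sequence[int]
) -> Dict[int, float]:
    """Variance of block means at each block size, relative to block size 1."""
    base = pvariance(series)
    if base == 0:
        raise ValueError("series has zero variance; nothing to normalize")
    result: Dict[int, float] = {}
    for m in block_sizes:
        n_blocks = len(series) // m  # a trailing partial block is dropped
        if n_blocks < 2:
            continue  # not enough blocks to estimate a variance
        means = [fmean(series[i * m:(i + 1) * m]) for i in range(n_blocks)]
        result[m] = pvariance(means) / base
    return result


# Synthetic placeholder series, not trace data: strict alternation between
# 0 and 10 requests per second, whose fluctuation vanishes at m = 2.
rps = [0.0, 10.0] * 50
curve = normalized_variance_time(rps, block_sizes=(1, 2, 5, 10))
print(curve)
```

On a log-log plot of this curve against m, a slope shallower than -1 is the usual sign that burstiness persists across timescales.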

**Figure 4.** Architecture overview of DualScale. Both controllers rely on the offline-trained latency and power models to predict SLO feasibility and energy under candidate configurations, enabling coordinated, energy-efficient placement and DVFS decisions under dynamically varying workloads.

Surrounding text: 4.3 Tier 1: Coarse-Grained Provisioning. Tier 1 establishes an energy-minimizing operating point that guarantees SLO feasibility…

**Figure 5.** Results for various controlled workloads with constant average RPS.

**Figure 6.** The production workload in terms of RPS, divided into 5-minute chunks.

[Plot residue from the same page: panels (a) TTFT P99 (s) and (b) TPOT P99 (s), plus an energy panel whose label is truncated ("Energy Consumed Per To…"), each over Trace Time Range (mins); series: Distserve, PlaceOnly, DualScale.]

**Figure 7.** Results for 30-minute-long production traces at 67% (1st row) and 85% (2nd row) capacity of the system.

Surrounding text: … (6% on average) for decode, and 9% to 29% (15% on average) for prefill. Note that the -4% energy gain is from a 5-minute window where PlaceOnly violated the SLO. Hence, most of the incremental savings from DVFS come from prefill, which contributes about 2.5× more than decode. Comparing controlled and prod…

**Figure 8.** Decode over-configuration case (5-10 minutes 85% capacity workload).

[Plot residue: panels RPS, Frequency (MHz), Power (W), Total power (W), Batch size, and Batch latency (s) over Time (minutes); series: PlaceOnly, DualScale.]

**Figure 9.** Decode under-configuration case (20-25 minutes 85% capacity workload).

Surrounding text: … DualScale. This over-provisioning comes from load overprediction in the previous window, so PlaceOnly runs at unnecessarily high fixed frequencies. In contrast, DualScale applies DVFS online to lower frequency when slack exists, mitigating the energy penalty of over-provisioning. As a…

[Plot residue: panels RPS, Frequency (MHz), and Power; the remaining labels are truncated.]

**Figure 11.** Microscopic behavior of the prefill instance.

[Plot residue: panels Batches & tokens, MPC & Freq Adj, GPU Freqs (MHz), Power (W), and TBT (s) with the SLO line, over Time (s).]

**Figure 12.** Microscopic behavior of the decode instance. In the top figure, each request counts as N tokens, where N is the running request length.

Surrounding text: … compute demand drives frequency and thus power variations in DualScale, while PlaceOnly keeps a fixed frequency. During low-load periods, DualScale can reduce frequency aggressively to save energy, while during bursts or backlog growth it can temporarily raise frequency t…

**Figure 13.** Accuracy of latency and power models against measured values.

[Plot residue from the same page: (a) CDF of TTFT (ms), real vs. sim; (b) CDF of TPOT (ms); (c) Prefill energy consumption, predicted vs. actual energy for 10s period (kJ), MAPE=2.3%; a further predicted-vs-actual energy panel is truncated.]

**Figure 14.** Accuracy of the Tier-1 simulator against real runs, shown by TTFT CDF, TPOT CDF, and energy comparisons.

Surrounding text: … in the constant-frequency experiments (DistServe and PlaceOnly in §6.2.1) to evaluate model accuracy. Figure 13a and Figure 13b compare predicted and measured iteration latency. The latency model achieves mean absolute percentage error (MAPE) of 2.9% for prefill instances and 2.7% for decode instances…
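The MAPE values quoted in the figure text summarize how far the predictors sit from measurements. For readers who want to reproduce that kind of number on their own profiling data, a minimal sketch of the metric follows. The function name `mape` and the sample values are placeholders, not measurements from the paper.

```python
from typing import Sequence


def mape(actual: Sequence[float], predicted: Sequence[float]) -> float:
    """Mean absolute percentage error, in percent."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("inputs must be non-empty and of equal length")
    if any(a == 0 for a in actual):
        raise ValueError("MAPE is undefined when an actual value is zero")
    total = sum(abs((a - p) / a) for a, p in zip(actual, predicted))
    return 100.0 * total / len(actual)


# Placeholder latencies in milliseconds, chosen only to exercise the function.
measured_ms = [100.0, 200.0]
predicted_ms = [110.0, 190.0]
error_pct = mape(measured_ms, predicted_ms)
print(f"MAPE = {error_pct:.1f}%")
```

MAPE weights every sample by its own magnitude, so a few very short iterations can dominate the score; reporting the error distribution alongside it guards against that.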

Original abstract

Prefill/decode disaggregation is increasingly adopted in LLM serving to improve the latency-throughput tradeoff and meet strict TTFT and TPOT SLOs. However, LLM inference remains energy-hungry: autoscaling alone is too coarse-grained to track fast workload fluctuations, and applying fine-grained DVFS under disaggregation is complicated by phase-asymmetric dynamics and coupling between provisioning and frequency control. We present DualScale, a two-tier energy optimization framework for disaggregated LLM serving. DualScale jointly optimizes placement and DVFS across prefill and decode using predictive latency and power models. At coarse timescales, DualScale computes phase-aware placement and baseline frequencies that minimize energy while satisfying SLO constraints. At fine timescales, DualScale dynamically adapts GPU frequency per iteration using stage-specific control: model predictive control (MPC) for prefill to account for queue evolution and future TTFT impact, and lightweight slack-aware adaptation for decode to exploit its smoother, memory-bound dynamics. This hierarchical design enables coordinated control across timescales while preserving strict serving SLOs. Evaluation on a 16x H100 cluster serving Llama 3.3 70B with production-style traces shows that DualScale meets TTFT/TPOT SLOs while reducing energy by up to 39% in prefill and 48% in decode relative to DistServe.
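The abstract's case for MPC in prefill is that the controller should account for how the queue will evolve, not only for its current length. The toy receding-horizon sketch below shows that idea under loud assumptions: a forecast of arriving tokens per step, throughput linear in clock frequency, backlog drain time as a stand-in for TTFT impact, and brute-force search over a short horizon. None of `mpc_prefill_step`, `toy_power_w`, or the constants come from the paper.

```python
from itertools import product
from typing import Callable, Sequence


def mpc_prefill_step(
    queue_tokens: float,
    arrivals_forecast: Sequence[float],
    levels_mhz: Sequence[int],
    step_s: float,
    ttft_budget_s: float,
    tokens_per_s_per_mhz: float,
    power_w: Callable[[int], float],
) -> int:
    """Return the first action of the cheapest plan that stays within budget."""
    best_plan, best_energy = None, float("inf")
    for plan in product(levels_mhz, repeat=len(arrivals_forecast)):
        q, energy, ok = queue_tokens, 0.0, True
        for mhz, arriving in zip(plan, arrivals_forecast):
            rate = tokens_per_s_per_mhz * mhz          # tokens per second
            q = max(0.0, q + arriving - rate * step_s)  # backlog after this step
            energy += power_w(mhz) * step_s
            # Assumed proxy: time to drain the backlog bounds the added TTFT.
            if q / rate > ttft_budget_s:
                ok = False
                break
        if ok and energy < best_energy:
            best_plan, best_energy = plan, energy
    if best_plan is None:
        return max(levels_mhz)  # no feasible plan: run flat out
    return best_plan[0]         # receding horizon: apply only the first action


def toy_power_w(mhz: int) -> float:
    """Hypothetical power curve with a cubic dynamic term."""
    return 100.0 + 600.0 * (mhz / 1800.0) ** 3


COMMON = dict(levels_mhz=(600, 1200, 1800), step_s=0.1, ttft_budget_s=0.2,
              tokens_per_s_per_mhz=50.0, power_w=toy_power_w)

calm = mpc_prefill_step(5000.0, [2000.0, 2000.0], **COMMON)
burst_ahead = mpc_prefill_step(5000.0, [2000.0, 24000.0], **COMMON)
print(calm, burst_ahead)
```

With the same current backlog, the forecast burst makes the controller raise the clock one step early, because staying at the lowest level would leave too much backlog to absorb the burst within budget. A purely reactive rule looking only at the current queue would not see that.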

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

DualScale layers hierarchical DVFS and MPC control on disaggregated serving to claim 39-48% energy cuts, but the abstract leaves model validation and robustness details thin.

DualScale is a two-tier framework that jointly optimizes placement and DVFS for prefill and decode phases in disaggregated LLM serving. Coarse placement sets baseline frequencies to meet SLOs, while fine control uses MPC for prefill to handle queue evolution and simpler slack adaptation for decode to exploit its smoother dynamics. The paper evaluates this on a 16x H100 cluster with Llama 3.3 70B and production traces, reporting up to 39% energy reduction in prefill and 48% in decode versus DistServe while keeping TTFT and TPOT within bounds.

That combination of phase-aware placement plus tailored controllers is the concrete extension beyond prior disaggregation work. The hierarchical split across timescales is a reasonable response to the asymmetric behavior of the two phases. The evaluation setup with real traces and a production-scale cluster gives the claims a practical anchor.

The soft spot is the thin support for the predictive latency and power models that drive both tiers. The abstract mentions these models but supplies no validation metrics, error bars, sensitivity tests, or checks against workload shifts. If the models systematically under-estimate prefill queue growth or decode memory sensitivity, the reported savings would shrink or SLOs would slip. The stress-test note correctly flags this as the load-bearing assumption.

This work is aimed at systems builders tuning large inference clusters for power. Readers focused on energy-aware serving would pick up usable ideas from the control design even if the exact numbers need more scrutiny. It deserves peer review so the model accuracy and experimental controls can be examined in detail.

Referee Report

2 major / 1 minor

Summary. The paper presents DualScale, a two-tier energy optimization framework for disaggregated LLM serving. It jointly optimizes phase-aware placement and DVFS using predictive latency and power models: coarse-grained placement sets baseline frequencies to minimize energy under SLO constraints, while fine-grained control applies MPC for prefill (accounting for queue evolution) and slack-aware adaptation for decode (exploiting memory-bound dynamics). Evaluation on a 16x H100 cluster serving Llama 3.3 70B with production traces claims up to 39% energy reduction in prefill and 48% in decode relative to DistServe while meeting TTFT/TPOT SLOs.

Significance. If the predictive models prove accurate, the work offers a practical hierarchical approach to energy efficiency in LLM inference that addresses the limitations of coarse autoscaling and phase-asymmetric DVFS challenges in disaggregated systems. The explicit use of production traces and SLO-preserving claims strengthen its potential impact for real-world serving deployments.

major comments (2)

[Evaluation] Evaluation section: the abstract reports energy reductions of 39% (prefill) and 48% (decode) on a 16x H100 cluster, but provides no details on predictive model validation, error bars, statistical significance tests, or sensitivity analysis to workload assumptions; this directly weakens support for the central claim that the two-tier optimizer reliably meets SLOs without eroding savings.
[§4] §4 (framework description): the claim that MPC for prefill and slack-aware control for decode accurately capture phase-asymmetric coupling between placement and DVFS rests on unvalidated predictive latency/power models; any systematic under-estimation of queue evolution or memory-bound sensitivity would propagate to both placement and frequency decisions, undermining the reported energy savings.

minor comments (1)

[§3] Notation for baseline frequencies and MPC horizon/weights is introduced without explicit definition of their ranges or initialization procedure, making it difficult to reproduce the coarse-to-fine transition.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments on the evaluation and framework sections. We address each point below and will revise the manuscript to provide the requested validation details and analysis.

Referee: [Evaluation] Evaluation section: the abstract reports energy reductions of 39% (prefill) and 48% (decode) on a 16x H100 cluster, but provides no details on predictive model validation, error bars, statistical significance tests, or sensitivity analysis to workload assumptions; this directly weakens support for the central claim that the two-tier optimizer reliably meets SLOs without eroding savings.

Authors: We agree that the manuscript would benefit from explicit reporting of model validation, error bars, statistical tests, and sensitivity analysis. In the revised version we will add a new subsection (Evaluation §5.X) that reports: (1) mean absolute percentage error and R² for the latency and power predictors across profiled batch sizes and frequencies; (2) error bars and standard deviations from five repeated runs of each workload trace; (3) paired t-test results confirming that DualScale’s energy reductions versus DistServe are statistically significant (p < 0.01) while SLO violation rates remain statistically indistinguishable; and (4) sensitivity sweeps over trace intensity, SLO tightness, and model-size scaling. These additions will directly substantiate the reliability of the reported 39% / 48% savings.
revision: yes
Referee: [§4] §4 (framework description): the claim that MPC for prefill and slack-aware control for decode accurately capture phase-asymmetric coupling between placement and DVFS rests on unvalidated predictive latency/power models; any systematic under-estimation of queue evolution or memory-bound sensitivity would propagate to both placement and frequency decisions, undermining the reported energy savings.

Authors: We acknowledge that §4 currently presents the MPC and slack-aware controllers without accompanying validation of the underlying models. We will revise §4 to include: (a) a concise description of the offline profiling procedure used to fit the latency and power models; (b) quantitative validation results (prediction error distributions for queue length under prefill and for memory-bandwidth sensitivity under decode); and (c) a short robustness argument showing that the control policies remain SLO-compliant and energy-efficient even when model predictions are perturbed by their observed maximum error. This will make the phase-asymmetric coupling claim explicit and evidence-based.
revision: yes
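The simulated rebuttal promises a paired t-test over five repeated runs. As a rough illustration of what that check involves, the sketch below computes the paired t statistic by hand and compares it with the two-sided critical value for 4 degrees of freedom at p = 0.01 (about 4.604). The energy values are synthetic placeholders, not results from the paper, and `paired_t_statistic` is a hypothetical helper.

```python
from math import sqrt
from statistics import fmean, stdev
from typing import Sequence, Tuple


def paired_t_statistic(
    baseline: Sequence[float], treatment: Sequence[float]
) -> Tuple[float, int]:
    """Return (t statistic, degrees of freedom) for paired samples."""
    if len(baseline) != len(treatment) or len(baseline) < 2:
        raise ValueError("need two equally sized samples with at least 2 pairs")
    diffs = [b - t for b, t in zip(baseline, treatment)]
    spread = stdev(diffs)  # sample standard deviation of the differences
    if spread == 0.0:
        raise ValueError("all paired differences are identical")
    t_stat = fmean(diffs) / (spread / sqrt(len(diffs)))
    return t_stat, len(diffs) - 1


# Synthetic placeholder energies (kJ) for five paired runs; NOT paper data.
BASELINE_KJ = [10.0, 12.0, 11.0, 13.0, 12.0]
TREATMENT_KJ = [7.0, 8.0, 8.0, 9.0, 9.0]

t_stat, dof = paired_t_statistic(BASELINE_KJ, TREATMENT_KJ)
T_CRIT_DF4_P01 = 4.604  # two-sided critical value for 4 degrees of freedom
significant = abs(t_stat) > T_CRIT_DF4_P01
print(f"t = {t_stat:.2f}, dof = {dof}, significant at p < 0.01: {significant}")
```

Pairing matters because each run replays the same trace under both systems, so trace-to-trace variation cancels in the differences.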

Circularity Check

0 steps flagged

Derivation chain is self-contained with no circular reductions

The paper describes a two-tier optimizer that takes predictive latency and power models as inputs to compute phase-aware placement and DVFS settings, with coarse and fine timescale controls. Reported results are measured energy reductions (39% prefill, 48% decode) on a 16x H100 cluster against the external DistServe baseline while meeting TTFT/TPOT SLOs. No equations or derivations are shown that define outputs in terms of fitted parameters by construction, no self-citations are load-bearing for uniqueness or ansatzes, and no renaming of known results occurs. The framework is model-driven but the validation remains independent and externally falsifiable through direct measurements.

Axiom & Free-Parameter Ledger

2 free parameters · 1 axiom · 0 invented entities

Abstract-only review; framework rests on unstated predictive models and workload assumptions whose parameters and validation are not provided.

free parameters (2)

baseline frequencies: computed at coarse timescale to minimize energy subject to SLOs
MPC horizon and weights: parameters for predictive control in prefill phase

axioms (1)

domain assumption: Phase-asymmetric dynamics and provisioning-frequency coupling can be captured by predictive models. Invoked to justify separate MPC and slack-aware controllers.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost · Jcost functional equation and convexity · echoes

ECHOES: this paper passage has the same mathematical shape or conceptual pattern as the Recognition theorem, but is not a direct formal dependency.

DualScale jointly optimizes placement and DVFS across prefill and decode using predictive latency and power models... phase-aware placement and baseline frequencies that minimize energy while satisfying SLO constraints... stage-specific control: model predictive control (MPC) for prefill... lightweight slack-aware adaptation for decode

IndisputableMonolith/Foundation/ArrowOfTime · phase-specific workload characteristics and monotonicity · echoes

phase-asymmetric dynamics and coupling between provisioning and frequency control... prefill is typically compute-bound... decode is often memory-bandwidth-bound

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

KAIROS: Stateful, Context-Aware Power-Efficient Agentic Inference Serving
cs.DC · 2026-04 · unverdicted · novelty 6.0

KAIROS reduces power by 27% on average (up to 39.8%) for agentic AI inference by using long-lived context to jointly manage GPU frequency, concurrency, and request routing across instances.

Reference graph

Works this paper leans on

48 extracted references · 48 canonical work pages · cited by 1 Pith paper · 2 internal anchors

[1] [n. d.]. Nebius AI Cloud Platform. https://nebius.com/
[2] 2025. NVIDIA Management Library (NVML). https://developer.nvidia.com/management-library-nvml
[3] 2025. Taming the tail utilization of ads inference at Meta scale. https://engineering.fb.com/2024/07/10/production-engineering/tail-utilization-ads-inference-meta/?utm_source=chatgpt.com
[4] 2026. Reducing Cold Start Latency for LLM Inference with NVIDIA Run:ai Model Streamer. https://developer.nvidia.com/blog/reducing-cold-start-latency-for-llm-inference-with-nvidia-runai-model-streamer/
[5] Amey Agrawal, Nitin Kedia, Ashish Panwar, Jayashree Mohan, Nipun Kwatra, Bhargav Gulavani, Alexey Tumanov, and Ramachandran Ramjee. 2024. Taming Throughput-Latency tradeoff in LLM inference with Sarathi-Serve. In 18th USENIX Symposium on Operating Systems Design and Implementation (OSDI 24). 117–134.
[6] Anthropic. 2025. Claude Models Overview. https://docs.anthropic.com/en/docs/about-claude/models/overview
[7] Azure. 2025. Azure Public Dataset. https://github.com/Azure/AzurePublicDataset
[8] Lin Chen, Jinsong Li, Xiaoyi Dong, Pan Zhang, Conghui He, Jiaqi Wang, Feng Zhao, and Dahua Lin. 2024. Sharegpt4v: Improving large multi-modal models with better captions. In European Conference on Computer Vision. Springer, 370–387.
[9] Jae-Won Chung, Yile Gu, Insu Jang, Luoxi Meng, Nikhil Bansal, and Mosharaf Chowdhury. 2024. Reducing energy bloat in large model training. In Proceedings of the ACM SIGOPS 30th Symposium on Operating Systems Principles. 144–159.
[10] Daniel Crankshaw, Gur-Eyal Sela, Xiangxi Mo, Corey Zumar, Ion Stoica, Joseph Gonzalez, and Alexey Tumanov. 2020. InferLine: latency-aware provisioning and scaling for prediction serving pipelines. In Proc. of ACM SoCC. 477–491.
[11] Daniel Crankshaw, Xin Wang, Guanyu Zhou, Michael J Franklin, Joseph E Gonzalez, and Ion Stoica. 2020. InferLine: ML Inference Pipeline Provisioning and Management for Tight Latency SLOs. In 14th USENIX Symposium on Operating Systems Design and Implementation. 283–300.
[12] Yao Fu, Leyang Xue, Yeqi Huang, Andrei-Octavian Brabete, Dmitrii Ustiugov, Yuvraj Patel, and Luo Mai. 2024. ServerlessLLM: Low-Latency serverless inference for large language models. In 18th USENIX Symposium on Operating Systems Design and Implementation (OSDI 24). 135–153.
[13] Mark W Garrett and Walter Willinger. 1994. Analysis, modeling and generation of self-similar VBR video traffic. ACM SIGCOMM computer communication review 24, 4 (1994), 269–280.
[14] Ruihao Gong, Shihao Bai, Siyu Wu, Yunqian Fan, Zaijun Wang, Xiuhong Li, Hailong Yang, and Xianglong Liu. 2025. Past-Future Scheduler for LLM Serving under SLA Guarantees. In Proceedings of the 30th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 2. 798–813.
[15] Andreas Kosmas Kakolyris, Dimosthenis Masouros, Petros Vavaroutsos, Sotirios Xydis, and Dimitrios Soudris. 2025. throttLL’eM: Predictive GPU Throttling for Energy Efficient LLM Inference Serving. In 2025 IEEE International Symposium on High Performance Computer Architecture (HPCA). 1363–1378. doi:10.1109/HPCA61900.2025.00103
[16] Andreas Kosmas Kakolyris, Dimosthenis Masouros, Sotirios Xydis, and Dimitrios Soudris. 2024. SLO-Aware GPU DVFS for Energy-Efficient LLM Inference Serving. IEEE Computer Architecture Letters 23, 2 (2024), 150–153. doi:10.1109/LCA.2024.3406038
[17] Kimi Team. 2025. Kimi K2 Technical Report. arXiv preprint arXiv:2507.20534 (2025).
[18] Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, Ying Sheng, Lianmin Zheng, Cody Hao Yu, Joseph Gonzalez, Hao Zhang, and Ion Stoica. Efficient memory management for large language model serving with pagedattention. In Proceedings of the 29th Symposium on Operating Systems Principles. 611–626.
[20] Will E Leland, Murad S Taqqu, Walter Willinger, and Daniel V Wilson. On the self-similar nature of Ethernet traffic (extended version). IEEE/ACM Transactions on networking 2, 1 (2002), 1–15.
[22] Jiamin Li, Yimin Jiang, Yibo Zhu, Cong Wang, and Hong Xu. 2023. Accelerating distributed MoE training and inference with lina. In 2023 USENIX Annual Technical Conference (USENIX ATC 23). 945–959.
[23] Zhuohan Li, Lianmin Zheng, Yinmin Zhong, Vincent Liu, Ying Sheng, Xin Jin, Yanping Huang, Zhifeng Chen, Hao Zhang, Joseph E. Gonzalez, and Ion Stoica. 2023. AlpaServe: Statistical Multiplexing with Model Parallelism for Deep Learning Serving. In Proc. of USENIX OSDI. USENIX Association, Boston, MA, 663–679. https://www.usenix.org/conference/osdi23/presen...
[24] Chaofan Lin, Zhenhua Han, Chengruidong Zhang, Yuqing Yang, Fan Yang, Chen Chen, and Lili Qiu. 2024. Parrot: Efficient serving of LLM-based applications with semantic variable. In 18th USENIX Symposium on Operating Systems Design and Implementation (OSDI 24). 929–945.
[25] Aixin Liu, Bei Feng, Bing Xue, Bingxuan Wang, Bochao Wu, Chengda Lu, Chenggang Zhao, Chengqi Deng, Chenyu Zhang, Chong Ruan, et al. Deepseek-v3 technical report. arXiv preprint arXiv:2412.19437 16 (2024).
[27] Yuhan Liu, Hanchen Li, Yihua Cheng, Siddhant Ray, Yuyang Huang, Qizheng Zhang, Kuntai Du, Jiayi Yao, Shan Lu, Ganesh Ananthanarayanan, et al. 2024. Cachegen: Kv cache compression and streaming for fast large language model serving. In Proceedings of the ACM SIGCOMM 2024 Conference. 38–56.
[28] Tania Lorido-Botran, Jose Miguel-Alonso, and Jose A Lozano. 2014. A review of auto-scaling techniques for elastic applications in cloud environments. Journal of grid computing 12, 4 (2014), 559–592.
[29] Meta. 2024. Meta Llama 3. https://llama.meta.com/llama3
[30] Microsoft Research. 2025. The growing energy footprint of AI inference. https://www.microsoft.com/en-us/research/publication/energy-use-of-ai-inference-efficiency-pathways-and-test-time-compute/
[31] Chengyi Nie, Rodrigo Fonseca, and Zhenhua Liu. 2024. Aladdin: Joint Placement and Scaling for SLO-Aware LLM Serving. arXiv preprint arXiv:2405.06856 (2024).
[32] Chenxu Niu, Wei Zhang, Yongjian Zhao, and Yong Chen. 2025. Energy Efficient or Exhaustive? Benchmarking Power Consumption of LLM Inference Engines. 5, 2 (Aug. 2025), 56–62. doi:10.1145/3757892.3757900
[33] NVIDIA. 2025. NVIDIA Dynamo, A Low-Latency Distributed Inference Framework for Scaling Reasoning AI Models. https://developer.nvidia.com/blog/introducing-nvidia-dynamo-a-low-latency-distributed-inference-framework-for-scaling-reasoning-ai-models/
[34] OpenAI. 2025. GPT-5. https://openai.com/gpt-5
[35] Pratyush Patel, Esha Choukse, Chaojie Zhang, Íñigo Goiri, Brijesh Warrier, Nithish Mahalingam, and Ricardo Bianchini. 2024. Characterizing power management opportunities for llms in the cloud. In Proceedings of the 29th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 3. 207–222.
[36] Pratyush Patel, Esha Choukse, Chaojie Zhang, Aashaka Shah, Íñigo Goiri, Saeed Maleki, and Ricardo Bianchini. 2024. Splitwise: Efficient generative llm inference using phase splitting. In 2024 ACM/IEEE 51st Annual International Symposium on Computer Architecture (ISCA). IEEE, 118–132.
[37] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay. Scikit-learn: Machine Learning in Python. Journal of Machine Learning Research 12 (2011), 2825–2830.
[39] Samyam Rajbhandari, Jeff Rasley, Olatunji Ruwase, and Yuxiong He. DeepSpeed-MoE: Advancing Mixture-of-Experts Inference and Training to Power Next-Generation AI Scale. In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis (SC).
[41] Yixin Song, Zeyu Mi, Haotong Xie, and Haibo Chen. 2024. Powerinfer: Fast large language model serving with a consumer-grade gpu. In Proceedings of the ACM SIGOPS 30th Symposium on Operating Systems Principles. 590–606.
[42] Jovan Stojkovic, Esha Choukse, Chaojie Zhang, Inigo Goiri, and Josep Torrellas. 2024. Towards greener llms: Bringing energy-efficiency to the forefront of llm inference. arXiv preprint arXiv:2403.20306 (2024).
[43] Jovan Stojkovic, Chaojie Zhang, Íñigo Goiri, Josep Torrellas, and Esha Choukse. 2025. Dynamollm: Designing llm inference clusters for performance and energy efficiency. In 2025 IEEE International Symposium on High Performance Computer Architecture (HPCA). IEEE, 1348–1362.
[44] vLLM Project. 2025. Disaggregated Prefill V1. https://docs.vllm.ai/en/latest/features/disagg_prefill.html
[45] Bingyang Wu, Shengyu Liu, Yinmin Zhong, Peng Sun, Xuanzhe Liu, and Xin Jin. 2024. Loongserve: Efficiently serving long-context large language models with elastic sequence parallelism. In Proceedings of the ACM SIGOPS 30th Symposium on Operating Systems Principles. 640–654.
[46] Jie You, Jae-Won Chung, and Mosharaf Chowdhury. 2023. Zeus: Understanding and optimizing GPU energy consumption of DNN training. In 20th USENIX Symposium on Networked Systems Design and Implementation (NSDI 23). 119–139.
[47] Gyeong-In Yu, Joo Seong Jeong, Geon-Woo Kim, Soojeong Kim, and Byung-Gon Chun. 2022. Orca: A distributed serving system for Transformer-Based generative models. In Proc. of USENIX OSDI. 521–538.
[48] Yinmin Zhong, Shengyu Liu, Junda Chen, Jianbo Hu, Yibo Zhu, Xuanzhe Liu, Xin Jin, and Hao Zhang. 2024. DistServe: Disaggregating Prefill and Decoding for Goodput-optimized Large Language Model Serving. In Proc. of USENIX OSDI (2024). 193–210.

Appendix A Placement Configurations. Table 2 lists the full Tier 1 placement plans used in the time-varying produ...