
arxiv: 2509.04827 · v3 · pith:ALPU2AHEnew · submitted 2025-09-05 · 💻 cs.DC · cs.AI · cs.LG

VoltanaLLM: Energy-Efficient and SLO-Aware Disaggregated LLM Serving via Adaptive Frequency Control and State-Space Routing

classification 💻 cs.DC · cs.AI · cs.LG
keywords voltanallm, serving, frequency, energy, routing, selection, state-space, decode

The energy cost of Large Language Model (LLM) inference is rapidly becoming a barrier to sustainable and scalable deployment. Although modern serving architectures expose distinct prefill and decode behaviors, existing systems fail to exploit these phase differences for energy-efficient serving under strict latency SLOs. This paper introduces VoltanaLLM, the first system that explicitly targets and reduces the energy bloat in modern prefill-decode (P/D) disaggregated LLM serving. Guided by a control-theory perspective, VoltanaLLM separates two levers: per-instance operating-point selection (GPU frequency per iteration) and system-level state-space routing of requests. We empirically observe that LLM inference exhibits a U-shaped energy-frequency curve, creating "sweet spots" that depend on phase behavior and load. VoltanaLLM exploits this by combining phase-specific, iteration-level frequency selection driven by a lightweight, online-adaptive latency predictor, with a decode state-space guided router that avoids architectural granularity-induced inefficiencies, all while meeting desired SLOs. We implement VoltanaLLM using SGLang and evaluate it across multiple models and real-world workloads. Our results show VoltanaLLM reduces end-to-end energy by up to 36.3% versus a static max-frequency baseline while maintaining high SLO attainment, and generalizes to newer GPUs. These results point to sustainable LLM serving via phase-aware, iteration-level frequency selection coupled with architecture-aware routing. Source code is available at https://github.com/Supercomputing-System-AI-Lab/VoltanaLLM.
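
The abstract pairs a U-shaped energy-frequency curve with a latency predictor that keeps frequency selection within the SLO. As a rough illustration of that idea only, the sketch below picks the lowest-energy frequency whose predicted latency meets a target. Everything in it is an assumption made for illustration: the function name `pick_frequency`, the toy latency and energy models, the candidate frequencies, and all constants are hypothetical and are not taken from VoltanaLLM or its source code.

```python
from typing import Callable, Sequence


def pick_frequency(
    freqs_mhz: Sequence[int],
    predict_latency_ms: Callable[[int], float],
    energy_j: Callable[[int], float],
    slo_ms: float,
) -> int:
    """Return the lowest-energy frequency whose predicted latency meets the SLO.

    Falls back to the highest candidate frequency when none meets the SLO.
    """
    feasible = [f for f in freqs_mhz if predict_latency_ms(f) <= slo_ms]
    if not feasible:
        return max(freqs_mhz)
    return min(feasible, key=energy_j)


# Toy models (hypothetical, for illustration only).
def toy_latency_ms(freq_mhz: int, work: float = 30000.0) -> float:
    # Assume iteration latency scales inversely with clock frequency.
    return work / freq_mhz


def toy_energy_j(freq_mhz: int) -> float:
    # A U-shaped curve: the first term grows as the clock drops (longer
    # runtime), the second grows as the clock rises (higher dynamic power).
    return 60000.0 / freq_mhz + 0.00005 * freq_mhz ** 2


FREQS = [600, 900, 1200, 1500, 1800]  # hypothetical candidate clocks in MHz

relaxed = pick_frequency(FREQS, toy_latency_ms, toy_energy_j, slo_ms=100.0)
tight = pick_frequency(FREQS, toy_latency_ms, toy_energy_j, slo_ms=22.0)
print(relaxed, tight)
```

With these toy numbers, a relaxed SLO lets the selector settle at the bottom of the curve (900 MHz), while a tighter SLO pushes it to the cheapest frequency that is still fast enough (1500 MHz). A real system would replace the fixed toy models with measured or online-updated predictors.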



Forward citations

Cited by 4 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. XWind: A Cross-site Router for Large Language Model Inference Serving at Renewable Energy Farms

    cs.DC 2026-05 unverdicted novelty 7.0

    XWind is a reactive cross-site router for LLM inference at wind farms that cuts P99 latency by up to 52% versus strong baselines in a 64-GPU emulation of three sites.

  2. SuperInfer: SLO-Aware Rotary Scheduling and Memory Management for LLM Inference on Superchips

    cs.DC 2026-01 conditional novelty 7.0

    SuperInfer improves TTFT SLO attainment by up to 74.7% on GH200 Superchips via SLO-aware rotary scheduling (RotaSched) and full-duplex KV cache rotation (DuplexKV) over NVLink-C2C while preserving TBT and throughput.

  3. KAIROS: Stateful, Context-Aware Power-Efficient Agentic Inference Serving

    cs.DC 2026-04 unverdicted novelty 6.0

    KAIROS reduces power by 27% on average (up to 39.8%) for agentic AI inference by using long-lived context to jointly manage GPU frequency, concurrency, and request routing across instances.

  4. Energy-Aware Scheduling for Serverless LLM Serving on Shared GPUs

    cs.DC 2026-06 unverdicted novelty 4.0

    Festina reduces energy consumption by up to 56% for serverless LLM inference on shared GPUs while keeping TTFT/TBT SLO attainment within 2% of four state-of-the-art baselines.