arxiv: 2505.04021 · v3 · pith:PO3H6CMWnew · submitted 2025-05-06 · 💻 cs.DC · cs.AI · cs.LG · cs.PF

Prism: Cost-Efficient Multi-LLM Serving via GPU Memory Ballooning

classification 💻 cs.DC · cs.AI · cs.LG · cs.PF
keywords memory · models · prism · across · ballooning · efficiency · kvcached · production

Inference providers must maintain availability for many LLMs, including low-volume but essential models, making resource efficiency increasingly important as token prices fall. Analysis of production traces reveals a dynamic bursty-group pattern in which sets of models become active together and shift over time; existing space- and time-sharing approaches lack principled mechanisms to adapt to this variability, forcing trade-offs between SLO adherence and efficiency. We observe that elastic memory allocation can unify spatial and temporal sharing. Based on this insight, we have developed Prism, a memory-centric LLM co-serving framework that applies memory ballooning to reclaim memory across models and support both forms of sharing under a single scheme. Prism's balloon driver, referred to as kvcached, has been open-sourced at https://github.com/ovg-project/kvcached, and deployed in production environments across 10K+ GPUs.
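The ballooning idea in the abstract can be pictured with a toy model. The sketch below is illustrative only and is not the kvcached API: the `BalloonPool` class, its methods, the page counts, and the model names are all hypothetical. It assumes GPU memory is divided into fixed-size pages that co-located models draw from a single shared pool.

```python
class BalloonPool:
    """Toy shared GPU memory pool, counted in fixed-size pages (hypothetical)."""

    def __init__(self, total_pages):
        self.total_pages = total_pages
        self.used = {}  # model name -> pages currently held for its KV cache

    def free_pages(self):
        """Pages not held by any model."""
        return self.total_pages - sum(self.used.values())

    def grow(self, model, pages):
        """Grant `model` up to `pages` more pages; return the number granted."""
        granted = min(pages, self.free_pages())
        self.used[model] = self.used.get(model, 0) + granted
        return granted

    def reclaim(self, model, pages):
        """Inflate the balloon inside `model`: take back up to `pages` pages."""
        taken = min(pages, self.used.get(model, 0))
        self.used[model] = self.used.get(model, 0) - taken
        return taken


pool = BalloonPool(total_pages=100)
pool.grow("model_a", 70)            # model_a is serving a burst
pool.grow("model_b", 30)            # model_b holds the remaining pages
pool.reclaim("model_b", 25)         # model_b goes idle, so its pages are reclaimed
granted = pool.grow("model_a", 40)  # model_a asks for 40 but only 25 are free
```

In this toy, reclaiming part of a model's memory corresponds to spatial sharing, and reclaiming nearly all of an idle model's memory approximates temporal sharing. That is one way to read the abstract's claim that elastic allocation covers both under a single scheme.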

Forward citations

Cited by 13 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Coral: Cost-Efficient Multi-LLM Serving over Heterogeneous Cloud GPUs

    cs.DC 2026-05 unverdicted novelty 7.0

    Coral cuts multi-LLM serving costs by up to 2.79x and raises goodput by up to 2.39x on heterogeneous GPUs through adaptive joint optimization and a lossless two-stage decomposition that solves quickly.

  2. CacheFlow: Efficient LLM Serving with 3D-Parallel KV Cache Restoration

    cs.DC 2026-04 unverdicted novelty 7.0

    CacheFlow cuts TTFT by 10-62% in batched LLM serving via 3D-parallel KV cache restoration and a two-pointer scheduler that overlaps recompute and I/O.

  3. ROSE: Rollout On Serving GPUs via Cooperative Elasticity for Agentic RL

    cs.DC 2026-05 unverdicted novelty 6.0

    ROSE is a system for cooperative elasticity that co-locates serving and rollout models on shared GPUs, delivering 1.3-3.3x higher end-to-end throughput than fixed-resource baselines while preserving serving SLOs.

  4. ROSE: Rollout On Serving GPUs via Cooperative Elasticity for Agentic RL

    cs.DC 2026-05 unverdicted novelty 6.0

    ROSE delivers 1.2-3.3x higher end-to-end throughput for agentic RL by safely co-using underutilized serving GPUs for rollouts while meeting serving SLOs.

  5. SPECTRE: Hybrid Ordinary-Parallel Speculative Serving for Resource-Efficient LLM Inference

    cs.DC 2026-05 unverdicted novelty 6.0

    SPECTRE achieves up to 2.28x speedup for large-model LLM serving by running speculative draft generation and target verification in parallel using idle tail-model services.

  6. SPECTRE: Hybrid Ordinary-Parallel Speculative Serving for Resource-Efficient LLM Inference

    cs.DC 2026-05 conditional novelty 6.0

    SPECTRE delivers up to 2.28x speedup on large-model LLM inference by turning idle tail-model services into remote speculative drafters using hybrid parallel decoding and priority scheduling.

  7. JigsawRL: Assembling RL Pipelines for Efficient LLM Post-Training

    cs.LG 2026-04 unverdicted novelty 6.0

    JigsawRL achieves up to 1.85x higher throughput in LLM RL pipelines via pipeline multiplexing, sub-stage graphs, and look-ahead scheduling compared to prior systems.

  8. Scepsy: Serving Agentic Workflows Using Aggregate LLM Pipelines

    cs.DC 2026-04 unverdicted novelty 6.0

    Scepsy schedules arbitrary multi-LLM agentic workflows on GPU clusters by constructing Aggregate LLM Pipelines from stable per-LLM execution time shares, then searching fractional GPU allocations, tensor parallelism, ...

  9. Valve: Production Online-Offline Inference Colocation with Jointly-Bounded Preemption Latency and Rate

    cs.OS 2026-04 unverdicted novelty 6.0

    Valve jointly bounds preemption latency and rate for online-offline LLM colocation on GPUs, delivering 34.6% higher cluster utilization and a 2,170-GPU saving in a production deployment of 8,054 GPUs with under 5% TTF...

  10. Foundry: Template-Based CUDA Graph Context Materialization for Fast LLM Serving Cold Start

    cs.DC 2026-04 unverdicted novelty 6.0

    Foundry uses template-based CUDA graph context materialization to reduce LLM serving cold-start latency by up to 99% while preserving CUDA graph throughput gains.

  11. The Energy Cost of Execution-Idle in GPU Clusters

    cs.DC 2026-04 unverdicted novelty 6.0

    Execution-idle accounts for 19.7% of GPU execution time and 10.7% of energy in a large cluster, motivating power management that treats it as a distinct operating state.

  12. WarmServe: Enabling One-for-Many GPU Prewarming for Multi-LLM Serving

    cs.DC 2025-12 unverdicted novelty 6.0

    WarmServe reduces tail TTFT by up to 50.8× versus autoscaling and supports 2.5× higher throughput than GPU-sharing by using one-for-many prewarming, model placement, KV cache reservation, and efficient tensor switching.

  13. Towards Multi-Model LLM Schedulers: Empirical Insights into Offloading and Preemption

    cs.AI 2026-05 unverdicted novelty 5.0

    Empirical study finds non-linear, model-size-dependent throughput degradation from offloading and high model-state reload costs from preemption in multi-LLM serving.