Efficient Serving of LLM Applications with Probabilistic Demand Modeling

Chen Chen; Minyi Guo; Shixuan Sun; Weiye Wang; Xusheng Chen; Yifei Liu; Yifei Zhu; Yizhou Shan; Zhenghao Gan; Zhenhua Han

arxiv: 2506.14851 · v1 · pith:KITIJ35Tnew · submitted 2025-06-17 · 💻 cs.DC · cs.AI · cs.LG


Yifei Liu, Zuo Gan, Zhenghao Gan, Weiye Wang, Chen Chen, Yizhou Shan, Xusheng Chen, Zhenhua Han, Yifei Zhu, Shixuan Sun, Minyi Guo


classification 💻 cs.DC cs.AI cs.LG

keywords applications demand serving completion hermes pdgraph probabilistic time



Applications based on Large Language Models (LLMs) contain a series of tasks that address real-world problems with boosted capability, and these tasks place dynamic demand volumes on diverse backends. Existing serving systems treat the resource demands of LLM applications as a black box, compromising end-to-end efficiency due to improper queuing order and backend warm-up latency. We find that the resource demands of LLM applications can be modeled in a general and accurate manner with a Probabilistic Demand Graph (PDGraph). We then propose Hermes, which leverages PDGraph for efficient serving of LLM applications. Given a probabilistic demand description, Hermes applies the Gittins policy to determine the scheduling order that minimizes the average application completion time. It also uses the PDGraph model to prewarm cold backends at the proper moments. Experiments with diverse LLM applications confirm that Hermes effectively improves application serving efficiency, reducing the average completion time by over 70% and the P95 completion time by over 80%.
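
To make the scheduling idea in the abstract concrete, the following is a minimal sketch of the textbook Gittins index for a job whose total demand follows a known discrete distribution. It is not the Hermes implementation: the function name `gittins_index`, the `(demand, probability)` input format, and the example distributions are all assumptions chosen for illustration, and how Hermes derives such distributions from a PDGraph is not shown here.

```python
from typing import Sequence, Tuple


def gittins_index(dist: Sequence[Tuple[float, float]], attained: float = 0.0) -> float:
    """Textbook Gittins index of a job (hypothetical helper, not from the paper).

    dist: (total_demand, probability) pairs describing the job's demand.
    attained: service the job has already received.
    A larger index means the job should be scheduled earlier.
    """
    # Condition on the job not having finished yet: keep outcomes above `attained`.
    remaining = [(s - attained, p) for s, p in dist if s > attained]
    if not remaining:
        raise ValueError("job has no remaining demand under this distribution")
    total = sum(p for _, p in remaining)

    best = 0.0
    # For a discrete distribution it suffices to try each remaining-demand value
    # as the service quantum `delta`.
    for delta in sorted({r for r, _ in remaining}):
        p_finish = sum(p for r, p in remaining if r <= delta) / total
        expected_work = sum(p * min(r, delta) for r, p in remaining) / total
        best = max(best, p_finish / expected_work)
    return best


# Hypothetical applications: A is usually short but sometimes long, B is fixed.
app_a = [(1.0, 0.5), (10.0, 0.5)]
app_b = [(4.0, 1.0)]

print(gittins_index(app_a))                # 0.5  -> A runs first
print(gittins_index(app_b))                # 0.25
print(gittins_index(app_a, attained=1.0))  # about 0.111 -> B now runs first
```

In this example, application A has the larger mean demand (5.5 versus 4) yet is scheduled first, because it has a 50% chance of finishing after one unit of service. Once that unit is spent without completion, its index drops below B's and the order flips, which is the kind of decision a mean-based ordering cannot express.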



Forward citations

Cited by 2 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Maestro: Workload-Aware Cross-Cluster Scheduling for LLM-Based Multi-Agent Systems
cs.DC 2026-06 unverdicted novelty 6.0

Maestro is a workload-aware scheduler for LLM-based multi-agent systems that cuts KV-reservation HBM by 67.2% and raises high-contention SLO attainment by 23.6 points over EDF via prediction-driven hierarchical scheduling.

TokenCake: A KV-Cache-centric Serving Framework for LLM-based Multi-Agent Applications
cs.DC 2025-10 unverdicted novelty 6.0

TokenCake introduces agent-aware temporal and spatial schedulers for KV cache management in LLM multi-agent serving, claiming over 47% lower end-to-end latency and up to 16.9% better GPU memory utilization than vLLM o...