arxiv: 2506.14851 · v1 · pith:KITIJ35Tnew · submitted 2025-06-17 · 💻 cs.DC · cs.AI · cs.LG

Efficient Serving of LLM Applications with Probabilistic Demand Modeling

classification 💻 cs.DC · cs.AI · cs.LG
keywords applications · demand · serving · completion · hermes · pdgraph · probabilistic · time

Applications based on Large Language Models (LLMs) contain a series of tasks that address real-world problems with boosted capability, and these tasks have dynamic demand volumes on diverse backends. Existing serving systems treat the resource demands of LLM applications as a black box, compromising end-to-end efficiency due to improper queuing order and backend warm-up latency. We find that the resource demands of LLM applications can be modeled in a general and accurate manner with the Probabilistic Demand Graph (PDGraph). We then propose Hermes, which leverages PDGraph for efficient serving of LLM applications. Given this probabilistic demand description, Hermes applies the Gittins policy to determine the scheduling order that minimizes the average application completion time. It also uses the PDGraph model to prewarm cold backends at the proper moments. Experiments with diverse LLM applications confirm that Hermes effectively improves application serving efficiency, reducing the average completion time by over 70% and the P95 completion time by over 80%.
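
To illustrate how a Gittins-style policy orders jobs whose demand is known only as a distribution, here is a minimal sketch. It uses the textbook Gittins index for a single job with a discrete demand distribution and is not the Hermes implementation; how Hermes derives distributions from PDGraph is not described in the abstract. The names `gittins_index` and `schedule_order`, and the example distributions, are assumptions made for illustration.

```python
from typing import Dict, List, Sequence, Tuple


def gittins_index(sizes: Sequence[float], probs: Sequence[float],
                  attained: float = 0.0) -> float:
    """Textbook Gittins index of a job with a discrete total-demand distribution.

    sizes/probs describe a hypothetical distribution of the job's total demand;
    attained is the service the job has already received.
    """
    # Condition on the job being unfinished: keep outcomes larger than attained.
    remaining = [(s - attained, p) for s, p in zip(sizes, probs)
                 if s > attained and p > 0]
    if not remaining:
        raise ValueError("no remaining demand under this distribution")
    total = sum(p for _, p in remaining)
    remaining = sorted((r, p / total) for r, p in remaining)

    best = 0.0
    # For a discrete distribution, the best service quantum is a support point.
    for quantum, _ in remaining:
        p_done = sum(p for r, p in remaining if r <= quantum)
        expected_spent = sum(p * min(r, quantum) for r, p in remaining)
        best = max(best, p_done / expected_spent)
    return best


def schedule_order(jobs: Dict[str, Tuple[Sequence[float], Sequence[float]]]) -> List[str]:
    """Return job names sorted by descending Gittins index (highest runs first)."""
    return sorted(jobs, key=lambda name: gittins_index(*jobs[name]), reverse=True)


# Hypothetical example: job "a" finishes after 1 unit half the time, else after 10;
# job "b" always needs 3 units.
jobs = {"a": ([1, 10], [0.5, 0.5]), "b": ([3], [1.0])}
order = schedule_order(jobs)
```

In this example job "a" is ranked first even though its expected demand (5.5) exceeds that of job "b" (3), because one unit of service completes it with probability 0.5. Once "a" has received that unit without finishing, its index drops to 1/9 and "b" would take priority.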

Forward citations

Cited by 2 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Maestro: Workload-Aware Cross-Cluster Scheduling for LLM-Based Multi-Agent Systems

    cs.DC 2026-06 unverdicted novelty 6.0

    Maestro is a workload-aware scheduler for LLM-based multi-agent systems that cuts KV-reservation HBM by 67.2% and raises high-contention SLO attainment by 23.6 points over EDF via prediction-driven hierarchical scheduling.

  2. TokenCake: A KV-Cache-centric Serving Framework for LLM-based Multi-Agent Applications

    cs.DC 2025-10 unverdicted novelty 6.0

    TokenCake introduces agent-aware temporal and spatial schedulers for KV cache management in LLM multi-agent serving, claiming over 47% lower end-to-end latency and up to 16.9% better GPU memory utilization than vLLM o...