Efficient Interactive LLM Serving with Proxy Model-based Sequence Length Prediction

Haoran Qiu, Weichao Mao, Archit Patke, Shengkun Cui, Saurabh Jha, Nedeljko Vujic, Zhenhua Liu, Chenyang Wang, Ioannis Stavrakakis, Stratis Ioannidis, David A · 2024 · arXiv 2404.08509

9 Pith papers cite this work. Polarity classification is still indexing.

citation-role summary

background 1

citation-polarity summary

background 1

representative citing papers

Autopoiesis: A Self-Evolving System Paradigm for LLM Serving Under Runtime Dynamics

cs.DC · 2026-04-08 · unverdicted · novelty 7.0

Autopoiesis uses LLM-driven program synthesis to evolve serving policies online during deployment, delivering up to 53% and average 34% gains over prior LLM serving systems under runtime dynamics.

SuperInfer: SLO-Aware Rotary Scheduling and Memory Management for LLM Inference on Superchips

cs.DC · 2026-01-28 · conditional · novelty 7.0

SuperInfer improves TTFT SLO attainment by up to 74.7% on GH200 Superchips via SLO-aware rotary scheduling (RotaSched) and full-duplex KV cache rotation (DuplexKV) over NVLink-C2C while preserving TBT and throughput.

Tackling the Data-Parallel Load Balancing Bottleneck in LLM Serving: Practical Online Routing at Scale

cs.DC · 2026-05-07 · unverdicted · novelty 6.0 · 2 refs

BalanceRoute uses a piecewise-linear F-score (with optional short lookahead) for sticky request routing in LLM serving, reducing DP imbalance and raising end-to-end throughput versus vLLM baselines on production and Azure traces.

A Queueing-Theoretic Framework for Stability Analysis of LLM Inference with KV Cache Memory Constraints

cs.LG · 2026-05-06 · unverdicted · novelty 6.0

A queueing model derives stability conditions for LLM inference services under combined compute and KV cache memory limits, with experimental validation showing typical deviations under 10%.

Robust Length Prediction: A Perspective from Heavy-Tailed Prompt-Conditioned Distributions

cs.LG · 2026-04-09 · unverdicted · novelty 6.0

LLM output lengths conditioned on a prompt form heavy-tailed distributions, so robust estimation from multiple samples outperforms single-sample labels for prediction.

CascadeInfer: Length-Aware Scheduling of LLM Serving with Low Latency and Load Balancing

cs.DC · 2025-12-22 · conditional · novelty 6.0

CascadeInfer partitions LLM instances into length-specialized groups, uses dynamic programming for stage partitioning, and applies runtime refinement plus decentralized load balancing to cut latency and raise throughput.

Clairvoyant: Predictive SJF Scheduling to Mitigate Head-of-Line Blocking in Serial LLM Backends

cs.DC · 2026-06-05 · unverdicted · novelty 5.0

Clairvoyant predicts LLM response lengths from 19 lexical features with an XGBoost classifier to enable SJF scheduling in serial backends, reporting 70-76% P50 latency reduction for short requests under high load.

STAR: Decode-Phase Rescheduling for LLM Inference

cs.DC · 2025-10-15 · unverdicted · novelty 5.0

STAR cuts P99 TPOT by 75.1% and raises goodput 2.63x via a lightweight hidden-state length predictor and dynamic decode rescheduling that combines current and predicted loads.

Energy-Aware Scheduling for Serverless LLM Serving on Shared GPUs

cs.DC · 2026-06-29 · unverdicted · novelty 4.0

Festina reduces energy consumption by up to 56% for serverless LLM inference on shared GPUs while keeping TTFT/TBT SLO attainment within 2% of four state-of-the-art baselines.

citing papers explorer

Showing 2 of 2 citing papers after filters.

CascadeInfer: Length-Aware Scheduling of LLM Serving with Low Latency and Load Balancing

cs.DC · 2025-12-22 · conditional · none · ref 21

STAR: Decode-Phase Rescheduling for LLM Inference

cs.DC · 2025-10-15 · unverdicted · none · ref 29
