Autopoiesis uses LLM-driven program synthesis to evolve serving policies online during deployment, delivering gains of up to 53% (34% on average) over prior LLM serving systems under runtime dynamics.
Efficient interactive LLM serving with proxy model-based sequence length prediction
9 Pith papers cite this work. Polarity classification is still indexing.
citation-role summary
citation-polarity summary
roles: background 1
polarities: background 1
representative citing papers
SuperInfer improves TTFT SLO attainment by up to 74.7% on GH200 Superchips via SLO-aware rotary scheduling (RotaSched) and full-duplex KV cache rotation (DuplexKV) over NVLink-C2C while preserving TBT and throughput.
BalanceRoute uses a piecewise-linear F-score (with optional short lookahead) for sticky request routing in LLM serving, reducing DP imbalance and raising end-to-end throughput versus vLLM baselines on production and Azure traces.
A queueing model derives stability conditions for LLM inference services under combined compute and KV cache memory limits, with experimental validation showing typical deviations under 10%.
LLM output lengths conditioned on a prompt form heavy-tailed distributions, so robust estimation from multiple samples outperforms single-sample labels for prediction.
CascadeInfer partitions LLM instances into length-specialized groups, uses dynamic programming for stage partitioning, and applies runtime refinement plus decentralized load balancing to cut latency and raise throughput.
Clairvoyant predicts LLM response lengths from 19 lexical features with an XGBoost classifier to enable SJF scheduling in serial backends, reporting 70-76% P50 latency reduction for short requests under high load.
STAR cuts P99 TPOT by 75.1% and raises goodput 2.63x via a lightweight hidden-state length predictor and dynamic decode rescheduling that combines current and predicted loads.
Festina reduces energy consumption by up to 56% for serverless LLM inference on shared GPUs while keeping TTFT/TBT SLO attainment within 2% of four state-of-the-art baselines.
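Several of the summaries above rely on the same idea: if output length can be predicted, a serial backend can run short requests first (shortest-job-first, SJF). The following is a minimal sketch of that idea only, not the method of any listed paper. The `Request` class, the `predicted_tokens` field, and the cost model of one time unit per generated token are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Request:
    name: str
    predicted_tokens: int  # hypothetical predictor output, known at scheduling time
    actual_tokens: int     # true output length, only known after generation


def mean_completion_time(order):
    """Mean completion time for requests served one at a time.

    Assumes a toy cost model: one time unit per generated token.
    """
    clock = 0
    total = 0
    for req in order:
        clock += req.actual_tokens
        total += clock
    return total / len(order)


def sjf_order(requests):
    """Order requests by predicted length, shortest first."""
    return sorted(requests, key=lambda r: r.predicted_tokens)


# Arrival order; the predictions are deliberately imperfect.
queue = [
    Request("a", predicted_tokens=900, actual_tokens=1000),
    Request("b", predicted_tokens=40, actual_tokens=50),
    Request("c", predicted_tokens=300, actual_tokens=200),
]

fcfs = mean_completion_time(queue)             # first-come-first-served
sjf = mean_completion_time(sjf_order(queue))   # shortest predicted job first
print(f"FCFS mean completion: {fcfs:.1f}, SJF mean completion: {sjf:.1f}")
```

In this toy queue the predictions are inexact, but they still rank the requests correctly, which is all SJF needs to keep the two short requests from waiting behind the long one.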
citing papers explorer
- CascadeInfer: Length-Aware Scheduling of LLM Serving with Low Latency and Load Balancing
- STAR: Decode-Phase Rescheduling for LLM Inference
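One summary in the list of citing papers notes that output lengths for a fixed prompt are heavy-tailed, so a single sampled length makes a noisy training label. The sketch below illustrates that point with the median as a stand-in robust estimator. The cited paper's actual estimator is not stated here, and both the `robust_length_label` helper and the sample values are hypothetical.

```python
from statistics import mean, median


def robust_length_label(sampled_lengths):
    """Return a length label that is insensitive to rare, very long generations.

    The median is used purely as an illustrative robust estimator.
    """
    return median(sampled_lengths)


# Hypothetical output lengths (in tokens) from five generations of one prompt.
# One generation is far longer than the rest, as a heavy tail would produce.
samples = [120, 135, 110, 2400, 128]

label = robust_length_label(samples)
print("median label:", label)
print("mean label:", mean(samples))
print("worst single-sample label:", max(samples))
```

A single-sample label would equal whichever of these values happened to be drawn, including the 2400-token outlier. The median stays near the typical length, while the mean is pulled toward the tail.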