Efficient interactive LLM serving with proxy model-based sequence length prediction
8 Pith papers cite this work. Polarity classification is still indexing.
citation-role summary: background 1
citation-polarity summary: background 1
representative citing papers
SuperInfer improves TTFT SLO attainment by up to 74.7% on GH200 Superchips via SLO-aware rotary scheduling (RotaSched) and full-duplex KV cache rotation (DuplexKV) over NVLink-C2C while preserving TBT and throughput.
BalanceRoute uses a piecewise-linear F-score (with optional short lookahead) for sticky request routing in LLM serving, reducing DP imbalance and raising end-to-end throughput versus vLLM baselines on production and Azure traces.
A queueing model derives stability conditions for LLM inference services under combined compute and KV cache memory limits, with experimental validation showing typical deviations under 10%.
LLM output lengths conditioned on a prompt form heavy-tailed distributions, so robust estimation from multiple samples outperforms single-sample labels for prediction.
CascadeInfer partitions LLM instances into length-specialized groups, uses dynamic programming for stage partitioning, and applies runtime refinement plus decentralized load balancing to cut latency and raise throughput.
STAR cuts P99 TPOT by 75.1% and raises goodput 2.63x via a lightweight hidden-state length predictor and dynamic decode rescheduling that combines current and predicted loads.
Festina reduces energy consumption by up to 56% for serverless LLM inference on shared GPUs while keeping TTFT/TBT SLO attainment within 2% of four state-of-the-art baselines.
citing papers explorer
Autopoiesis: A Self-Evolving System Paradigm for LLM Serving Under Runtime Dynamics
Autopoiesis uses LLM-driven program synthesis to evolve serving policies online during deployment, delivering up to 53% and average 34% gains over prior LLM serving systems under runtime dynamics.