arXiv preprint arXiv:2404.08509

Efficient Interactive LLM Serving with Proxy Model-based Sequence Length Prediction · 2024 · arXiv 2404.08509

4 Pith papers cite this work. Polarity classification is still indexing.

representative citing papers

Autopoiesis: A Self-Evolving System Paradigm for LLM Serving Under Runtime Dynamics

cs.DC · 2026-04-08 · unverdicted · novelty 7.0

Autopoiesis uses LLM-driven program synthesis to evolve serving policies online during deployment, delivering gains of up to 53% (34% on average) over prior LLM serving systems under runtime dynamics.

Tackling the Data-Parallel Load Balancing Bottleneck in LLM Serving: Practical Online Routing at Scale

cs.DC · 2026-05-07 · unverdicted · novelty 6.0 · 2 refs

BalanceRoute uses a piecewise-linear F-score (with optional short lookahead) for sticky request routing in LLM serving, reducing DP imbalance and raising end-to-end throughput versus vLLM baselines on production and Azure traces.

A Queueing-Theoretic Framework for Stability Analysis of LLM Inference with KV Cache Memory Constraints

cs.LG · 2026-05-06 · unverdicted · novelty 6.0

A queueing model derives stability conditions for LLM inference services under combined compute and KV cache memory limits, with experimental validation showing typical deviations under 10%.

Robust Length Prediction: A Perspective from Heavy-Tailed Prompt-Conditioned Distributions

cs.LG · 2026-04-09 · unverdicted · novelty 6.0

LLM output lengths conditioned on a prompt form heavy-tailed distributions, so robust estimation from multiple samples outperforms single-sample labels for prediction.

citing papers explorer

Showing 4 of 4 citing papers.

Autopoiesis: A Self-Evolving System Paradigm for LLM Serving Under Runtime Dynamics · cs.DC · 2026-04-08 · unverdicted · none · ref 16
Tackling the Data-Parallel Load Balancing Bottleneck in LLM Serving: Practical Online Routing at Scale · cs.DC · 2026-05-07 · unverdicted · none · ref 14 · 2 links
A Queueing-Theoretic Framework for Stability Analysis of LLM Inference with KV Cache Memory Constraints · cs.LG · 2026-05-06 · unverdicted · none · ref 48
Robust Length Prediction: A Perspective from Heavy-Tailed Prompt-Conditioned Distributions · cs.LG · 2026-04-09 · unverdicted · none · ref 15
