SLO-Guard: Crash-Aware, Budget-Consistent Autotuning for SLO-Constrained LLM Serving
Pith reviewed 2026-05-10 05:56 UTC · model grok-4.3
The pith
SLO-Guard treats crashes as useful data so that a fixed tuning budget for latency-constrained LLM serving produces more consistent results.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
SLO-Guard pairs feasible-first Thermal Budget Annealing exploration with a warm-started Tree-structured Parzen Estimator exploitation phase whose handoff replays the entire history, including crashes encoded as extreme constraint violations. It adds a configuration-repair pass, a GPU-aware KV-cache memory guard, and a four-category crash taxonomy. On Qwen2-1.5B with vLLM 0.19 on A100 hardware, it matches uniform random search on feasibility and best latency, yet allocates more trials to the fast regime and reduces the cross-seed standard deviation of best latency by a factor of 4.4 under concurrent load.
What carries the argument
A two-phase optimizer that first explores with Thermal Budget Annealing to locate feasible regions while avoiding crashes, then exploits with a Tree-structured Parzen Estimator that reuses all prior observations, including those marked as extreme constraint violations.
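The handoff is the step most readers will want to pin down. Below is a minimal sketch of what "replaying history into a warm-started TPE" can look like, built on Optuna's TPE sampler [2]. The parameter names, bounds, and the 1e6 crash penalty are illustrative assumptions; the paper does not disclose its exact encoding here.

```python
import optuna
from optuna.distributions import FloatDistribution, IntDistribution

CRASH_PENALTY_MS = 1.0e6  # assumed magnitude for an "extreme constraint violation"

# Illustrative search space; the paper's actual vLLM knobs may differ.
SPACE = {
    "max_num_seqs": IntDistribution(1, 256),
    "gpu_memory_utilization": FloatDistribution(0.50, 0.95),
}

def warm_start_tpe(history, seed=0):
    """Replay TBA exploration history, crashes included, into a TPE study.

    `history` is a list of dicts: {"params": {...}, "latency_ms": float or None,
    "crashed": bool}. Crashed trials enter the surrogate as completed trials
    with a huge objective value rather than being discarded.
    """
    study = optuna.create_study(
        direction="minimize", sampler=optuna.samplers.TPESampler(seed=seed)
    )
    for obs in history:
        value = CRASH_PENALTY_MS if obs["crashed"] else obs["latency_ms"]
        study.add_trial(
            optuna.trial.create_trial(
                params=obs["params"], distributions=SPACE, value=value
            )
        )
    return study  # subsequent study.optimize(...) exploits the replayed history
```

Encoding a crash as a very large finite objective value pushes the TPE density model to treat the crashing region as firmly bad rather than unknown, which is one plausible reading of "crashes encoded as extreme constraint violations".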
Load-bearing premise
The four-category crash taxonomy, configuration-repair pass, and GPU-aware KV-cache guard must generalize beyond the tested Qwen2-1.5B plus vLLM 0.19 plus A100 combination, and the concurrent harness must reflect real production load patterns.
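The KV-cache guard is the most mechanical of the three components and can be approximated from first principles. A back-of-the-envelope sketch of what a GPU-aware guard might check before launching a trial follows; the headroom factor and the use of worst-case sequence counts are assumptions, not the paper's stated rule.

```python
def kv_cache_bytes(num_layers: int, num_kv_heads: int, head_dim: int,
                   max_model_len: int, max_num_seqs: int,
                   dtype_bytes: int = 2) -> int:
    """Worst-case KV-cache footprint: keys and values (factor 2) for every
    layer, head, token position, and concurrent sequence, at fp16 (2 bytes)."""
    return (2 * num_layers * num_kv_heads * head_dim
            * max_model_len * max_num_seqs * dtype_bytes)

def guard_ok(cfg: dict, free_gpu_bytes: int, headroom: float = 0.9) -> bool:
    """Reject a candidate configuration whose worst-case KV cache would not
    fit in available GPU memory; such a trial would likely crash at startup."""
    return kv_cache_bytes(**cfg) <= headroom * free_gpu_bytes
```

Even a crude guard like this converts what is plausibly one of the taxonomy's crash categories, an out-of-memory failure at engine startup, from an observed failure into a pre-trial rejection.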
What would settle it
Replicate the five-seed study with the same total trial budget and measure whether the count of fast-regime trials and the standard deviation of best latency become statistically indistinguishable between SLO-Guard and random search.
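The proposed replication is cheap to score. A sketch of the decisive comparison using SciPy's Mann-Whitney U test follows; the per-seed numbers are placeholders, not the paper's data, and must be replaced by the replication's measurements.

```python
from scipy.stats import mannwhitneyu
import statistics

# Placeholder per-seed outcomes for five seeds; substitute measured values.
fast_trials_slo_guard = [10, 11, 9, 10, 11]   # trials (of 15) in the fast regime
fast_trials_random    = [7, 8, 7, 8, 7]

# One-sided test, matching the paper's pre-specified direction.
stat, p = mannwhitneyu(fast_trials_slo_guard, fast_trials_random,
                       alternative="greater")
print(f"U={stat}, one-sided p={p:.3f}")

best_latency_slo_guard = [101.2, 103.5, 100.8, 102.1, 104.0]  # ms, placeholders
best_latency_random    = [99.0, 112.4, 95.3, 118.7, 104.9]
print("cross-seed sd:",
      statistics.stdev(best_latency_slo_guard),
      statistics.stdev(best_latency_random))
```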
Original abstract
Serving large language models under latency service-level objectives (SLOs) is a configuration-heavy systems problem with an unusually failure-prone search space: many plausible configurations crash outright or miss user-visible latency targets, and standard black-box optimizers treat these failures as wasted trials. We present SLO-Guard, a crash-aware autotuner for vLLM serving that treats crashes as first-class observations. SLO-Guard combines a feasible-first Thermal Budget Annealing (TBA) exploration phase with a warm-started Tree-structured Parzen Estimator (TPE) exploitation phase; the handoff replays all exploration history, including crashes encoded as extreme constraint violations. We additionally contribute a configuration-repair pass, a GPU-aware KV-cache memory guard, and a four-category crash taxonomy. We evaluate SLO-Guard on Qwen2-1.5B served with vLLM 0.19 on an NVIDIA A100 40GB. Across a pre-specified five-seed study, both SLO-Guard and uniform random search attain 75/75 feasibility with zero crashes under the corrected concurrent harness, and are statistically tied on best-achieved latency (Mann-Whitney two-sided p=0.84). SLO-Guard's advantage is in budget consistency: more trials in the fast-serving regime (10.20 vs. 7.40 out of 15; one-sided p=0.014) and higher post-handoff consistency (0.876 vs. 0.539; p=0.010). Under concurrent load, SLO-Guard's cross-seed standard deviation on best latency is 4.4x tighter than random search's (2.26 ms vs. 10.00 ms). A harness-replication analysis shows that the consistency findings survive an independent sequential-dispatch measurement condition. The central claim is not that SLO-Guard finds a better final configuration, but that it spends a fixed tuning budget more predictably once the fast regime has been found.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper presents SLO-Guard, a crash-aware autotuner for vLLM-based LLM serving under latency SLOs. It combines a feasible-first Thermal Budget Annealing (TBA) exploration phase with a warm-started Tree-structured Parzen Estimator (TPE) exploitation phase that replays all prior trials (including crashes encoded as extreme violations). Additional contributions are a four-category crash taxonomy, a configuration-repair pass, and a GPU-aware KV-cache memory guard. In a pre-specified five-seed study on Qwen2-1.5B with vLLM 0.19 on A100, both SLO-Guard and uniform random search achieve 75/75 feasibility with no crashes; they are statistically tied on best latency (Mann-Whitney p=0.84) but SLO-Guard shows superior budget consistency (more trials in the fast-serving regime, higher post-handoff consistency, and 4.4x tighter cross-seed latency variance), with supporting one-sided p-values and a sequential-dispatch harness replication.
Significance. If the consistency results hold, the work offers a practical advance for autotuning in failure-prone LLM serving spaces by prioritizing predictable use of a fixed tuning budget over merely finding a better final configuration. The statistical tests (Mann-Whitney, p-values), explicit acknowledgment of best-latency tie, and independent harness replication are strengths that increase credibility. The approach could influence production tuning pipelines where crashes and variance are costly.
Major comments (2)
- [Evaluation / Results] Evaluation section (results on fast-serving regime): the claim of 10.20 vs. 7.40 trials in the fast regime (p=0.014) is load-bearing for the central consistency argument, yet the exact latency threshold or quantitative definition of 'fast-serving regime' is not stated in the reported results or methods; without it the metric cannot be independently verified or reproduced.
- [§3] §3 (TBA-to-TPE handoff and crash encoding): the replay of exploration history with crashes as extreme constraint violations is described at a high level, but no detail is given on how the TPE surrogate model incorporates these values (e.g., the precise penalty magnitude or kernel handling); this step is central to the claimed advantage over random search and requires explicit pseudocode or equations for reproducibility.
Minor comments (3)
- [Abstract / Evaluation] The abstract and results mention 'one-sided p=0.014' and 'p=0.010' but should consistently name the test (Mann-Whitney) and confirm whether the one-sided direction was pre-specified.
- [Evaluation] The concurrent harness is used for the main results while a sequential-dispatch replication is reported as a check; the paper should clarify the exact differences in load modeling between the two and why the sequential version is only a replication rather than the primary condition.
- [Methods] Hyperparameters for TBA (budget schedule, temperature decay) and TPE (number of initial samples, acquisition function) are not listed; these are needed to reproduce the 15-trial budget experiments.
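Since the TBA reference [12] is still unpublished, the missing hyperparameters cannot be reconstructed from public sources. For concreteness, here is a generic annealing-style schedule of the kind a feasible-first explorer might use; every constant and the perturbation rule are assumptions, not the paper's method.

```python
import random

def annealing_temperatures(n_trials: int, t0: float = 1.0,
                           decay: float = 0.8) -> list[float]:
    """Exponentially decaying temperatures: early trials explore broadly,
    later trials stay near known-feasible configurations."""
    return [t0 * decay ** i for i in range(n_trials)]

def propose(feasible_cfg: dict, temperature: float, bounds: dict) -> dict:
    """Gaussian perturbation of a known-feasible config; the step size is
    scaled by temperature and each knob's range, then clipped to bounds."""
    out = {}
    for k, v in feasible_cfg.items():
        lo, hi = bounds[k]
        step = random.gauss(0.0, temperature * (hi - lo))
        out[k] = min(hi, max(lo, v + step))
    return out
```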
Simulated Author's Rebuttal
We thank the referee for the positive evaluation, the recognition of our statistical approach and replication efforts, and the recommendation for minor revision. The comments correctly identify areas where additional clarity will strengthen reproducibility. We address each major comment below and will revise the manuscript accordingly.
Point-by-point responses
Referee: [Evaluation / Results] Evaluation section (results on fast-serving regime): the claim of 10.20 vs. 7.40 trials in the fast regime (p=0.014) is load-bearing for the central consistency argument, yet the exact latency threshold or quantitative definition of 'fast-serving regime' is not stated in the reported results or methods; without it the metric cannot be independently verified or reproduced.
Authors: We agree that the quantitative definition of the fast-serving regime must be stated explicitly for reproducibility. This definition was applied in our analysis but was omitted from the manuscript text. In the revision we will add a precise statement of the latency threshold (and its relation to the target SLO) to both the methods and the results sections so that the reported trial counts and p-value can be independently verified. Revision: yes.
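Pending the revision, a reader can still parameterize the check themselves. A sketch of the counting metric with the threshold left explicit follows; tying the threshold to a fraction of the SLO is an assumption, not the paper's definition.

```python
def fast_regime_count(trial_latencies_ms, slo_ms, fraction=0.5):
    """Count trials whose measured latency clears an explicit 'fast' threshold.
    The paper reports 10.20 vs. 7.40 such trials out of 15; reproducing that
    requires knowing the threshold this function takes as input."""
    threshold = fraction * slo_ms  # assumed relation to the SLO
    return sum(lat <= threshold for lat in trial_latencies_ms)
```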
Referee: [§3] §3 (TBA-to-TPE handoff and crash encoding): the replay of exploration history with crashes as extreme constraint violations is described at a high level, but no detail is given on how the TPE surrogate model incorporates these values (e.g., the precise penalty magnitude or kernel handling); this step is central to the claimed advantage over random search and requires explicit pseudocode or equations for reproducibility.
Authors: We acknowledge that the description of crash encoding and the TPE handoff is currently high-level. In the revised manuscript we will expand §3 with the exact penalty magnitude used for crashes, the manner in which these values are passed to the TPE surrogate (including kernel handling), and pseudocode for the replay procedure. This will make the mechanism fully reproducible and clarify its contribution relative to random search. Revision: yes.
Circularity Check
No significant circularity
Full rationale
The paper is an empirical systems contribution. It defines SLO-Guard as a two-phase autotuner (TBA exploration followed by TPE exploitation with crash encoding and repair), then measures its behavior on a fixed harness against random search. All load-bearing claims (budget consistency, post-handoff stability, cross-seed variance) are backed by direct experimental counts, Mann-Whitney tests, and a replication check under sequential dispatch. No equations, fitted parameters, or self-citations are used to derive the reported metrics; the results are independent observations of the implemented system.
Reference graph
Works this paper leans on
[1] Amey Agrawal et al. Sarathi-Serve: Efficient LLM inference by piggybacking decodes with chunked prefills. arXiv preprint, 2024.
[2] Takuya Akiba, Shotaro Sano, Toshihiko Yanase, Takeru Ohta, and Masanori Koyama. Optuna: A next-generation hyperparameter optimization framework. In Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 2623–2631, 2019.
[3] Amazon Web Services. Amazon SageMaker automatic model tuning. https://docs.aws.amazon.com/sagemaker/latest/dg/automatic-model-tuning.html, 2025. Documentation page.
[4] James Bergstra, Rémi Bardenet, Yoshua Bengio, and Balázs Kégl. Algorithms for hyper-parameter optimization. In Advances in Neural Information Processing Systems 24, pages 2546–2554, 2011.
[5] Tri Dao, Daniel Y. Fu, Stefano Ermon, Atri Rudra, and Christopher Ré. FlashAttention: Fast and memory-efficient exact attention with IO-awareness. In Advances in Neural Information Processing Systems 35, pages 16344–16359, 2022.
[6] Michael A. Gelbart, Jasper Snoek, and Ryan P. Adams. Bayesian optimization with unknown constraints. arXiv preprint arXiv:1403.5607, 2014.
[7] Daniel Golovin, Benjamin Solnik, Subhodeep Moitra, David Kochanski, John Karro, and D. Sculley. Google Vizier: A service for black-box optimization. In Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 1487–1495, 2017.
[8] Frank Hutter, Holger H. Hoos, and Kevin Leyton-Brown. Sequential model-based optimization for general algorithm configuration. In Proceedings of the 5th International Conference on Learning and Intelligent Optimization, pages 507–523, 2011.
[9] Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, Ying Sheng, Lianmin Zheng, Cody Hao Yu, Joseph E. Gonzalez, Hao Zhang, and Ion Stoica. Efficient memory management for large language model serving with PagedAttention. In Proceedings of the 29th Symposium on Operating Systems Principles, pages 611–626, 2023.
[10] Benjamin Letham, Brian Karrer, Guilherme Ottoni, and Eytan Bakshy. Constrained Bayesian optimization with noisy experiments. Bayesian Analysis, 14(2):495–519, 2019.
[11] Christian Lysen. SLO-Guard: Crash-aware autotuning for LLM serving, 2026. https://github.com/Chrislysen/SLO-Guard.
[12] Christian Lysenstøen. Thermal budget annealing: Feasible-first exploration for constrained ML deployment, 2026. arXiv preprint pending.
[13] NVIDIA. GenAI-Perf: Generative AI performance measurement for Triton and OpenAI-compatible APIs. https://docs.nvidia.com/deeplearning/triton-inference-server/user-guide/docs/perf_benchmark/genai-perf-README.html, 2025. Documentation page.
[14] Vijay Janapa Reddi, Christine Cheng, David Kanter, Peter Mattson, Guenther Schmuelling, Carole-Jean Wu, Brian Anderson, Max Breughe, Maximilien Charlebois, William Chou, et al. MLPerf inference benchmark. arXiv preprint arXiv:1911.02549, 2020.
[15] Ying Sheng, Lianmin Zheng, Binhang Yuan, Zhuohan Li, Max Ryabinin, Beidi Chen, Clark Barrett, Joseph E. Gonzalez, Percy Liang, Christopher Ré, Ion Stoica, and Ce Zhang. FlexGen: High-throughput generative inference of large language models with a single GPU. In Proceedings of the 40th International Conference on Machine Learning, pages 31094–31116, 2023.
[16] David Stenger, Dominik Scheurenberg, and Sebastian Trimpe. Local Bayesian optimization for controller tuning with crash constraints. arXiv preprint arXiv:2411.16267, 2024.
[17] Luping Wang, Lingyun Yang, Yinghao Yu, Wei Wang, Bo Li, Xianchao Sun, Jian He, and Liping Zhang. Morphling: Fast, near-optimal auto-configuration for cloud-native model serving. In Proceedings of the 12th ACM Symposium on Cloud Computing, pages 639–653, 2021.
[18] Zhibin Wang, Shipeng Li, Yuhang Zhou, Xue Li, Rong Gu, Cam-Tu Nguyen, Chen Tian, and Sheng Zhong. Revisiting SLO and goodput metrics in LLM serving. arXiv preprint arXiv:2410.14257, 2024.
[19] Gyeong-In Yu, Joo Seong Jeong, Geon-Woo Kim, Soojeong Kim, and Byung-Gon Chun. Orca: A distributed serving system for transformer-based generative models. In 16th USENIX Symposium on Operating Systems Design and Implementation, pages 521–538, 2022.
[20] Lianmin Zheng, Zhuohan Li, Hao Zhang, Yonghao Zhuang, Zhifeng Chen, Yanping Huang, Yida Wang, Yuanzhong Xu, Danyang Zhuo, Eric P. Xing, Joseph E. Gonzalez, and Ion Stoica. Alpa: Automating inter- and intra-operator parallelism for distributed deep learning. In 16th USENIX Symposium on Operating Systems Design and Implementation, pages 559–578, 2022.
[21] Yinmin Zhong et al. DistServe: Disaggregating prefill and decoding for goodput-optimized large language model serving. arXiv preprint, 2024.