Scott Gardner, Itay Hubara, Sachin Idgunji, Thomas B
5 Pith papers cite this work.
5 representative citing papers (2026)
-
SLO-Guard: Crash-Aware, Budget-Consistent Autotuning for SLO-Constrained LLM Serving
SLO-Guard improves tuning budget consistency for SLO-constrained LLM serving by handling crashes explicitly and using a two-phase feasible-first exploration plus exploitation strategy.
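The crash-aware, two-phase idea can be sketched in a few lines. Everything below — the config space, the `evaluate` callback, and the batch-size perturbation — is a hypothetical stand-in to illustrate the strategy, not SLO-Guard's actual implementation; the key points are that crashed trials still consume budget and that exploitation only starts from an SLO-feasible config.

```python
import random

def two_phase_tune(configs, evaluate, budget):
    """Feasible-first exploration, then exploitation, under a fixed trial budget.

    `evaluate(cfg)` returns (meets_slo, throughput) or raises RuntimeError on a
    crash. Crashed trials are still charged against the budget (crash-aware
    accounting), so a crash-heavy search space cannot silently overspend.
    """
    feasible, spent = [], 0
    # Phase 1: explore a shuffled config list until at least one feasible
    # (SLO-meeting) config is found and half the budget is used.
    for cfg in random.sample(configs, k=len(configs)):
        if spent >= budget:
            break
        if spent >= budget // 2 and feasible:
            break
        spent += 1  # every trial, including crashes, is charged
        try:
            ok, tput = evaluate(cfg)
        except RuntimeError:
            continue  # crash: budget spent, no result
        if ok:
            feasible.append((cfg, tput))
    # Phase 2: spend the remaining budget exploiting around the best feasible
    # config (here: perturbing a hypothetical "batch" knob by +/-1).
    best = max(feasible, key=lambda c: c[1], default=None)
    while best and spent < budget:
        spent += 1
        cand = dict(best[0], batch=best[0]["batch"] + random.choice([-1, 1]))
        try:
            ok, tput = evaluate(cand)
            if ok and tput > best[1]:
                best = (cand, tput)
        except RuntimeError:
            pass  # crash during exploitation is also charged
    return best
```

Because the budget counter is incremented before `evaluate` runs, the total number of serving-engine launches is bounded by `budget` regardless of how many configurations crash.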
-
A Switch-Centric In-Network Architecture for Accelerating LLM Inference in Shared-Memory Network
SCIN uses an in-switch accelerator for direct memory access and 8-bit in-network quantization during All-Reduce, delivering up to 8.7x faster small-message reduction and 1.74x TTFT speedup on LLaMA-2 models.
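The 8-bit in-network reduction step can be illustrated host-side with NumPy. The symmetric per-tensor scaling below is an illustrative quantization choice, not SCIN's actual wire format; the point is that each rank ships int8 payloads while the reducer accumulates in a wider type to avoid overflow.

```python
import numpy as np

def quantize_int8(x):
    """Symmetric per-tensor int8 quantization: x ~= scale * q."""
    scale = float(np.max(np.abs(x))) / 127.0
    if scale == 0.0:
        scale = 1.0  # all-zero tensor: any scale works
    q = np.clip(np.round(x / scale), -127, 127).astype(np.int8)
    return q, scale

def allreduce_int8(tensors):
    """Sum-All-Reduce over int8 payloads, as an in-network reducer would see
    them: int32 accumulation per contribution, rescaled back to float."""
    parts = [quantize_int8(t) for t in tensors]
    return sum(q.astype(np.int32) * s for q, s in parts)
```

The quantization error per rank is bounded by half the scale, so for small rank counts the reduced result stays close to the exact float sum — which is why the scheme pays off most on the small-message reductions the paper targets.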
-
The xPU-athalon: Quantifying the Competition of AI Acceleration
Quantitative benchmarks across recent AI accelerators reveal that optimal hardware choice varies with workload parameters and that several platforms incur substantially higher idle power than GPUs.
-
Watt Counts: Energy-Aware Benchmark for Sustainable LLM Inference on Heterogeneous GPU Architectures
Watt Counts supplies over 5,000 energy measurements across 50 LLMs and 10 GPUs and shows that hardware-aware selection can reduce server-scenario energy use by up to 70 percent with little effect on user experience.
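Hardware-aware selection of the kind the paper measures reduces, in its simplest form, to filtering by the latency SLO and minimizing energy per request. The measurements below are invented placeholders, not Watt Counts data (which spans 50 LLMs and 10 GPUs):

```python
# Hypothetical per-request measurements: gpu -> (joules_per_request, p99_latency_s)
measurements = {
    "gpu_a": (120.0, 0.8),
    "gpu_b": (40.0, 1.9),
    "gpu_c": (55.0, 1.1),
}

def pick_gpu(measurements, latency_slo_s):
    """Choose the lowest-energy GPU among those that meet the latency SLO."""
    ok = {g: (e, l) for g, (e, l) in measurements.items() if l <= latency_slo_s}
    if not ok:
        return None  # no hardware meets the SLO
    return min(ok, key=lambda g: ok[g][0])
```

With a 1.5 s SLO this picks `gpu_c` over `gpu_a` (55 J vs 120 J per request, roughly a 54% saving) while still meeting latency — the shape of trade-off behind the paper's up-to-70-percent figure.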
-
Silicon Showdown: Performance, Efficiency, and Ecosystem Barriers in Consumer-Grade LLM Inference
Nvidia achieves 1.6x throughput with NVFP4 but hits a VRAM wall for 70B+ models, while Apple UMA enables linear scaling to 80B at 4-bit with up to 23x better energy efficiency.
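The VRAM-wall claim follows from simple weight-footprint arithmetic (weights only; KV cache and activations add more on top):

```python
def weight_gb(params_b, bits):
    """Approximate weight memory in GB for params_b billion parameters
    stored at the given bit width."""
    return params_b * 1e9 * bits / 8 / 1e9

# A 70B model at 4-bit needs ~35 GB for weights alone, beyond the 24-32 GB of
# consumer Nvidia cards, while an 80B model at 4-bit (~40 GB) can still fit
# in a large unified-memory Apple configuration.
```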