Revisiting SLO and Goodput Metrics in LLM Serving
3 Pith papers cite this work. Polarity classification is still indexing.
Citation roles: background 1
Citation polarities: background 1
Representative citing papers
CoRoVA compresses repository context into compact vectors for code LLMs, reducing TTFT 20-38% versus uncompressed RAG with only a small projector module.
A survey synthesizing challenges, system architectures, model optimizations, deployment methods, and resource management techniques for large language model inference at the network edge.
Citing papers
- SLO-Guard: Crash-Aware, Budget-Consistent Autotuning for SLO-Constrained LLM Serving
  SLO-Guard improves tuning budget consistency for SLO-constrained LLM serving by handling crashes explicitly and using a two-phase feasible-first exploration plus exploitation strategy.
- CoRoVA: Compressed Representations for Vector-Augmented Code Completion
  CoRoVA compresses repository context into compact vectors for code LLMs, reducing TTFT 20-38% versus uncompressed RAG with only a small projector module.
- Network Edge Inference for Large Language Models: Principles, Techniques, and Opportunities
  A survey synthesizing challenges, system architectures, model optimizations, deployment methods, and resource management techniques for large language model inference at the network edge.