Mooncake: Trading more storage for less computation — a KVCache-centric architecture for serving LLM chatbot
Cited by 1 Pith paper (field: cs.DC; year: 2025; verdict: unverdicted):
WarmServe: Enabling One-for-Many GPU Prewarming for Multi-LLM Serving
WarmServe reduces tail TTFT by up to 50.8× versus autoscaling and supports 2.5× higher throughput than GPU-sharing by using one-for-many prewarming, model placement, KV cache reservation, and efficient tensor switching.