AdaServe: Accelerating Multi-SLO LLM Serving with SLO-Customized Speculative Decoding
4 Pith papers cite this work. Polarity classification is still indexing.
fields: cs.DC (4)
years: 2026 (4)
verdicts: UNVERDICTED (4)
4 representative citing papers
Citing papers explorer
- Regulating Branch Parallelism in LLM Serving
  TAPER regulates LLM branch parallelism by admitting extra branches opportunistically when the predicted externality fits within the available slack, delivering 1.48-1.77x higher goodput than eager or fixed-cap baselines on Qwen3-32B while keeping SLO attainment above 95% (see the illustrative sketch after this list).
- Coral: Cost-Efficient Multi-LLM Serving over Heterogeneous Cloud GPUs
  Coral cuts multi-LLM serving costs by up to 2.79x and raises goodput by up to 2.39x on heterogeneous GPUs through adaptive joint optimization and a lossless two-stage decomposition that solves quickly.
- FASER: Fine-Grained Phase Management for Speculative Decoding in Dynamic LLM Serving
  FASER delivers up to 53% higher throughput and 1.92x lower latency in dynamic LLM serving by adjusting speculative lengths per request, pruning rejects early, and overlapping draft and verification phases via frontiers.
- Cloud-native and Distributed Systems for Efficient and Scalable Large Language Models -- A Research Agenda
  This research agenda argues that cloud-native architectures, microservices, autoscaling, and emerging trends like serverless inference and federated learning are required to make large language models efficient and scalable.
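As a rough illustration of the slack-based admission rule named in the TAPER summary above, the sketch below admits an extra branch only when its predicted externality fits within the request's remaining SLO slack. All names here (BranchRequest, admit_extra_branch, the latency fields, and the safety margin) are assumptions made for illustration, not the paper's actual interface or model.

```python
# Illustrative sketch only: a slack-based admission check of the kind the
# TAPER summary describes. Field names and the latency model are assumed.
from dataclasses import dataclass


@dataclass
class BranchRequest:
    deadline_ms: float               # SLO deadline for the parent request
    expected_finish_ms: float        # predicted finish time without the extra branch
    predicted_externality_ms: float  # predicted slowdown the extra branch adds


def admit_extra_branch(req: BranchRequest, safety_margin_ms: float = 5.0) -> bool:
    """Admit an opportunistic branch only if its predicted externality fits the slack."""
    slack_ms = req.deadline_ms - req.expected_finish_ms
    return req.predicted_externality_ms + safety_margin_ms <= slack_ms


if __name__ == "__main__":
    req = BranchRequest(deadline_ms=400.0, expected_finish_ms=310.0,
                        predicted_externality_ms=60.0)
    print(admit_extra_branch(req))  # True: 60 + 5 ms fits within 90 ms of slack
```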