DuetServe: Harmonizing Prefill and Decode for LLM Serving via Adaptive GPU Multiplexing

2025 · cs.LG · arXiv 2511.04791

1 Pith paper cites this work. Polarity classification is still being indexed.


abstract

Modern LLM serving systems must sustain high throughput while meeting strict latency SLOs across two distinct inference phases: compute-intensive prefill and memory-bound decode phases. Existing approaches either (1) aggregate both phases on shared GPUs, leading to interference between prefill and decode phases, which degrades Time-Between-Tokens (TBT); or (2) disaggregate the two phases across GPUs, improving latency but wasting resources through duplicated models and KV cache transfers. We present DuetServe, a unified LLM serving framework that achieves disaggregation-level isolation within a single GPU. DuetServe operates in aggregated mode by default and dynamically activates SM-level GPU spatial multiplexing when TBT degradation is predicted. Its key idea is to decouple prefill and decode execution only when needed through fine-grained, adaptive SM partitioning that provides phase isolation only when contention threatens latency service level objectives. DuetServe integrates (1) an attention-aware roofline model to forecast iteration latency, (2) a partitioning optimizer that selects the optimal SM split to maximize throughput under TBT constraints, and (3) an interruption-free execution engine that eliminates CPU-GPU synchronization overhead. Evaluations show that DuetServe improves total throughput by up to 1.3x while maintaining low generation latency compared to state-of-the-art frameworks.
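The abstract names a roofline latency forecast and an SM-split optimizer but gives no formulas, so the following is only a minimal sketch of how those two pieces could fit together. It uses the textbook roofline bound (latency is the larger of compute time and memory time), not the paper's attention-aware model. The names `Phase`, `roofline_latency`, and `choose_split` are hypothetical, and the linear scaling of both compute and bandwidth with SM share is a simplifying assumption made here for illustration.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass
class Phase:
    """Work in one iteration of a phase (hypothetical container, not from the paper)."""
    flops: float        # floating-point operations per iteration
    bytes_moved: float  # bytes read from or written to GPU memory per iteration


def roofline_latency(phase: Phase, sm_fraction: float,
                     peak_flops: float, peak_bw: float) -> float:
    """Textbook roofline estimate: the slower of compute time and memory time.

    Assumption: both peak compute and achievable bandwidth scale linearly
    with the fraction of SMs assigned to the phase.
    """
    compute_time = phase.flops / (peak_flops * sm_fraction)
    memory_time = phase.bytes_moved / (peak_bw * sm_fraction)
    return max(compute_time, memory_time)


def choose_split(prefill: Phase, decode: Phase, total_sms: int, tbt_slo: float,
                 peak_flops: float, peak_bw: float) -> Optional[Tuple[int, float]]:
    """Brute-force search over SM splits.

    Returns (decode_sms, prefill_latency) for the split with the lowest
    predicted prefill latency whose predicted decode latency meets the
    TBT target, or None if no split is feasible.
    """
    best = None
    for decode_sms in range(1, total_sms):
        decode_frac = decode_sms / total_sms
        prefill_frac = 1.0 - decode_frac
        if roofline_latency(decode, decode_frac, peak_flops, peak_bw) > tbt_slo:
            continue  # this split would violate the TBT target
        prefill_lat = roofline_latency(prefill, prefill_frac, peak_flops, peak_bw)
        if best is None or prefill_lat < best[1]:
            best = (decode_sms, prefill_lat)
    return best


if __name__ == "__main__":
    # Illustrative numbers only; they do not come from the paper.
    prefill = Phase(flops=1e12, bytes_moved=1e9)   # compute-heavy
    decode = Phase(flops=1e9, bytes_moved=1e10)    # memory-heavy
    print(choose_split(prefill, decode, total_sms=10, tbt_slo=0.03,
                       peak_flops=1e14, peak_bw=1e12))
```

Minimizing predicted prefill latency stands in here for "maximize throughput"; a real optimizer would presumably score each split with a richer throughput model.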

representative citing papers

FlexNPU: Transparent NPU Virtualization for Dynamic LLM Prefill-Decode Co-location

cs.DC · 2026-06-03 · unverdicted · novelty 5.0

FlexNPU is a transparent virtualization system for Ascend NPUs that supports dynamic prefill-decode co-location in LLM serving and reports throughput gains plus large TTFT reductions versus static baselines.

citing papers explorer

Showing 1 of 1 citing paper.

FlexNPU: Transparent NPU Virtualization for Dynamic LLM Prefill-Decode Co-location

cs.DC · 2026-06-03 · unverdicted · none · ref 13 · internal anchor
