Nexus: Proactive Intra-GPU Disaggregation of Prefill and Decode in LLM Serving
2 Pith papers cite this work. Polarity classification is still indexing.
Pith papers citing it: 2
Fields: cs.DC (2)
Years: 2026 (2)
Verdicts: UNVERDICTED (2)
Representative citing papers
FlexNPU is a transparent virtualization system for Ascend NPUs that supports dynamic prefill-decode co-location in LLM serving and reports throughput gains plus large TTFT reductions versus static baselines.
citing papers explorer
- Towards Load-Aware Prefill Deflection for Disaggregated LLM Serving
A load-aware prefill deflection scheduler for disaggregated LLM serving reduces P95 TTFT by up to 81% by interleaving chunked prefill on decode nodes and eliminating KV-cache transfers.
- FlexNPU: Transparent NPU Virtualization for Dynamic LLM Prefill-Decode Co-location
FlexNPU is a transparent virtualization system for Ascend NPUs that supports dynamic prefill-decode co-location in LLM serving and reports throughput gains plus large TTFT reductions versus static baselines.