DistServe: Disaggregating Prefill and Decoding for Goodput-Optimized Large Language Model Serving
8 papers indexed by Pith cite this work.
Citing papers
-
The Illusion of Power Capping in LLM Decode: A Phase-Aware Energy Characterisation Across Attention Architectures
Power capping is illusory during LLM decode: the memory-bound phase never consumes the power headroom of 700 W GPUs, so caps rarely engage, whereas SM clock locking saves up to 32% energy, and three distinct DVFS behaviour classes emerge across attention architectures.
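To make the favoured knob concrete: SM clock locking can be driven through NVML. A minimal sketch using the pynvml bindings, where the device index and the 1980 MHz target are illustrative values, not settings reported by the paper:

    # Minimal sketch: lock SM clocks rather than setting a power cap.
    # Device index and clock value are illustrative; locking usually needs admin rights.
    import pynvml

    pynvml.nvmlInit()
    handle = pynvml.nvmlDeviceGetHandleByIndex(0)  # first GPU (assumption)

    # Pin the GPU clock to a fixed frequency by setting min == max (MHz).
    pynvml.nvmlDeviceSetGpuLockedClocks(handle, 1980, 1980)

    # ... run the decode workload here ...

    # Restore default clock behaviour and shut down NVML.
    pynvml.nvmlDeviceResetGpuLockedClocks(handle)
    pynvml.nvmlShutdown()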
-
Dooly: Configuration-Agnostic, Redundancy-Aware Profiling for LLM Inference Simulation
Dooly reduces LLM inference profiling costs by 56.4% via configuration-agnostic taint-based labeling and selective database reuse, delivering simulation accuracy within 5% MAPE for time to first token (TTFT) and 8% for time per output token (TPOT) across 12 models.
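MAPE here is mean absolute percentage error between simulated and measured latencies. A small worked sketch with made-up TTFT numbers (not Dooly's data):

    # Worked sketch of MAPE between measured and simulated TTFT; numbers are invented.
    measured_ttft_ms  = [120.0, 340.0, 95.0, 410.0]
    simulated_ttft_ms = [126.0, 325.0, 99.0, 395.0]

    # MAPE = mean(|measured - simulated| / measured) * 100
    errors = [abs(m - s) / m for m, s in zip(measured_ttft_ms, simulated_ttft_ms)]
    mape = 100 * sum(errors) / len(errors)
    print(f"TTFT MAPE: {mape:.1f}%")  # ~4.3%, i.e. within a 5% target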
-
When Is the Same Model Not the Same Service? A Measurement Study of Hosted Open-Weight LLM APIs
Hosted open-weight LLM APIs function as time-varying heterogeneous services rather than fixed model artifacts, with concentrated demand, supply-use mismatches, and task-specific routing that yields major cost and throughput gains.
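One way to picture the routing finding: choose a hosted endpoint per task class from measured cost and throughput rather than by model name alone. A hedged sketch in which the provider names, prices, and speeds are invented placeholders, not measurements from the paper:

    # Hypothetical per-provider stats for the same open-weight model (placeholders).
    providers = {
        "provider_a": {"usd_per_mtok": 0.90, "tok_per_s": 38.0},
        "provider_b": {"usd_per_mtok": 0.55, "tok_per_s": 22.0},
        "provider_c": {"usd_per_mtok": 1.40, "tok_per_s": 71.0},
    }

    def route(task: str) -> str:
        """Latency-sensitive tasks chase throughput; batch tasks chase price."""
        if task == "interactive":
            return max(providers, key=lambda p: providers[p]["tok_per_s"])
        return min(providers, key=lambda p: providers[p]["usd_per_mtok"])

    print(route("interactive"))  # provider_c (fastest)
    print(route("batch"))        # provider_b (cheapest)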
-
MLCommons Chakra: Advancing Performance Benchmarking and Co-design using Standardized Execution Traces
Chakra introduces a portable, interoperable graph-based execution trace format for distributed ML workloads along with supporting tools to standardize performance benchmarking and software-hardware co-design.
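The actual Chakra execution trace is a protobuf schema; purely as a mental model, each graph node carries an id, a type, its data dependencies, and timing metadata. A simplified stand-in whose field names are illustrative, not the real schema:

    # Simplified stand-in for a Chakra-style execution-trace node (illustrative fields).
    from dataclasses import dataclass, field

    @dataclass
    class TraceNode:
        node_id: int
        name: str                   # operator or collective name
        node_type: str              # e.g. "COMP" or "COMM_COLL"
        data_deps: list[int] = field(default_factory=list)  # ids this node waits on
        duration_us: float = 0.0    # measured or simulated runtime

    # Two-node fragment: an all-reduce that depends on a matmul.
    trace = [
        TraceNode(0, "matmul_qkv", "COMP", [], 420.0),
        TraceNode(1, "allreduce_grad", "COMM_COLL", [0], 310.0),
    ]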
-
Tackling the Data-Parallel Load Balancing Bottleneck in LLM Serving: Practical Online Routing at Scale
BalanceRoute uses a piecewise-linear F-score (with optional short lookahead) for sticky request routing in LLM serving, reducing data-parallel (DP) load imbalance and raising end-to-end throughput over vLLM baselines on production and Azure traces.
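To make "piecewise-linear score for sticky routing" concrete, here is an invented stand-in (not BalanceRoute's actual F-score): a replica's score grows linearly with load but with a steeper slope past a knee, and a session stays on its replica unless its score clearly exceeds the best alternative.

    # Invented piecewise-linear load score; knee and slopes are illustrative.
    def score(load: float, knee: float = 0.7, slope_lo: float = 1.0, slope_hi: float = 4.0) -> float:
        """Linear below the knee, steeper above it, penalising hot replicas."""
        if load <= knee:
            return slope_lo * load
        return slope_lo * knee + slope_hi * (load - knee)

    def route(session_replica: int, loads: list[float]) -> int:
        """Stay sticky unless the current replica scores worse than the best option."""
        best = min(range(len(loads)), key=lambda r: score(loads[r]))
        if score(loads[session_replica]) <= score(loads[best]) + 0.1:  # hysteresis margin
            return session_replica
        return best

    print(route(0, [0.9, 0.4, 0.5]))  # 1: replica 0 is past the knee, so the session moves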
-
A Queueing-Theoretic Framework for Stability Analysis of LLM Inference with KV Cache Memory Constraints
A queueing model derives stability conditions for LLM inference services under combined compute and KV cache memory limits, with experimental validation showing typical deviations under 10%.
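Schematically, and in our notation rather than necessarily the paper's: let \lambda be the request arrival rate, \mu_c the compute-limited completion rate, M the KV cache capacity, \mathbb{E}[S] the mean per-request KV footprint, and \mathbb{E}[T] the mean residence time. Little's law puts the average occupied KV memory at \lambda \, \mathbb{E}[T] \, \mathbb{E}[S], so a stability condition of the advertised form reads:

    \lambda < \mu_c \quad \text{and} \quad \lambda \, \mathbb{E}[T] \, \mathbb{E}[S] < M

i.e. the service must be stable with respect to both the compute budget and the KV memory budget, and whichever binds first sets the admissible load.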
-
MemExplorer: Navigating the Heterogeneous Memory Design Space for Agentic Inference NPUs
MemExplorer optimizes heterogeneous memory systems for agentic LLM inference on NPUs and reports up to 2.3x higher energy efficiency than baselines under fixed power budgets.
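As a cartoon of design-space exploration under a power budget (the configurations, numbers, and efficiency metric below are all invented, not MemExplorer's):

    # Cartoon of exhaustive search over memory configs under a power budget.
    # Power and throughput figures are invented placeholders.
    configs = [
        {"name": "hbm_only",    "power_w": 42.0, "tok_per_s": 900.0},
        {"name": "hbm_plus_lp", "power_w": 30.0, "tok_per_s": 760.0},
        {"name": "lpddr_only",  "power_w": 18.0, "tok_per_s": 410.0},
    ]
    POWER_BUDGET_W = 35.0

    # Energy efficiency as tokens per joule; pick the best config within budget.
    feasible = [c for c in configs if c["power_w"] <= POWER_BUDGET_W]
    best = max(feasible, key=lambda c: c["tok_per_s"] / c["power_w"])
    print(best["name"], best["tok_per_s"] / best["power_w"])  # hbm_plus_lp, ~25.3 tok/J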
-
HybridFlow: A Flexible and Efficient RLHF Framework
HybridFlow combines single- and multi-controller paradigms with a 3D-HybridEngine to deliver 1.53x to 20.57x higher throughput for various RLHF algorithms compared to prior systems.