NetKV: Network-Aware Decode Instance Selection for Disaggregated LLM Inference

Mubarak Adetunji Ojewale

arXiv: 2606.03910 · v1 · pith:3EPNADO3new · submitted 2026-06-02 · 💻 cs.PF · cs.AI · cs.DC · cs.NI

Pith reviewed 2026-06-28 07:16 UTC · model grok-4.3

classification 💻 cs.PF · cs.AI · cs.DC · cs.NI

keywords disaggregated LLM inference · KV cache transfer · network-aware scheduling · time to first token · decode instance selection · cache locality · SLO attainment · fat-tree network

The pith

Ignoring network distance and congestion in KV cache transfers makes cache-aware LLM schedulers arbitrarily suboptimal as context length grows.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Disaggregated LLM inference splits prefill and decode across machines, so the KV cache must cross the datacenter network before decoding can begin and this transfer time adds directly to TTFT. Current schedulers pick decode instances using only compute load and prefix-cache hits, without regard to topological distance or current congestion. The paper supplies a thin network cost oracle and proves that dropping the network term causes scheduling performance to degrade without bound for longer contexts. NetKV is a simple per-request greedy selector that consumes the oracle and produces provably stable tier rankings even with stale data. In simulations it cuts mean TTFT by up to 21 percent versus round-robin and raises SLO attainment by up to 20 percentage points while adding negligible overhead to time between tokens.

Core claim

We prove that ignoring the network term renders cache-aware-only scheduling arbitrarily suboptimal as context length grows. NetKV, the O(|D|) per-request greedy that consumes this oracle, has tier rankings that are provably robust to stale telemetry. On a 64-GPU four-tier fat-tree simulator driven by Mooncake traces, NetKV reduces mean TTFT by up to 21.2% over round-robin and 17.6% over a tuned cache+load-aware scheduler, lifts SLO attainment by up to 20.1 percentage points, and keeps the Time Between Tokens overhead below 0.5 ms in every condition tested, with no changes to the transport, inference engine, or hardware.

What carries the argument

The network cost oracle: a thin operator-to-scheduler interface that reports transfer costs between prefill and decode instances based on topology and congestion.
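As a rough illustration of how a scheduler might consume such an oracle, the sketch below scores each candidate decode instance once and keeps the cheapest, which is what makes a per-request selection O(|D|). The `NetworkCostOracle` protocol, the `DecodeInstance` fields, the toy `TableOracle`, and the additive score (queue delay plus transfer time for the bytes not already cached) are all assumptions made for this example; the paper's actual cost model is not reproduced on this page.

```python
from dataclasses import dataclass
from typing import Protocol, Sequence


class NetworkCostOracle(Protocol):
    """Hypothetical interface: estimated seconds to move KV bytes between instances."""

    def transfer_seconds(self, prefill_id: str, decode_id: str, kv_bytes: int) -> float:
        ...


@dataclass(frozen=True)
class DecodeInstance:
    instance_id: str
    queue_delay_s: float      # assumed load signal: expected wait before decoding starts
    cached_prefix_bytes: int  # assumed cache signal: KV bytes already resident on the instance


class TableOracle:
    """Toy oracle backed by a static bandwidth table (bytes per second per pair)."""

    def __init__(self, bandwidth: dict[tuple[str, str], float]) -> None:
        self._bandwidth = bandwidth

    def transfer_seconds(self, prefill_id: str, decode_id: str, kv_bytes: int) -> float:
        return kv_bytes / self._bandwidth[(prefill_id, decode_id)]


def select_decode_instance(
    prefill_id: str,
    kv_bytes: int,
    candidates: Sequence[DecodeInstance],
    oracle: NetworkCostOracle,
) -> DecodeInstance:
    """Single pass over the candidates: one oracle query per decode instance."""
    best, best_score = None, float("inf")
    for d in candidates:
        # Assumption: only bytes not already cached on d must cross the network.
        to_send = max(kv_bytes - d.cached_prefix_bytes, 0)
        score = d.queue_delay_s + oracle.transfer_seconds(prefill_id, d.instance_id, to_send)
        if score < best_score:
            best, best_score = d, score
    if best is None:
        raise ValueError("no decode instances to choose from")
    return best
```

With a distant instance that holds a cached prefix and a nearby instance that holds none, this toy score prefers the cached instance for short contexts and the nearby one once the uncached remainder is large, which is the trade-off the page describes.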

If this is right

Cache-aware-only scheduling becomes arbitrarily suboptimal as context length increases.
NetKV reduces mean TTFT by up to 21.2% versus round-robin selection.
NetKV reduces mean TTFT by up to 17.6% versus a tuned cache-plus-load scheduler.
NetKV raises SLO attainment by up to 20.1 percentage points.
Time-between-tokens overhead stays below 0.5 ms without any changes to transport or engines.
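The first point can be made concrete with back-of-the-envelope arithmetic. The sketch below is not the paper's proof; it only shows that, if KV cache size grows linearly with context length, the extra transfer time paid for choosing a slower path also grows linearly and so has no fixed upper bound. The model shape (80 layers, 8 KV heads, head dimension 128, 2-byte values) and the two bandwidth figures are hypothetical values chosen for illustration.

```python
def kv_bytes_per_token(layers: int, kv_heads: int, head_dim: int, dtype_bytes: int) -> int:
    """KV cache bytes per token; the factor 2 covers keys and values."""
    return 2 * layers * kv_heads * head_dim * dtype_bytes


def transfer_seconds(context_tokens: int, bytes_per_token: int, bandwidth_bytes_per_s: float) -> float:
    """Idealized transfer time: payload size divided by effective bandwidth."""
    return context_tokens * bytes_per_token / bandwidth_bytes_per_s


# Hypothetical model shape and path bandwidths (assumptions, not from the paper).
PER_TOKEN = kv_bytes_per_token(layers=80, kv_heads=8, head_dim=128, dtype_bytes=2)  # 327,680 bytes
NEAR_BW = 50e9    # bytes/s to a topologically close decode instance
FAR_BW = 12.5e9   # bytes/s to a distant or congested decode instance


def far_path_penalty_s(context_tokens: int) -> float:
    """Extra seconds added to TTFT by sending the KV cache over the slower path."""
    return (transfer_seconds(context_tokens, PER_TOKEN, FAR_BW)
            - transfer_seconds(context_tokens, PER_TOKEN, NEAR_BW))


if __name__ == "__main__":
    for tokens in (8_192, 32_768, 131_072):
        print(f"{tokens:>7} tokens: +{far_path_penalty_s(tokens):.3f} s on the far path")
```

Under these assumptions the penalty doubles whenever the context length doubles, so any scheduler whose score has no network term can be made to lose by an arbitrarily large margin simply by lengthening the context.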

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same oracle interface could be reused for other disaggregated stages such as embedding or retrieval.
Production systems might combine the oracle with existing network monitoring stacks to keep data fresh.
Different fat-tree or Clos topologies could be tested to check whether the robustness to stale data holds.
The approach hints that network cost should be considered when deciding batch sizes or prefill-decode pairings.

Load-bearing premise

The network cost oracle can be realized with low overhead and sufficiently fresh data, and the four-tier fat-tree simulator driven by Mooncake traces faithfully represents production network behavior and workload patterns.
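To make "topological distance" in a tiered fat-tree concrete, one simple encoding is to give every GPU a location tuple and take the tier of a pair to be the highest level at which the two locations differ. The four levels used here (same node, same rack, same pod, cross-pod) and the `Location` fields are assumptions for illustration; this page does not state how the paper defines its four tiers.

```python
from typing import NamedTuple


class Location(NamedTuple):
    """Hypothetical placement of a GPU in a fat-tree: pod, rack within pod, node within rack."""
    pod: int
    rack: int
    node: int


def tier(a: Location, b: Location) -> int:
    """Return 0..3, where a larger value means the transfer climbs higher in the tree."""
    if a.pod != b.pod:
        return 3  # crosses the core layer
    if a.rack != b.rack:
        return 2  # crosses the aggregation layer within one pod
    if a.node != b.node:
        return 1  # crosses the top-of-rack switch
    return 0      # stays inside one node


prefill = Location(pod=0, rack=1, node=2)
candidates = {
    "same-node": Location(0, 1, 2),
    "same-rack": Location(0, 1, 3),
    "same-pod": Location(0, 0, 0),
    "cross-pod": Location(1, 0, 0),
}
# Rank candidate decode placements by static distance from the prefill instance.
ranking = sorted(candidates, key=lambda name: tier(prefill, candidates[name]))
```

A static ranking like this captures topology only; the congestion part of the oracle would have to come from live telemetry.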

What would settle it

Measure actual TTFT and oracle overhead when NetKV runs on real production hardware with live network telemetry instead of the simulator.

Figures

Figures reproduced from arXiv: 2606.03910 by Mubarak Adetunji Ojewale.

**Figure 2.** Oracle staleness sweep: TTFT, TBT, and SLO are invariant from 100 ms to 60 s refresh intervals.

**Figure 3.** Prefix-sharing sweep on the RAG workload: NetKV-Full preserves a roughly constant TTFT advantage over CA and ...

**Figure 4.** Ablation ladder: mean TTFT for CLA*, NetKV-Topo-Only, NetKV-Static, and NetKV-Full across the chatbot, RAG, ...

**Figure 5.** Scalability: mean TTFT, mean TBT, SLO attainment, and scheduler decision latency from 64 to 1024 GPUs. The ...

Original abstract

Disaggregated LLM inference forces the KV cache to traverse the datacenter network before decoding begins, so transfer time enters directly into the Time to First Token (TTFT) budget. Current schedulers route on compute load and prefix-cache locality alone, ignoring the topological distance and dynamic congestion between prefill and decode instances. We close this gap with a thin operator-to-scheduler interface, the network cost oracle, and we prove that ignoring the network term renders cache-aware-only scheduling arbitrarily suboptimal as context length grows. NetKV, the O(|D|) per-request greedy that consumes this oracle, has tier rankings that are provably robust to stale telemetry. On a 64-GPU four-tier fat-tree simulator driven by Mooncake traces, NetKV reduces mean TTFT by up to 21.2% over round-robin and 17.6% over a tuned cache+load-aware scheduler, lifts SLO attainment by up to 20.1 percentage points, and keeps the Time Between Tokens overhead below 0.5 ms in every condition tested, with no changes to the transport, inference engine, or hardware.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

NetKV adds a network cost oracle and greedy selector to decode scheduling, with a suboptimality proof and simulation gains, but all numbers rest on an unvalidated fat-tree simulator.

The paper's main addition is the network cost oracle that lets a scheduler pick decode instances while accounting for topological distance and congestion on KV cache transfers. It also includes a proof that pure cache-aware scheduling can be arbitrarily suboptimal as context length increases, plus an O(|D|) greedy that stays robust to stale data.

The work does a solid job naming a real latency component that current disaggregated systems ignore. The simulation on a 64-GPU four-tier fat-tree driven by Mooncake traces reports clear TTFT reductions (up to 21.2% vs round-robin, 17.6% vs tuned cache+load) and SLO lifts, with negligible added overhead.
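The two headline metrics use different units: TTFT gains are relative reductions in a mean (percent), while SLO gains are absolute differences between two attainment rates (percentage points). The sketch below shows one way to compute both from per-request TTFT samples. Treating the SLO as a single per-request TTFT threshold is an assumption for this example; the page does not give the paper's exact SLO definition.

```python
from statistics import fmean
from typing import Sequence


def mean_ttft_reduction_pct(baseline_s: Sequence[float], candidate_s: Sequence[float]) -> float:
    """Relative reduction in mean TTFT, in percent of the baseline mean."""
    base, cand = fmean(baseline_s), fmean(candidate_s)
    return 100.0 * (base - cand) / base


def slo_attainment_pct(ttft_s: Sequence[float], slo_s: float) -> float:
    """Share of requests whose TTFT meets the threshold, in percent."""
    return 100.0 * sum(t <= slo_s for t in ttft_s) / len(ttft_s)


def slo_gain_pp(baseline_s: Sequence[float], candidate_s: Sequence[float], slo_s: float) -> float:
    """Absolute difference between two attainment rates, in percentage points."""
    return slo_attainment_pct(candidate_s, slo_s) - slo_attainment_pct(baseline_s, slo_s)


# Made-up samples, only to exercise the functions.
baseline = [1.0, 2.0, 3.0, 4.0]
candidate = [0.5, 1.5, 2.5, 3.5]
reduction = mean_ttft_reduction_pct(baseline, candidate)  # relative, percent
gain = slo_gain_pp(baseline, candidate, slo_s=2.5)        # absolute, percentage points
```

Keeping the two units separate matters when comparing schedulers: a 20 percentage point SLO lift from a 50% baseline is a 40% relative improvement, not 20%.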

The soft spot is that every performance claim comes from this single simulator, with no cross-check against production fat-tree behavior, ECMP effects, or actual trace burstiness. The oracle itself is described only at a high level, with no implementation cost or freshness data, and the proof lives inside the cost model rather than addressing external validity. If the simulator diverges from real networks, the reported advantages may shrink or even reverse.

This is aimed at people building or tuning large-scale LLM inference clusters who already deal with disaggregation. A reader working on scheduling or network-aware systems would get value from the oracle idea and the suboptimality argument, provided the full paper supplies the missing derivation steps and any oracle details.

It deserves a serious referee to check whether the simulation setup and proof hold up once the full text and any artifacts are examined.

Referee Report

3 major / 1 minor

Summary. The paper introduces NetKV, a scheduler for decode instance selection in disaggregated LLM inference that incorporates a network cost oracle to account for topological distance and congestion between prefill and decode instances. It asserts a proof that cache-aware-only scheduling is arbitrarily suboptimal as context length grows, and reports simulation results on a 64-GPU four-tier fat-tree topology driven by Mooncake traces showing up to 21.2% mean TTFT reduction versus round-robin, 17.6% versus a tuned cache+load scheduler, and up to 20.1 percentage point gains in SLO attainment, with TBT overhead below 0.5 ms.

Significance. If the simulation results and oracle assumptions hold under real workloads, NetKV could improve TTFT and SLO compliance in production disaggregated LLM serving without transport or hardware changes. The claimed robustness of tier rankings to stale telemetry and the O(|D|) greedy algorithm are practical strengths if the cost model is realizable with low overhead.

major comments (3)

[Abstract] Abstract: the suboptimality proof is asserted without any derivation steps, cost-model equations, or formal statement of the network term, which is load-bearing for the central theoretical claim that cache-aware-only scheduling becomes arbitrarily suboptimal.
[Abstract] Abstract: all headline TTFT (21.2%) and SLO (20.1 pp) gains are obtained exclusively from a 64-GPU four-tier fat-tree simulator driven by Mooncake traces; no cross-validation against production fat-tree measurements, ECMP behavior, or trace burstiness is reported, undermining external validity of the empirical claims.
[Abstract] Abstract: the network cost oracle is introduced as a thin interface but no implementation overhead, freshness requirements, or accuracy bounds are quantified, which is load-bearing for the claim that NetKV can be deployed without changes to the inference engine.

minor comments (1)

[Abstract] Abstract: experimental configuration details (e.g., context length distribution, request arrival pattern, exact definition of the tuned cache+load baseline) and any statistical significance tests are absent, making it impossible to reproduce or assess the reported gains.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive comments. Below we respond point-by-point to the major comments and indicate the revisions we will make.

Referee: [Abstract] Abstract: the suboptimality proof is asserted without any derivation steps, cost-model equations, or formal statement of the network term, which is load-bearing for the central theoretical claim that cache-aware-only scheduling becomes arbitrarily suboptimal.

Authors: The complete proof, cost-model equations, and formal definition of the network term appear in Section 3. We agree the abstract should be self-contained on this central claim and will revise it to include a concise statement of the cost function C(p,d) = α·cache_hit(p,d) + η·network_cost(p,d) together with a one-sentence outline of the suboptimality argument. revision: yes
Referee: [Abstract] Abstract: all headline TTFT (21.2%) and SLO (20.1 pp) gains are obtained exclusively from a 64-GPU four-tier fat-tree simulator driven by Mooncake traces; no cross-validation against production fat-tree measurements, ECMP behavior, or trace burstiness is reported, undermining external validity of the empirical claims.

Authors: All reported numbers are indeed from the 64-GPU fat-tree simulator driven by Mooncake traces; the simulator does model ECMP and trace burstiness. We lack production fat-tree measurements for direct cross-validation. We will add an explicit Limitations subsection that states the simulation assumptions and the absence of real-cluster validation, while retaining the simulation results as the primary evidence. revision: partial
Referee: [Abstract] Abstract: the network cost oracle is introduced as a thin interface but no implementation overhead, freshness requirements, or accuracy bounds are quantified, which is load-bearing for the claim that NetKV can be deployed without changes to the inference engine.

Authors: We will expand the oracle description with (i) an estimated per-request query cost based on standard switch telemetry, (ii) the already-proven robustness of tier rankings to staleness, and (iii) a simple accuracy bound derived from typical monitoring error rates. These additions will be placed in Section 4 and referenced from the abstract. revision: yes
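One way to think about the freshness question raised here is an oracle that answers every query from a cached telemetry snapshot and refreshes that snapshot only after a fixed interval, so per-request query cost is a dictionary lookup. The sketch below assumes a hypothetical `fetch` callable that returns a full cost table and an injectable clock for testing; it does not describe the paper's oracle implementation.

```python
import time
from typing import Callable, Mapping, Tuple

CostTable = Mapping[Tuple[str, str], float]


class SnapshotOracle:
    """Serves costs from a cached snapshot that is at most about `refresh_s` seconds old."""

    def __init__(
        self,
        fetch: Callable[[], CostTable],
        refresh_s: float,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._fetch = fetch          # hypothetical telemetry pull, e.g. from a monitoring stack
        self._refresh_s = refresh_s
        self._clock = clock
        self._snapshot = fetch()
        self._taken_at = clock()

    def cost(self, prefill_id: str, decode_id: str) -> float:
        now = self._clock()
        if now - self._taken_at >= self._refresh_s:
            # Refresh lazily on the query path; a background refresher is another option.
            self._snapshot = self._fetch()
            self._taken_at = now
        return self._snapshot[(prefill_id, decode_id)]
```

Sweeping `refresh_s` in a simulator is the kind of experiment the Figure 2 caption describes, with refresh intervals from 100 ms to 60 s.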

Circularity Check

0 steps flagged

No significant circularity; claims rest on external simulation and model-based proof

The paper presents a mathematical proof that cache-aware scheduling becomes arbitrarily suboptimal when network cost is ignored, plus empirical gains from a 64-GPU four-tier fat-tree simulator driven by Mooncake traces. No equations, fitted parameters, or self-citations are shown that would make the TTFT reductions or SLO lifts equivalent to the inputs by construction. The proof is internal to the cost model but does not tautologically force the reported simulation outcomes; results are presented as simulator outputs rather than self-referential predictions. The derivation chain is therefore self-contained.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 1 invented entities

The central design rests on the existence and low-overhead availability of a network cost oracle whose accuracy is treated as given; the suboptimality proof and simulation results depend on this interface and on the fidelity of the fat-tree model.

axioms (1)

domain assumption: The network cost oracle supplies accurate topological distance and congestion values with acceptable staleness.
Invoked as the input to the O(|D|) greedy ranking and the robustness claim.

invented entities (1)

network cost oracle (no independent evidence)
purpose: Thin operator-to-scheduler interface exposing network distance and congestion
New component introduced to close the gap left by compute-and-cache-only schedulers

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Can I Buy Your KV Cache?
cs.AI 2026-06 unverdicted novelty 6.0

Proposes an agent-native prefill CDN where precomputed KV caches are hosted and sold to agents, delivering 9-50x compute savings with exact token and logit matching on Qwen3-4B.

Reference graph

Works this paper leans on

28 extracted references · 8 canonical work pages · cited by 1 Pith paper

[1]

Distserve: disaggregating prefill and decoding for goodput-optimized large language model serving,

Y. Zhong, S. Liu, J. Chen, J. Hu, Y. Zhu, X. Liu, X. Jin, and H. Zhang, “Distserve: disaggregating prefill and decoding for goodput-optimized large language model serving,” in Proceedings of the 18th USENIX Conference on Operating Systems Design and Implementation, ser. OSDI’24. USA: USENIX Association, 2024

2024
[2]

Splitwise: Efficient generative LLM inference using phase splitting,

P. Patel, E. Choukse, C. Zhang, A. Shah, Í. Goiri, S. Maleki, and R. Bianchini, “Splitwise: Efficient generative LLM inference using phase splitting,” 2024

2024
[3]

Mooncake: A kvcache-centric disaggregated architecture for llm serving

R. Qin, Z. Li, W. He, J. Cui, H. Tang, F. Ren, T. Ma, S. Cai, Y. Zhang, M. Zhang, Y. Wu, W. Zheng, and X. Xu, “Mooncake: A kvcache-centric disaggregated architecture for llm serving.” New York, NY, USA: Association for Computing Machinery, Nov. 2025. [Online]. Available: https://doi.org/10.1145/3773772

work page doi:10.1145/3773772 2025
[4]

Dynamo: A datacenter-scale distributed inference framework,

NVIDIA, “Dynamo: A datacenter-scale distributed inference framework,” Open-source project, https://github.com/ai-dynamo/dynamo, 2025

2025
[5]

llm-d: Kubernetes-native distributed inferencing,

llm-d project, “llm-d: Kubernetes-native distributed inferencing,” CNCF Sandbox, https://llm-d.ai, 2025

2025
[6]

Sglang: efficient execution of structured language model programs,

L. Zheng, L. Yin, Z. Xie, C. Sun, J. Huang, C. H. Yu, S. Cao, C. Kozyrakis, I. Stoica, J. E. Gonzalez, C. Barrett, and Y. Sheng, “Sglang: efficient execution of structured language model programs,” in Proceedings of the 38th International Conference on Neural Information Processing Systems. Red Hook, NY, USA: Curran Associates Inc., 2024

2024
[7]

Efficient memory management for large language model serving with pagedattention,

W. Kwon, Z. Li, S. Zhuang, Y. Sheng, L. Zheng, C. H. Yu, J. Gonzalez, H. Zhang, and I. Stoica, “Efficient memory management for large language model serving with pagedattention,” in Proceedings of the 29th Symposium on Operating Systems Principles, ser. SOSP ’23. New York, NY, USA: Association for Computing Machinery, 2023, p. 611–626. [Online]. Availabl...

work page doi:10.1145/3600006.3613165 2023
[8]

The llama 3 herd of models,

A. Grattafiori, A. Dubey, A. Jauhri, A. Pandey, A. Kadian, A. Al-Dahle, A. Letman, A. Mathur, A. Schelten, A. Vaughan et al., “The llama 3 herd of models,” 2024

2024
[9]

Cloud abstractions for ai workloads,

M. Canini, T. A. Benson, R. Bianchini, Í. Goiri, D. Kostić, P. Pietzuch, and S. Peter, “Cloud abstractions for ai workloads,” in Proceedings of the 16th ACM SIGOPS Asia-Pacific Workshop on Systems, ser. APSys ’25. New York, NY, USA: Association for Computing Machinery, 2025, p. 98–105. [Online]. Available: https://doi.org/10.1145/3725783.3764395

work page doi:10.1145/3725783.3764395 2025
[10]

FlowKV: A disaggregated inference framework with low-latency KV cache transfer and load-aware scheduling,

W. Li, G. Jiang, X. Ding, Z. Tao, C. Hao, C. Xu, Y. Zhang, and H. Wang, “FlowKV: A disaggregated inference framework with low-latency KV cache transfer and load-aware scheduling,” arXiv:2504.03775, 2025

arXiv 2025
[11]

Sarathi-Serve: Taming throughput–latency tradeoff in LLM inference with chunked prefill,

A. Agrawal, N. Kedia, A. Panwar, J. Mohan, N. Kwatra, B. S. Gulavani, A. Tumanov, and R. Ramjee, “Sarathi-Serve: Taming throughput–latency tradeoff in LLM inference with chunked prefill,” in Proceedings of the 18th USENIX Symposium on Operating Systems Design and Implementation (OSDI), 2024

2024
[12]

AlpaServe: Statistical multiplexing with model parallelism for deep learning serving,

Z. Li, L. Zheng, Y. Zhong, V. Liu, Y. Sheng, X. Jin, Y. Huang, Z. Chen, H. Zhang, J. E. Gonzalez, and I. Stoica, “AlpaServe: Statistical multiplexing with model parallelism for deep learning serving,” in Proceedings of the 17th USENIX Symposium on Operating Systems Design and Implementation (OSDI), 2023

2023
[13]

ServerlessLLM: Locality-enhanced serverless inference for large language models,

Y. Fu, L. Xue, Y. Huang, A.-O. Brabete, D. Ustiugov, Y. Patel, and L. Mai, “ServerlessLLM: Locality-enhanced serverless inference for large language models,” in Proceedings of the 18th USENIX Symposium on Operating Systems Design and Implementation (OSDI), 2024

2024
[14]

Llumnix: Dynamic scheduling for large language model serving,

B. Sun, Z. Huang, H. Zhao, W. Xiao, X. Zhang, Y. Li, and W. Lin, “Llumnix: Dynamic scheduling for large language model serving,” in Proceedings of the 18th USENIX Symposium on Operating Systems Design and Implementation (OSDI), 2024

2024
[15]

Cassini: network-aware job scheduling in machine learning clusters,

S. Rajasekaran, M. Ghobadi, and A. Akella, “Cassini: network-aware job scheduling in machine learning clusters,” in Proceedings of the 21st USENIX Symposium on Networked Systems Design and Implementation, ser. NSDI’24. USA: USENIX Association, 2024

2024
[16]

vclos: Network contention aware scheduling for distributed machine learning tasks in multi-tenant gpu clusters,

X. Han, S. Zhao, Y. Lv, P. Cao, W. Jiang, Q. Yang, Y. Liu, S. Lin, B. Jiang, X. Liu, Y. Cui, C. Zhou, and X. Wang, “vclos: Network contention aware scheduling for distributed machine learning tasks in multi-tenant gpu clusters,” Comput. Netw., vol. 268, no. C, Aug. 2025. [Online]. Available: https://doi.org/10.1016/j.comnet.2025.111285

work page doi:10.1016/j.comnet.2025.111285 2025
[17]

TopoOpt: Co-optimizing network topology and parallelization strategy for distributed training jobs,

W. Wang, M. Khazraee, Z. Zhong, M. Ghobadi, Z. Jia, D. Mudigere, Y. Zhang, and A. Kewitsch, “TopoOpt: Co-optimizing network topology and parallelization strategy for distributed training jobs,” in 20th USENIX Symposium on Networked Systems Design and Implementation (NSDI 23), 2023, pp. 739–767

2023
[18]

GORGO: Maximizing KV-cache reuse while minimizing network latency in cross-region LLM load balancing,

“GORGO: Maximizing KV-cache reuse while minimizing network latency in cross-region LLM load balancing,” arXiv:2602.11688, Feb. 2026

arXiv 2026
[19]

Helix: Serving large language models over heterogeneous gpus and network via max-flow,

Y. Mei, Y. Zhuang, X. Miao, J. Yang, Z. Jia, and R. Vinayak, “Helix: Serving large language models over heterogeneous gpus and network via max-flow,” in Proceedings of the 30th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 1, ser. ASPLOS ’25. New York, NY, USA: Association for Computing Mac...

work page doi:10.1145/3669940.3707215 2025
[20]

A scalable, commodity data center network architecture,

M. Al-Fares, A. Loukissas, and A. Vahdat, “A scalable, commodity data center network architecture,” SIGCOMM Comput. Commun. Rev., vol. 38, no. 4. New York, NY, USA: Association for Computing Machinery, Aug. 2008, p. 63–74. [Online]. Available: https://doi.org/10.1145/1402946.1402967

work page doi:10.1145/1402946.1402967 2008
[21]

Congestion control for large-scale rdma deployments,

Y. Zhu, H. Eran, D. Firestone, C. Guo, M. Lipshteyn, Y. Liron, J. Padhye, S. Raindel, M. H. Yahia, and M. Zhang, “Congestion control for large-scale rdma deployments,” vol. 45, no. 4. New York, NY, USA: Association for Computing Machinery, Aug. 2015, p. 523–536. [Online]. Available: https://doi.org/10.1145/2829988.2787484

work page doi:10.1145/2829988.2787484 2015
[22]

GQA: Training generalized multi-query transformer models from multi-head checkpoints,

J. Ainslie, J. Lee-Thorp, M. De Jong, Y. Zemlyanskiy, F. Lebrón, and S. Sanghai, “GQA: Training generalized multi-query transformer models from multi-head checkpoints,” in Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, 2023, pp. 4895–4901

2023
[23]

HPCC: High precision congestion control,

Y. Li, R. Miao, H. H. Liu, Y. Zhuang, F. Feng, L. Tang, Z. Cao, M. Zhang, F. Kelly, M. Alizadeh, and M. Yu, “HPCC: High precision congestion control,” in Proceedings of the ACM SIGCOMM 2019 Conference, 2019

2019
[24]

Orca: A distributed serving system for Transformer-based generative models,

G.-I. Yu, J. S. Jeong, G.-W. Kim, S. Kim, and B.-G. Chun, “Orca: A distributed serving system for Transformer-based generative models,” in 16th USENIX Symposium on Operating Systems Design and Implementation (OSDI 22), 2022, pp. 521–538

2022
[25]

MLPerf inference: Datacenter v5.0 results,

MLCommons, “MLPerf inference: Datacenter v5.0 results,” https://mlcommons.org/benchmarks/inference-datacenter/, Apr. 2025, llama-2-70B offline, e.g., Juniper Networks 32×H100 submission at 82,749 tokens/s

2025
[26]

Data center tcp (dctcp),

M. Alizadeh, A. Greenberg, D. A. Maltz, J. Padhye, P. Patel, B. Prabhakar, S. Sengupta, and M. Sridharan, “Data center tcp (dctcp),” in Proceedings of the ACM SIGCOMM 2010 Conference, ser. SIGCOMM ’10. New York, NY, USA: Association for Computing Machinery, 2010, p. 63–74. [Online]. Available: https://doi.org/10.1145/1851182.1851192

work page doi:10.1145/1851182.1851192 2010
[27]

CacheGen: KV cache compression and streaming for fast large language model serving,

Y. Liu, H. Li, Y. Cheng, S. Ray, Y. Huang, Q. Zhang, K. Du, J. Yao, S. Lu, G. Ananthanarayanan, M. Maire, H. Hoffmann, A. Holtzman, and J. Jiang, “CacheGen: KV cache compression and streaming for fast large language model serving,” in Proceedings of the ACM SIGCOMM 2024 Conference, 2024

2024
[28]

LMCache: An efficient KV cache layer for enterprise-scale LLM inference,

Y. Liu, J. Yao, H. Li, Y. Cheng et al., “LMCache: An efficient KV cache layer for enterprise-scale LLM inference,” arXiv:2510.09665, Oct. 2025

arXiv 2025
