NetKV: Network-Aware Decode Instance Selection for Disaggregated LLM Inference
Pith reviewed 2026-06-28 07:16 UTC · model grok-4.3
The pith
Ignoring network distance and congestion in KV cache transfers makes cache-aware LLM schedulers arbitrarily suboptimal as context length grows.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
We prove that ignoring the network term renders cache-aware-only scheduling arbitrarily suboptimal as context length grows. NetKV, the O(|D|) per-request greedy that consumes this oracle, has tier rankings that are provably robust to stale telemetry. On a 64-GPU four-tier fat-tree simulator driven by Mooncake traces, NetKV reduces mean TTFT by up to 21.2% over round-robin and 17.6% over a tuned cache+load-aware scheduler, lifts SLO attainment by up to 20.1 percentage points, and keeps the Time Between Tokens overhead below 0.5 ms in every condition tested, with no changes to the transport, inference engine, or hardware.
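As a rough illustration of what an O(|D|) per-request greedy could look like, the sketch below scores every decode instance once and keeps the cheapest. The `DecodeInstance` fields, the cost model (queue delay plus transfer time for the uncached part of the KV cache), and all numbers are assumptions made for illustration; the paper's actual cost function is not reproduced on this page.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DecodeInstance:
    name: str
    queue_delay_s: float      # hypothetical: estimated wait before decode starts
    cached_fraction: float    # hypothetical: share of the KV cache already resident
    bandwidth_gbps: float     # hypothetical: oracle-reported effective bandwidth


def estimated_cost_s(inst: DecodeInstance, kv_bytes: int) -> float:
    """Estimated seconds until decode can start on `inst` (illustrative model)."""
    bytes_to_move = kv_bytes * (1.0 - inst.cached_fraction)
    transfer_s = bytes_to_move * 8 / (inst.bandwidth_gbps * 1e9)
    return inst.queue_delay_s + transfer_s


def select_decode_instance(instances, kv_bytes: int) -> DecodeInstance:
    """Single pass over the candidates: O(|D|) work per request."""
    if not instances:
        raise ValueError("no decode instances available")
    return min(instances, key=lambda inst: estimated_cost_s(inst, kv_bytes))


# Made-up candidates: a nearby instance with a cold cache and a distant
# instance with half of the prefix already cached.
candidates = [
    DecodeInstance("same-rack", queue_delay_s=0.020, cached_fraction=0.0, bandwidth_gbps=400.0),
    DecodeInstance("cross-pod", queue_delay_s=0.005, cached_fraction=0.5, bandwidth_gbps=50.0),
]
short = select_decode_instance(candidates, kv_bytes=100 * 1024**2)  # 100 MiB
long_ = select_decode_instance(candidates, kv_bytes=10 * 1024**3)   # 10 GiB
```

With these made-up inputs the distant, cache-warm instance wins for the small transfer, while the nearby instance wins once the KV cache is large, which is the kind of flip a cache-only scheduler would miss.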
What carries the argument
The network cost oracle: a thin operator-to-scheduler interface that reports transfer costs between prefill and decode instances based on topology and congestion.
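To make the idea of a thin interface concrete, here is a minimal sketch of how such an oracle might be exposed to a scheduler. The interface name `NetworkCostOracle`, the method `transfer_time_s`, the tier table, and the congestion model (available bandwidth scaled by one minus utilisation) are all hypothetical; the page does not describe the paper's actual API.

```python
from typing import Protocol


class NetworkCostOracle(Protocol):
    """Hypothetical interface; the method name and units are assumptions."""

    def transfer_time_s(self, prefill: str, decode: str, num_bytes: int) -> float:
        ...


class StaticTierOracle:
    """Toy oracle: looks up a tier per (prefill, decode) pair, scales by congestion."""

    def __init__(self, tier_of, tier_bandwidth_gbps, congestion=None):
        self._tier_of = tier_of              # {(prefill, decode): tier index}
        self._bw = tier_bandwidth_gbps       # {tier index: Gbit/s}
        self._congestion = congestion or {}  # {tier index: utilisation in [0, 1)}

    def transfer_time_s(self, prefill, decode, num_bytes):
        tier = self._tier_of[(prefill, decode)]
        utilisation = self._congestion.get(tier, 0.0)
        available_gbps = self._bw[tier] * (1.0 - utilisation)
        return num_bytes * 8 / (available_gbps * 1e9)


# Made-up topology: d0 shares the lowest tier with p0, d1 sits three tiers away
# on a link that is currently half utilised.
oracle = StaticTierOracle(
    tier_of={("p0", "d0"): 0, ("p0", "d1"): 3},
    tier_bandwidth_gbps={0: 400.0, 3: 100.0},
    congestion={3: 0.5},
)
near_s = oracle.transfer_time_s("p0", "d0", 10**9)
far_s = oracle.transfer_time_s("p0", "d1", 10**9)
```

Keeping the scheduler dependent only on the protocol means the operator could swap the static table for live telemetry without touching the scheduling code, which is one plausible reading of "thin".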
If this is right
- Cache-aware-only scheduling becomes arbitrarily suboptimal as context length increases.
- NetKV reduces mean TTFT by up to 21.2% versus round-robin selection.
- NetKV reduces mean TTFT by up to 17.6% versus a tuned cache-plus-load scheduler.
- NetKV raises SLO attainment by up to 20.1 percentage points.
- Time-between-tokens overhead stays below 0.5 ms without any changes to transport or engines.
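The first bullet can be made tangible with back-of-the-envelope arithmetic: KV cache size grows linearly with context length, so the transfer-time gap between a fast and a slow path grows linearly too. The sketch below assumes a model shape (80 layers, 8 KV heads, head dimension 128, 2-byte values) and two link speeds purely for illustration; none of these values are taken from the paper.

```python
def kv_cache_bytes(tokens, layers, kv_heads, head_dim, bytes_per_value=2):
    """Bytes of keys plus values for `tokens` positions (factor 2 = K and V)."""
    return 2 * layers * kv_heads * head_dim * bytes_per_value * tokens


def transfer_seconds(num_bytes, bandwidth_gbps):
    """Idealised transfer time: payload bits divided by link rate."""
    return num_bytes * 8 / (bandwidth_gbps * 1e9)


# Assumed link speeds for a nearby and a distant decode instance.
NEAR_GBPS, FAR_GBPS = 400.0, 50.0

for tokens in (1_000, 32_000, 128_000):
    size = kv_cache_bytes(tokens, layers=80, kv_heads=8, head_dim=128)
    gap = transfer_seconds(size, FAR_GBPS) - transfer_seconds(size, NEAR_GBPS)
    print(f"{tokens:>7} tokens: {size / 1e9:6.2f} GB, far-minus-near gap {gap * 1e3:8.1f} ms")
```

Because the gap scales with the token count while any cache or load advantage of the distant instance is bounded, a scheduler that ignores the network term can be made to lose by an arbitrarily large margin under this simple model.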
Where Pith is reading between the lines
- The same oracle interface could be reused for other disaggregated stages such as embedding or retrieval.
- Production systems might combine the oracle with existing network monitoring stacks to keep data fresh.
- Different fat-tree or Clos topologies could be tested to check whether the robustness to stale data holds.
- The approach hints that network cost should be considered when deciding batch sizes or prefill-decode pairings.
Load-bearing premise
The network cost oracle can be realized with low overhead and sufficiently fresh data, and the four-tier fat-tree simulator driven by Mooncake traces faithfully represents production network behavior and workload patterns.
What would settle it
Measure actual TTFT and oracle overhead when NetKV runs on real production hardware with live network telemetry instead of the simulator.
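For such a measurement, the two headline metrics could be computed from per-request TTFT samples roughly as follows. This is a sketch under assumptions: the function names are hypothetical, the samples are made up, and the paper's exact SLO definition is not given on this page.

```python
from statistics import mean


def mean_ttft_reduction_pct(baseline_s, candidate_s):
    """Relative reduction in mean TTFT, in percent of the baseline mean."""
    base = mean(baseline_s)
    return 100.0 * (base - mean(candidate_s)) / base


def slo_attainment_pct(ttft_s, slo_s):
    """Share of requests whose TTFT meets the SLO, in percent."""
    return 100.0 * sum(t <= slo_s for t in ttft_s) / len(ttft_s)


# Made-up samples (seconds), only to exercise the functions.
baseline = [0.40, 0.60, 0.80, 1.20]
candidate = [0.30, 0.50, 0.70, 0.90]
reduction = mean_ttft_reduction_pct(baseline, candidate)
gain_pp = slo_attainment_pct(candidate, 0.75) - slo_attainment_pct(baseline, 0.75)
```

Note the different units: the TTFT figure is a relative change in percent, whereas the SLO figure is a difference between two percentages and is therefore reported in percentage points.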
Original abstract
Disaggregated LLM inference forces the KV cache to traverse the datacenter network before decoding begins, so transfer time enters directly into the Time to First Token (TTFT) budget. Current schedulers route on compute load and prefix-cache locality alone, ignoring the topological distance and dynamic congestion between prefill and decode instances. We close this gap with a thin operator-to-scheduler interface, the network cost oracle, and we prove that ignoring the network term renders cache-aware-only scheduling arbitrarily suboptimal as context length grows. NetKV, the O(|D|) per-request greedy that consumes this oracle, has tier rankings that are provably robust to stale telemetry. On a 64-GPU four-tier fat-tree simulator driven by Mooncake traces, NetKV reduces mean TTFT by up to 21.2% over round-robin and 17.6% over a tuned cache+load-aware scheduler, lifts SLO attainment by up to 20.1 percentage points, and keeps the Time Between Tokens overhead below 0.5 ms in every condition tested, with no changes to the transport, inference engine, or hardware.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces NetKV, a scheduler for decode instance selection in disaggregated LLM inference that incorporates a network cost oracle to account for topological distance and congestion between prefill and decode instances. It asserts a proof that cache-aware-only scheduling is arbitrarily suboptimal as context length grows, and reports simulation results on a 64-GPU four-tier fat-tree topology driven by Mooncake traces showing up to 21.2% mean TTFT reduction versus round-robin, 17.6% versus a tuned cache+load scheduler, and up to 20.1 percentage point gains in SLO attainment, with TBT overhead below 0.5 ms.
Significance. If the simulation results and oracle assumptions hold under real workloads, NetKV could improve TTFT and SLO compliance in production disaggregated LLM serving without transport or hardware changes. The claimed robustness of tier rankings to stale telemetry and the O(|D|) greedy algorithm are practical strengths if the cost model is realizable with low overhead.
major comments (3)
- [Abstract] The suboptimality proof is asserted without any derivation steps, cost-model equations, or formal statement of the network term, which is load-bearing for the central theoretical claim that cache-aware-only scheduling becomes arbitrarily suboptimal.
- [Abstract] All headline TTFT (21.2%) and SLO (20.1 pp) gains are obtained exclusively from a 64-GPU four-tier fat-tree simulator driven by Mooncake traces; no cross-validation against production fat-tree measurements, ECMP behavior, or trace burstiness is reported, undermining external validity of the empirical claims.
- [Abstract] The network cost oracle is introduced as a thin interface but no implementation overhead, freshness requirements, or accuracy bounds are quantified, which is load-bearing for the claim that NetKV can be deployed without changes to the inference engine.
minor comments (1)
- [Abstract] Experimental configuration details (e.g., context length distribution, request arrival pattern, exact definition of the tuned cache+load baseline) and any statistical significance tests are absent, making it impossible to reproduce or assess the reported gains.
Simulated Author's Rebuttal
We thank the referee for the constructive comments. Below we respond point-by-point to the major comments and indicate the revisions we will make.
- Referee: [Abstract] The suboptimality proof is asserted without any derivation steps, cost-model equations, or formal statement of the network term, which is load-bearing for the central theoretical claim that cache-aware-only scheduling becomes arbitrarily suboptimal.
Authors: The complete proof, cost-model equations, and formal definition of the network term appear in Section 3. We agree the abstract should be self-contained on this central claim and will revise it to include a concise statement of the cost function C(p,d) = α·cache_hit(p,d) + η·network_cost(p,d) together with a one-sentence outline of the suboptimality argument. revision: yes
- Referee: [Abstract] All headline TTFT (21.2%) and SLO (20.1 pp) gains are obtained exclusively from a 64-GPU four-tier fat-tree simulator driven by Mooncake traces; no cross-validation against production fat-tree measurements, ECMP behavior, or trace burstiness is reported, undermining external validity of the empirical claims.
Authors: All reported numbers are indeed from the 64-GPU fat-tree simulator driven by Mooncake traces; the simulator does model ECMP and trace burstiness. We lack production fat-tree measurements for direct cross-validation. We will add an explicit Limitations subsection that states the simulation assumptions and the absence of real-cluster validation, while retaining the simulation results as the primary evidence. revision: partial
- Referee: [Abstract] The network cost oracle is introduced as a thin interface but no implementation overhead, freshness requirements, or accuracy bounds are quantified, which is load-bearing for the claim that NetKV can be deployed without changes to the inference engine.
Authors: We will expand the oracle description with (i) an estimated per-request query cost based on standard switch telemetry, (ii) the already-proven robustness of tier rankings to staleness, and (iii) a simple accuracy bound derived from typical monitoring error rates. These additions will be placed in Section 4 and referenced from the abstract. revision: yes
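One simple way to reason about why a tier ranking might tolerate stale telemetry (offered as an illustration, not as the paper's proof): if every reported cost is within some error bound of the true cost, and the true costs of any two tiers differ by more than twice that bound, the ordering cannot change. The tier names and millisecond values below are hypothetical.

```python
def ranking(costs):
    """Tier names ordered from cheapest to most expensive."""
    return sorted(costs, key=costs.get)


def ranking_is_safe(true_costs, max_error):
    """True if any estimates within +/- max_error must preserve the ordering.

    Sufficient condition: adjacent true costs are more than 2 * max_error apart.
    """
    ordered = sorted(true_costs.values())
    return all(b - a > 2 * max_error for a, b in zip(ordered, ordered[1:]))


# Hypothetical per-tier transfer costs in milliseconds and a stale snapshot
# whose entries are each off by at most 3 ms.
true_ms = {"same-host": 2.0, "same-rack": 10.0, "same-pod": 30.0, "cross-pod": 80.0}
stale_ms = {"same-host": 4.5, "same-rack": 7.0, "same-pod": 33.0, "cross-pod": 77.0}

safe_at_3ms = ranking_is_safe(true_ms, max_error=3.0)
same_order = ranking(stale_ms) == ranking(true_ms)
```

Under this reading, widely separated tiers are what make the ranking robust; absolute cost estimates can still be wrong even when the ordering is right.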
Circularity Check
No significant circularity; claims rest on external simulation and model-based proof
The paper presents a mathematical proof that cache-aware scheduling becomes arbitrarily suboptimal when network cost is ignored, plus empirical gains from a 64-GPU four-tier fat-tree simulator driven by Mooncake traces. No equations, fitted parameters, or self-citations are shown that would make the TTFT reductions or SLO lifts equivalent to the inputs by construction. The proof is internal to the cost model but does not tautologically force the reported simulation outcomes; results are presented as simulator outputs rather than self-referential predictions. The derivation chain is therefore self-contained.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: the network cost oracle supplies accurate topological distance and congestion values with acceptable staleness
invented entities (1)
- network cost oracle (no independent evidence)
Forward citations
Cited by 1 Pith paper
- Can I Buy Your KV Cache?
Proposes an agent-native prefill CDN where precomputed KV caches are hosted and sold to agents, delivering 9-50x compute savings with exact token and logit matching on Qwen3-4B.