
arxiv: 2606.03910 · v1 · pith:3EPNADO3new · submitted 2026-06-02 · cs.PF · cs.AI · cs.DC · cs.NI

NetKV: Network-Aware Decode Instance Selection for Disaggregated LLM Inference

Pith reviewed 2026-06-28 07:16 UTC · model grok-4.3

classification cs.PF · cs.AI · cs.DC · cs.NI
keywords disaggregated LLM inference · KV cache transfer · network-aware scheduling · time to first token · decode instance selection · cache locality · SLO attainment · fat-tree network

The pith

Ignoring network distance and congestion in KV cache transfers makes cache-aware LLM schedulers arbitrarily suboptimal as context length grows.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Disaggregated LLM inference splits prefill and decode across machines, so the KV cache must cross the datacenter network before decoding can begin and this transfer time adds directly to TTFT. Current schedulers pick decode instances using only compute load and prefix-cache hits, without regard to topological distance or current congestion. The paper supplies a thin network cost oracle and proves that dropping the network term causes scheduling performance to degrade without bound for longer contexts. NetKV is a simple per-request greedy selector that consumes the oracle and produces provably stable tier rankings even with stale data. In simulations it cuts mean TTFT by up to 21 percent versus round-robin and raises SLO attainment by up to 20 percentage points while adding negligible overhead to time between tokens.

Core claim

We prove that ignoring the network term renders cache-aware-only scheduling arbitrarily suboptimal as context length grows. NetKV, the O(|D|) per-request greedy that consumes this oracle, has tier rankings that are provably robust to stale telemetry. On a 64-GPU four-tier fat-tree simulator driven by Mooncake traces, NetKV reduces mean TTFT by up to 21.2% over round-robin and 17.6% over a tuned cache+load-aware scheduler, lifts SLO attainment by up to 20.1 percentage points, and keeps the Time Between Tokens overhead below 0.5 ms in every condition tested, with no changes to the transport, inference engine, or hardware.

What carries the argument

the network cost oracle, a thin operator-to-scheduler interface that reports transfer costs between prefill and decode instances based on topology and congestion
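
A minimal sketch of how a per-request greedy selector might consume such an oracle. It assumes a hypothetical `NetworkCostOracle` protocol with a single method, `transfer_seconds`, and a hypothetical `DecodeInstance` record; the cost terms (queue delay plus transfer time for uncached KV) and all names and numbers are illustrative assumptions, not the paper's API or cost model. The single pass over the candidates is what makes the per-request work O(|D|).

```python
from dataclasses import dataclass
from typing import Protocol, Sequence


class NetworkCostOracle(Protocol):
    """Hypothetical operator-to-scheduler interface (names are assumptions)."""

    def transfer_seconds(self, prefill: str, decode: str, nbytes: int) -> float:
        ...


@dataclass(frozen=True)
class DecodeInstance:
    name: str
    queue_delay_s: float  # estimated wait caused by compute load
    cached_tokens: int    # prefix tokens already resident on this instance


class StaticOracle:
    """Toy oracle: fixed bandwidth per (prefill, decode) pair, in bytes/s."""

    def __init__(self, bandwidth: dict[tuple[str, str], float]) -> None:
        self._bandwidth = bandwidth

    def transfer_seconds(self, prefill: str, decode: str, nbytes: int) -> float:
        return nbytes / self._bandwidth[(prefill, decode)]


def select_decode(
    prefill: str,
    context_tokens: int,
    kv_bytes_per_token: int,
    candidates: Sequence[DecodeInstance],
    oracle: NetworkCostOracle,
) -> DecodeInstance:
    """Greedy choice in one pass over the candidates: O(|D|) per request."""

    def cost(d: DecodeInstance) -> float:
        # Assumption: only KV for tokens not already cached must cross the network.
        missing = max(context_tokens - d.cached_tokens, 0)
        nbytes = missing * kv_bytes_per_token
        return d.queue_delay_s + oracle.transfer_seconds(prefill, d.name, nbytes)

    return min(candidates, key=cost)


# Illustrative numbers only: a close instance with no cache hit versus a
# distant instance holding 8,000 cached prefix tokens.
oracle = StaticOracle({("p0", "near"): 40e9, ("p0", "far"): 5e9})
candidates = [
    DecodeInstance("near", queue_delay_s=0.010, cached_tokens=0),
    DecodeInstance("far", queue_delay_s=0.010, cached_tokens=8_000),
]
choice = select_decode("p0", 32_000, 160_000, candidates, oracle)
print(choice.name)
```

With these made-up numbers, a selector that looked only at cache hits would prefer `far`, while the transfer-time term steers the greedy choice to `near`.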

If this is right

  • Cache-aware-only scheduling becomes arbitrarily suboptimal as context length increases.
  • NetKV reduces mean TTFT by up to 21.2% versus round-robin selection.
  • NetKV reduces mean TTFT by up to 17.6% versus a tuned cache-plus-load scheduler.
  • NetKV raises SLO attainment by up to 20.1 percentage points.
  • Time-between-tokens overhead stays below 0.5 ms without any changes to transport or engines.
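
The first bullet can be illustrated with back-of-the-envelope arithmetic. The sketch below assumes a fixed number of cached prefix tokens and two hypothetical path bandwidths; every constant is illustrative and none is taken from the paper, whose formal statement may differ. Because the cache saving is fixed while the transfer penalty scales with context length, the gap between the cache-preferred and the network-preferred choice keeps growing.

```python
KV_BYTES_PER_TOKEN = 160_000  # assumed KV footprint per token (hypothetical)
CACHED_TOKENS = 4_000         # prefix already cached on the distant instance
NEAR_BW = 40e9                # bytes/s to the topologically close instance
FAR_BW = 5e9                  # bytes/s to the distant or congested instance


def transfer_s(tokens: int, bandwidth: float) -> float:
    """Seconds to move the KV cache for `tokens` tokens over one path."""
    return tokens * KV_BYTES_PER_TOKEN / bandwidth


def ttft_gap_s(context_tokens: int) -> float:
    """Extra transfer time paid by a cache-only choice (far, with a cache hit)
    relative to a network-aware choice (near, without a cache hit)."""
    far = transfer_s(max(context_tokens - CACHED_TOKENS, 0), FAR_BW)
    near = transfer_s(context_tokens, NEAR_BW)
    return far - near


context_lengths = (8_000, 32_000, 128_000)
gaps = [ttft_gap_s(n) for n in context_lengths]
for n, g in zip(context_lengths, gaps):
    print(f"{n:>7} tokens: cache-only choice is {g:.3f} s slower")
```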

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same oracle interface could be reused for other disaggregated stages such as embedding or retrieval.
  • Production systems might combine the oracle with existing network monitoring stacks to keep data fresh.
  • Different fat-tree or Clos topologies could be tested to check whether the robustness to stale data holds.
  • The approach hints that network cost should be considered when deciding batch sizes or prefill-decode pairings.

Load-bearing premise

The network cost oracle can be realized with low overhead and sufficiently fresh data, and the four-tier fat-tree simulator driven by Mooncake traces faithfully represents production network behavior and workload patterns.

What would settle it

Measure actual TTFT and oracle overhead when NetKV runs on real production hardware with live network telemetry instead of the simulator.

Figures

Figures reproduced from arXiv: 2606.03910 by Mubarak Adetunji Ojewale.

Figure 1: NetKV-Full mean-TTFT reduction over CLA* (%) across the topology sweep for each workload profile. Rows: cross …
Figure 2: Oracle staleness sweep: TTFT, TBT, and SLO are invariant from 100 ms to 60 s refresh intervals.
Figure 3: Prefix-sharing sweep on the RAG workload: NetKV-Full preserves a roughly constant TTFT advantage over CA and …
Figure 4: Ablation ladder: mean TTFT for CLA*, NetKV-Topo-Only, NetKV-Static, and NetKV-Full across the chatbot, RAG, …
Figure 5: Scalability: mean TTFT, mean TBT, SLO attainment, and scheduler decision latency from 64 to 1024 GPUs. The …
Original abstract

Disaggregated LLM inference forces the KV cache to traverse the datacenter network before decoding begins, so transfer time enters directly into the Time to First Token (TTFT) budget. Current schedulers route on compute load and prefix-cache locality alone, ignoring the topological distance and dynamic congestion between prefill and decode instances. We close this gap with a thin operator-to-scheduler interface, the network cost oracle, and we prove that ignoring the network term renders cache-aware-only scheduling arbitrarily suboptimal as context length grows. NetKV, the O(|D|) per-request greedy that consumes this oracle, has tier rankings that are provably robust to stale telemetry. On a 64-GPU four-tier fat-tree simulator driven by Mooncake traces, NetKV reduces mean TTFT by up to 21.2% over round-robin and 17.6% over a tuned cache+load-aware scheduler, lifts SLO attainment by up to 20.1 percentage points, and keeps the Time Between Tokens overhead below 0.5 ms in every condition tested, with no changes to the transport, inference engine, or hardware.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 1 minor

Summary. The paper introduces NetKV, a scheduler for decode instance selection in disaggregated LLM inference that incorporates a network cost oracle to account for topological distance and congestion between prefill and decode instances. It asserts a proof that cache-aware-only scheduling is arbitrarily suboptimal as context length grows, and reports simulation results on a 64-GPU four-tier fat-tree topology driven by Mooncake traces showing up to 21.2% mean TTFT reduction versus round-robin, 17.6% versus a tuned cache+load scheduler, and up to 20.1 percentage point gains in SLO attainment, with TBT overhead below 0.5 ms.

Significance. If the simulation results and oracle assumptions hold under real workloads, NetKV could improve TTFT and SLO compliance in production disaggregated LLM serving without transport or hardware changes. The claimed robustness of tier rankings to stale telemetry and the O(|D|) greedy algorithm are practical strengths if the cost model is realizable with low overhead.

major comments (3)
  1. [Abstract] The suboptimality proof is asserted without any derivation steps, cost-model equations, or formal statement of the network term, which is load-bearing for the central theoretical claim that cache-aware-only scheduling becomes arbitrarily suboptimal.
  2. [Abstract] All headline TTFT (21.2%) and SLO (20.1 pp) gains are obtained exclusively from a 64-GPU four-tier fat-tree simulator driven by Mooncake traces; no cross-validation against production fat-tree measurements, ECMP behavior, or trace burstiness is reported, undermining external validity of the empirical claims.
  3. [Abstract] The network cost oracle is introduced as a thin interface but no implementation overhead, freshness requirements, or accuracy bounds are quantified, which is load-bearing for the claim that NetKV can be deployed without changes to the inference engine.
minor comments (1)
  1. [Abstract] Experimental configuration details (e.g., context length distribution, request arrival pattern, exact definition of the tuned cache+load baseline) and any statistical significance tests are absent, making it impossible to reproduce or assess the reported gains.
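
For readers checking the headline numbers, note that the two metrics are computed differently: the TTFT reduction is a relative change in means, while the SLO gain is an absolute difference in percentage points. A minimal sketch with made-up latencies and a hypothetical 200 ms TTFT threshold (the paper's actual SLO threshold is not stated on this page):

```python
from statistics import mean

SLO_S = 0.200  # hypothetical TTFT SLO threshold, in seconds


def ttft_reduction_pct(baseline: list[float], candidate: list[float]) -> float:
    """Relative reduction in mean TTFT, in percent."""
    return 100.0 * (mean(baseline) - mean(candidate)) / mean(baseline)


def slo_attainment_pct(ttfts: list[float], slo_s: float = SLO_S) -> float:
    """Share of requests whose TTFT meets the SLO, in percent."""
    return 100.0 * sum(t <= slo_s for t in ttfts) / len(ttfts)


# Made-up per-request TTFTs (seconds) for a baseline and a candidate scheduler.
baseline = [0.150, 0.250, 0.300, 0.100]
candidate = [0.120, 0.190, 0.240, 0.090]

reduction = ttft_reduction_pct(baseline, candidate)
gain_pp = slo_attainment_pct(candidate) - slo_attainment_pct(baseline)
print(f"mean TTFT reduction: {reduction:.1f}%")
print(f"SLO attainment gain: {gain_pp:.1f} percentage points")
```

In this toy data the mean TTFT falls by 20% while SLO attainment rises from 50% to 75%, a gain of 25 percentage points; the two figures are not directly comparable.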

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive comments. Below we respond point-by-point to the major comments and indicate the revisions we will make.

  1. Referee: [Abstract] The suboptimality proof is asserted without any derivation steps, cost-model equations, or formal statement of the network term, which is load-bearing for the central theoretical claim that cache-aware-only scheduling becomes arbitrarily suboptimal.

    Authors: The complete proof, cost-model equations, and formal definition of the network term appear in Section 3. We agree the abstract should be self-contained on this central claim and will revise it to include a concise statement of the cost function C(p,d) = α·cache_hit(p,d) + η·network_cost(p,d) together with a one-sentence outline of the suboptimality argument. revision: yes

  2. Referee: [Abstract] All headline TTFT (21.2%) and SLO (20.1 pp) gains are obtained exclusively from a 64-GPU four-tier fat-tree simulator driven by Mooncake traces; no cross-validation against production fat-tree measurements, ECMP behavior, or trace burstiness is reported, undermining external validity of the empirical claims.

    Authors: All reported numbers are indeed from the 64-GPU fat-tree simulator driven by Mooncake traces; the simulator does model ECMP and trace burstiness. We lack production fat-tree measurements for direct cross-validation. We will add an explicit Limitations subsection that states the simulation assumptions and the absence of real-cluster validation, while retaining the simulation results as the primary evidence. revision: partial

  3. Referee: [Abstract] The network cost oracle is introduced as a thin interface but no implementation overhead, freshness requirements, or accuracy bounds are quantified, which is load-bearing for the claim that NetKV can be deployed without changes to the inference engine.

    Authors: We will expand the oracle description with (i) an estimated per-request query cost based on standard switch telemetry, (ii) the already-proven robustness of tier rankings to staleness, and (iii) a simple accuracy bound derived from typical monitoring error rates. These additions will be placed in Section 4 and referenced from the abstract. revision: yes
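
One plausible intuition behind the staleness robustness mentioned in response 3, offered as an assumption rather than as the paper's actual proof: if the base costs of adjacent topology tiers are separated by more than the largest congestion-induced swing, stale congestion readings can reorder instances within a tier but cannot flip the ordering between tiers. The tier names, base costs, and congestion bound below are hypothetical.

```python
import random

# Hypothetical per-tier base transfer costs (ms) for a four-tier topology.
TIER_BASE_MS = {
    "same-node": 1.0,
    "same-rack": 4.0,
    "same-pod": 16.0,
    "cross-pod": 64.0,
}
MAX_CONGESTION_FACTOR = 2.0  # assumed bound: congestion at most doubles a cost


def observed_cost(tier: str, congestion: float) -> float:
    """Cost reported by a toy oracle; congestion lies in [1, MAX_CONGESTION_FACTOR]."""
    return TIER_BASE_MS[tier] * congestion


def tier_ranking(congestion_by_tier: dict[str, float]) -> list[str]:
    """Tiers ordered from cheapest to most expensive under the given readings."""
    return sorted(
        TIER_BASE_MS,
        key=lambda t: observed_cost(t, congestion_by_tier[t]),
    )


rng = random.Random(0)
fresh = {t: rng.uniform(1.0, MAX_CONGESTION_FACTOR) for t in TIER_BASE_MS}
stale = {t: rng.uniform(1.0, MAX_CONGESTION_FACTOR) for t in TIER_BASE_MS}

# Adjacent tiers differ by 4x, more than the assumed 2x congestion bound, so
# the ranking is the same whether the scheduler sees fresh or stale readings.
print(tier_ranking(fresh) == tier_ranking(stale))
```

The design point is that the ranking depends on the tier separation, not on how recent the congestion sample is; if the separation were smaller than the congestion bound, this argument would no longer apply.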

Circularity Check

0 steps flagged

No significant circularity; claims rest on external simulation and model-based proof

full rationale

The paper presents a mathematical proof that cache-aware scheduling becomes arbitrarily suboptimal when network cost is ignored, plus empirical gains from a 64-GPU four-tier fat-tree simulator driven by Mooncake traces. No equations, fitted parameters, or self-citations are shown that would make the TTFT reductions or SLO lifts equivalent to the inputs by construction. The proof is internal to the cost model but does not tautologically force the reported simulation outcomes; results are presented as simulator outputs rather than self-referential predictions. The derivation chain is therefore self-contained.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The central design rests on the existence and low-overhead availability of a network cost oracle whose accuracy is treated as given; the suboptimality proof and simulation results depend on this interface and on the fidelity of the fat-tree model.

axioms (1)
  • domain assumption: Network cost oracle supplies accurate topological distance and congestion values with acceptable staleness.
    Invoked as the input to the O(|D|) greedy ranking and the robustness claim.
invented entities (1)
  • network cost oracle (no independent evidence)
    purpose: Thin operator-to-scheduler interface exposing network distance and congestion.
    New component introduced to close the gap left by compute-and-cache-only schedulers.

pith-pipeline@v0.9.1-grok · 5731 in / 1245 out tokens · 34497 ms · 2026-06-28T07:16:42.370219+00:00 · methodology


Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Can I Buy Your KV Cache?

    cs.AI 2026-06 unverdicted novelty 6.0

    Proposes an agent-native prefill CDN where precomputed KV caches are hosted and sold to agents, delivering 9-50x compute savings with exact token and logit matching on Qwen3-4B.
