arxiv: 2606.01502 · v1 · pith:RXAIV4XQnew · submitted 2026-05-31 · 💻 cs.DC · cs.AI · cs.NI

Move the Query, Not the Cache: Characterizing Cross-Instance Latent Attention Redistribution Across GPU Fabrics

Pith reviewed 2026-06-28 16:00 UTC · model grok-4.3

classification 💻 cs.DC · cs.AI · cs.NI
keywords cross-instance attention · KV-cache redistribution · Multi-head Latent Attention · RDMA · cost model · distributed LLM inference · sparse attention · GPU cluster

The pith

For cross-instance MLA attention, routing the small compressed query to the cache beats moving the cache to the query.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper examines attention across GPUs when a KV-cache corpus too large for one GPU is partitioned across instances, so that a query must attend to cache blocks held remotely. Multi-head Latent Attention compresses each token's key and value into one narrow vector, so a routed query row is only about 1 KB, smaller than the chunk it attends. On a multi-node H100 cluster with device-initiated RDMA (IBGDA), the work measures a tens-of-microsecond round trip for routing the query, against roughly 3 ms for the re-adaptation splice that moving a contiguous chunk requires. It supplies a topology-aware cost model built from probe, transfer, compute, return, and merge phases, plus a closed-form route/fetch/local predicate; the model tracks observed batched round-trips to within about 7 percent. The model and predicate apply to any architecture whose compression or sparse selection keeps attention units small.

Core claim

When a query and the KV-cache blocks it selects reside on different GPUs, Multi-head Latent Attention inverts the usual arithmetic: each token's key and value are compressed into one narrow vector, so the routed query row is smaller than the chunk it attends. Characterizing this on a multi-node H100 cluster with IBGDA produces a reusable topology-aware cost model and a closed-form route/fetch/local predicate whose constants are measured on the fabric; the model tracks batched round-trips to within about 7 percent. At decode the system therefore routes the query, trading the cost of moving the cache (a roughly 3 ms re-adaptation splice for a contiguous chunk, or a scattered gather under selection) for a tens-of-microsecond round trip.

What carries the argument

The topology-aware cost model (probe / transfer / compute / return / merge phases) together with the closed-form route/fetch/local predicate that decides whether to route the query, fetch the cache, or stay local.
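A minimal sketch of how a five-phase cost model and a route/fetch/local decision could fit together. The additive latency-plus-bytes-over-bandwidth form, the class and function names, and the compute and merge constants are illustrative assumptions; the paper's actual closed form and measured constants are not reproduced on this page.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Fabric:
    """Hypothetical per-link constants; in practice these would be measured."""
    probe_us: float            # small-message latency in microseconds
    bandwidth_gb_per_s: float  # sustained bandwidth in gigabytes per second

    def transfer_us(self, nbytes: int) -> float:
        # 1 GB/s equals 1e3 bytes per microsecond.
        return nbytes / (self.bandwidth_gb_per_s * 1e3)


def route_cost_us(fabric, query_bytes, partial_bytes, compute_us, merge_us):
    """Sum of the five phases: probe, transfer, compute, return, merge."""
    return (fabric.probe_us
            + fabric.transfer_us(query_bytes)     # send the query to the holder
            + compute_us                          # attention on the holder
            + fabric.transfer_us(partial_bytes)   # return the partial
            + merge_us)                           # merge on the requester


def fetch_cost_us(fabric, chunk_bytes, splice_us, compute_us):
    """Pull the chunk to the requester, re-adapt it, then attend locally."""
    return (fabric.probe_us + fabric.transfer_us(chunk_bytes)
            + splice_us + compute_us)


def choose(is_local, route_us, fetch_us):
    """Return 'local', 'route', or 'fetch'; ties go to route."""
    if is_local:
        return "local"
    return "route" if route_us <= fetch_us else "fetch"


# Illustrative values only: 25 GB/s link, 10 us probe, 3 ms splice.
link = Fabric(probe_us=10.0, bandwidth_gb_per_s=25.0)
route_us = route_cost_us(link, 1152, 1032, compute_us=5.0, merge_us=1.0)
fetch_us = fetch_cost_us(link, 2_359_296, splice_us=3000.0, compute_us=5.0)
decision = choose(False, route_us, fetch_us)
```

Under these assumed numbers the fixed splice term dominates the fetch cost, which is the shape of the argument the page describes: the decision depends on the cost structure more than on the raw byte count.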

If this is right

  • At decode, routing the query replaces the 3 ms cache re-adaptation splice with a tens-of-microsecond round trip.
  • The predicate selects the interconnect by measured probe latency rather than advertised peak bandwidth.
  • The model and predicate extend directly to other compressed or sparsely indexed attention systems such as DeepSeek-V3.2.
  • Extending the framework to a new architecture requires measuring only two coefficients: the size of the routed payload and the move-the-cache cost of fetch.
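To see why probe latency rather than peak bandwidth would decide the fabric for a payload this small, consider a simple latency-plus-bandwidth model. The model form, the function names, and both fabric entries below are hypothetical and are not measurements from the paper.

```python
def message_time_us(probe_us, bandwidth_gb_per_s, nbytes):
    """Assumed model: fixed probe latency plus bytes over bandwidth."""
    return probe_us + nbytes / (bandwidth_gb_per_s * 1e3)


def pick_fabric(fabrics, nbytes):
    """Return the name of the fabric with the lowest modeled time."""
    return min(fabrics, key=lambda name: message_time_us(*fabrics[name], nbytes))


# Hypothetical fabrics: (probe latency in us, bandwidth in GB/s).
fabrics = {
    "low_latency": (5.0, 25.0),
    "high_bandwidth": (20.0, 100.0),
}

small = pick_fabric(fabrics, 2184)        # a routed query plus partial
large = pick_fabric(fabrics, 4_000_000)   # a multi-megabyte chunk
```

For the roughly 2 KB routed payload the bandwidth term is a fraction of a microsecond on either fabric, so the lower probe latency wins; only for multi-megabyte transfers does the higher-bandwidth link come out ahead.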

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Distributed inference schedulers could embed the route/fetch/local predicate to decide data movement per attention step instead of fixing a single policy.
  • Workloads that repeatedly query the same large shared corpus across many sub-agents would see the largest reduction in cross-instance traffic.
  • The same probe-based decision logic could later guide initial KV-cache placement across instances to minimize expected future routing cost.

Load-bearing premise

MLA compression keeps the routed query row at roughly 1 KB and therefore smaller than the attended cache chunk on the measured fabrics.
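The size premise can be checked with a little arithmetic. The sketch below assumes the published DeepSeek-V2 MLA dimensions (a 512-wide compressed KV latent plus a 64-wide decoupled RoPE key) stored in bf16. The split of the routed payload into a query row plus a partial carrying two fp32 softmax statistics is one decomposition that happens to sum to the 2184 B quoted in Figure 1; this page does not confirm it. The chunk length in the usage is a hypothetical value.

```python
BF16_BYTES = 2
FP32_BYTES = 4

KV_LATENT_DIM = 512  # compressed KV latent width (DeepSeek-V2 family)
ROPE_DIM = 64        # decoupled RoPE key width (DeepSeek-V2 family)


def cached_token_bytes() -> int:
    """Bytes held per cached token per layer: latent plus RoPE key."""
    return (KV_LATENT_DIM + ROPE_DIM) * BF16_BYTES


def routed_payload_bytes() -> int:
    """Assumed payload: one query row out, one partial back."""
    # Assumption: the routed query row has the same shape as a cached token.
    query_row = (KV_LATENT_DIM + ROPE_DIM) * BF16_BYTES
    # Assumption: the partial is a latent-width output plus two fp32 statistics.
    partial = KV_LATENT_DIM * BF16_BYTES + 2 * FP32_BYTES
    return query_row + partial


def fetch_bytes(chunk_tokens: int) -> int:
    """Bytes a fetch would pull for one layer of a chunk."""
    return chunk_tokens * cached_token_bytes()


chunk_tokens = 2048  # hypothetical chunk length
ratio = fetch_bytes(chunk_tokens) / routed_payload_bytes()
```

The routed payload is fixed while the fetch grows linearly with the chunk length, so the ratio scales with `chunk_tokens`; the premise fails only when the attended chunk shrinks to a token or two.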

What would settle it

A measurement on the same H100 cluster showing that query routing latency exceeds the 3 ms cost of cache movement and re-adaptation for the batch sizes tested would falsify the claimed advantage.

Figures

Figures reproduced from arXiv: 2606.01502 by Bole Ma, Gerhard Wellein, Harald Köstler, Jan Eitzinger.

Figure 1: (a) Payload asymmetry, shown for MLA: because a routed query row and a cached token are the same narrow object, a routed query+partial (𝑞+𝑝 = 2184 B) is ∼1000× smaller than the one-layer 𝑐^KV chunk a fetch would pull, the cleanest instance of the fixed routed payload versus chunk-scaling fetch that favours route. (b) The load-bearing argument is the cost shape, not the byte count: cost of the three primitiv…
Figure 2: (a) The 2-node × 4-H100 SXM5 testbed: each node is a direct all-to-all NVLink island (NV6 per GPU pair, no NVSwitch) with one ConnectX-7 NIC per GPU; cross-node traffic is device-initiated IBGDA over an InfiniBand NDR-200 switch, and the dashed path traces a routed query from a requester GPU to the corpus holder. The pair is drawn here same-leaf; we measure it both same-leaf and spine-traversing (cross-lea…
Figure 3: (a) The choice, with both operands resident in H100 HBM: a holder owns a large 𝑐^KV corpus, a requester has a ≈1 KB query row; route moves the query and merges a small partial back (chosen), while fetch would move the whole multi-megabyte cache (avoided). (b) route vs fetch on wire bytes over the (𝑀_𝑞, 𝑐_𝑡) grid (DeepSeek-V2-Lite, bf16): green where routing moves fewer bytes, red where pulling the chunk doe…
Figure 4: Under selection, route's cost stays flat where the alternatives grow (cross-node H100, IBGDA). (a) Scatter transport. Gathering a 𝐾-entry selected set spread across 𝑀 holders (fetch, red, per layer) grows with 𝑀 (scattering defeats bulk coalescing, so each holder is a separate transfer) and is fabric-invariant (no kink as holders cross the node boundary at 𝑀 ≥ 4, shaded); the full pull scales this by 𝐿=27…
Figure 5: (a) 𝑁 routed requesters fan in to one corpus holder, which batches their partials over its resident 𝑐^KV (the highlighted slab is the selected top-𝑘 block); the copy and compute elbows both sit near 𝑁 ≈ 8. (b) Holder-side staging: p50 fetch round trip and steady-state floor vs. the number of CUDA streams 𝐾 in the holder pool (chunk-prefetch workload, 2-node × 4 H100). 𝐾=8 is the elbow; 𝐾=1 (a single async s…
Figure 6: Fabric robustness of redistribution at the decode operating point (…
Figure 7: route round trip under self-congestion. (a) 𝐾 concurrent route flows share one NDR-200 link (two same-leaf H100 nodes); latency is flat through 𝐾=2 at every batch and rises only once the link is fully subscribed (𝐾=3; 𝑀_𝑞=1024: 114→250 𝜇s, +119%), every case far below fetch's ≈3 ms splice (dashed); the route-vs-fetch ranking never inverts. (b) The same flat-until-saturation shape reproduces on an unrelated …
Original abstract

Frontier LLMs increasingly decide what a query attends to with a sparse-attention indexer that picks a few KV-cache blocks per query: attention's unit is now a small, reusable chunk. Agentic workloads hammer it: many sub-agents query one large codebase, reusing the same blocks. When that corpus outgrows one GPU it is partitioned across instances, so a query and the blocks it selects often sit on different GPUs: answering it means attention across instances. The reflex of prior cross-instance KV systems is to move the cache: pull the selected blocks to the requester. Multi-head Latent Attention inverts the arithmetic, compressing each token's key and value into one narrow vector, so a routed query row is only ~1 KB, smaller than the chunk it attends; routing the query is then often cheaper than moving the cache. Which primitive wins, over which fabric and request shape, is uncharted, least of all on device-initiated RDMA that makes per-request cross-node transfers cheap. We characterize cross-instance MLA attention on a real multi-node H100 cluster, distilling two reusable artifacts: a topology-aware cost model (probe / transfer / compute / return / merge) and a closed-form route/fetch/local predicate, whose constants we measure on real IBGDA, where the model tracks batched round-trips to within ~7%. At decode it routes the query, trading the cost of moving the cache (a ~3 ms re-adaptation splice for a contiguous chunk, or a scattered gather under selection) for a tens-of-microsecond round trip, and picks the fabric by probe latency, not peak bandwidth. We instantiate the cost model and predicate for MLA, but neither is MLA-specific: they apply wherever compression or sparse selection shrinks attention to small chunks (DeepSeek-V3.2, V4, and GLM-5.1 today). Extending them to a new architecture requires measuring just two coefficients: the routed payload and fetch's move-the-cache cost.
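The merge phase named in the abstract has to combine attention partials computed over disjoint cache shards. A standard way to do this, which may or may not match the paper's implementation, is to return each partial together with its log-sum-exp and rescale on merge. The pure-Python sketch below uses hypothetical function names and toy data to show that the merged result equals attention over the full key set.

```python
import math


def partial_attention(q, keys, values):
    """Attention of one query row over one shard: (output, log-sum-exp)."""
    scale = 1.0 / math.sqrt(len(q))
    scores = [scale * sum(qi * ki for qi, ki in zip(q, k)) for k in keys]
    top = max(scores)                      # subtract the max for stability
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    out = [sum(w * v[d] for w, v in zip(weights, values)) / total
           for d in range(dim)]
    return out, top + math.log(total)


def merge_partials(partials):
    """Combine (output, lse) pairs from disjoint shards into one output."""
    top = max(lse for _, lse in partials)
    coeffs = [math.exp(lse - top) for _, lse in partials]
    norm = sum(coeffs)
    dim = len(partials[0][0])
    return [sum(c * out[d] for c, (out, _) in zip(coeffs, partials)) / norm
            for d in range(dim)]


# Toy data: one query row, four cached tokens split across two holders.
q = [0.5, -1.0]
keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, 2.0]]
values = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0], [3.0, -1.0]]

full, _ = partial_attention(q, keys, values)
merged = merge_partials([
    partial_attention(q, keys[:2], values[:2]),
    partial_attention(q, keys[2:], values[2:]),
])
```

Because each shard's softmax denominator is recoverable from its log-sum-exp, the requester only needs the small partial and one scalar per holder, which is what keeps the return leg of a routed query cheap.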

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The manuscript characterizes cross-instance MLA attention on a multi-node H100 cluster with device-initiated RDMA (IBGDA). It claims that MLA compression reduces each query row to ~1 KB—smaller than the attended KV chunk—so routing the query is cheaper than moving the cache (~3 ms re-adaptation splice or scattered gather). The authors distill a reusable topology-aware cost model (probe/transfer/compute/return/merge) and a closed-form route/fetch/local predicate; constants are measured on real hardware and the model tracks batched round-trips to within ~7%. The predicate is instantiated for MLA but presented as general for any compressed or sparse attention that shrinks the attended unit.

Significance. If the measurements and size inequality hold, the work supplies a practical, low-overhead primitive for distributed sparse attention in agentic workloads and a reusable cost model that requires only two hardware-specific coefficients for new architectures. The explicit measurement of constants on production fabrics and the non-MLA-specific formulation are concrete strengths that would make the artifacts immediately usable by systems builders.

major comments (2)
  1. [Abstract] Abstract: The central claim that query routing is preferred rests on the routed payload being ~1 KB and materially smaller than the attended chunk. No table, derivation, or measured payload size is referenced to anchor this figure or to confirm the inequality on the tested H100/IBGDA configuration; without it the arithmetic inversion and the reported preference for query movement cannot be verified.
  2. [Results] Results / evaluation section: The statement that the cost model tracks batched round-trips to within ~7% is load-bearing for the contribution, yet the provided text contains no data tables, per-batch error breakdowns, or full methods description that would allow independent assessment of that accuracy claim.
minor comments (1)
  1. [Discussion] The predicate and cost model are described as reusable beyond MLA, but the manuscript does not include even a brief worked example for another architecture (e.g., DeepSeek-V3.2) to illustrate the two-coefficient measurement process.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the careful and constructive review. The two major comments identify opportunities to strengthen verifiability of the central claims; we address each below and will revise the manuscript accordingly.

Point-by-point responses
  1. Referee: [Abstract] Abstract: The central claim that query routing is preferred rests on the routed payload being ~1 KB and materially smaller than the attended chunk. No table, derivation, or measured payload size is referenced to anchor this figure or to confirm the inequality on the tested H100/IBGDA configuration; without it the arithmetic inversion and the reported preference for query movement cannot be verified.

    Authors: We agree that the abstract would be strengthened by an explicit anchor for the ~1 KB figure. The manuscript derives this size from the MLA compression (one narrow latent vector per token) and confirms the inequality via direct measurement on the target hardware, but the abstract does not cite the relevant section or table. In the revised manuscript we will add a parenthetical reference in the abstract to the section presenting the payload-size measurement and the chunk-size comparison on the H100/IBGDA fabric.

    Revision: yes

  2. Referee: [Results] Results / evaluation section: The statement that the cost model tracks batched round-trips to within ~7% is load-bearing for the contribution, yet the provided text contains no data tables, per-batch error breakdowns, or full methods description that would allow independent assessment of that accuracy claim.

    Authors: We acknowledge that the current text does not supply the supporting tables or breakdowns needed to assess the ~7% accuracy claim. We will expand the results section to include (1) a table of measured versus predicted batched round-trip latencies, (2) the per-batch relative-error distribution, and (3) an expanded methods subsection describing how the two hardware-specific coefficients were obtained on the IBGDA fabric. These additions will make the accuracy statement independently verifiable.

    Revision: yes
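The per-batch check described in this response could be as simple as the following sketch. The function names are hypothetical, and the latency values in the usage are placeholders chosen for illustration, not measurements from the paper.

```python
def relative_errors(measured_us, predicted_us):
    """Per-batch |predicted - measured| / measured, keyed by batch size."""
    if measured_us.keys() != predicted_us.keys():
        raise ValueError("measured and predicted must cover the same batches")
    return {
        batch: abs(predicted_us[batch] - measured_us[batch]) / measured_us[batch]
        for batch in measured_us
    }


def within_tolerance(errors, tol=0.07):
    """True when every batch's relative error is at or below tol."""
    return all(err <= tol for err in errors.values())


# Placeholder latencies in microseconds, keyed by batch size.
measured = {1: 100.0, 8: 120.0}
predicted = {1: 95.0, 8: 126.0}

errors = relative_errors(measured, predicted)
ok = within_tolerance(errors)
```

Reporting the full per-batch dictionary rather than a single worst-case number would let a reader see whether the error concentrates at small or large batches.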

Circularity Check

0 steps flagged

No circularity: cost model relies on directly measured hardware constants independent of target predictions

Full rationale

The paper's closed-form predicate and cost model are constructed from two coefficients (routed payload size and fetch move-the-cache cost) that are measured directly on the target IBGDA/H100 hardware. These inputs are not fitted to the final latency numbers or derived from the predicate's output; the model is then validated against observed round-trips (to ~7%). The ~1 KB query-row claim is stated as a direct consequence of MLA compression rather than derived or fitted inside the paper. No equation or step reduces the reported preference for query routing to a quantity defined by that preference itself. The derivation chain therefore remains self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

2 free parameters · 1 axiom · 0 invented entities

The central claim rests on an empirical cost model whose constants are measured on the target hardware; no new physical entities are postulated.

free parameters (2)
  • routed payload size
    Stated as ~1 KB from MLA compression; used to decide when routing beats cache movement.
  • fetch move-the-cache cost
    Stated as ~3 ms for contiguous chunk re-adaptation; measured on the cluster.
axioms (1)
  • Domain assumption: The five-component breakdown (probe / transfer / compute / return / merge) captures all relevant latency sources on the measured fabrics.
    Invoked when the topology-aware cost model is introduced in the abstract.

pith-pipeline@v0.9.1-grok · 5916 in / 1456 out tokens · 28838 ms · 2026-06-28T16:00:30.706329+00:00 · methodology
