HBM Is Not All You Need: Efficient Disaggregated LLM Serving across Memory-heterogeneous Accelerators

James Yen; Mingyuan Xia; Yun Wang; Zhengwei Qi; Zhixiang Wei

arxiv: 2606.29986 · v1 · pith:62RBERZYnew · submitted 2026-06-29 · 💻 cs.AR · cs.DC

Pith reviewed 2026-06-30 04:23 UTC · model grok-4.3

keywords: LLM serving, disaggregated inference, memory heterogeneous accelerators, GDDR, HBM, KV cache, quantization, goodput

The pith

Serving LLMs on mixed GDDR and HBM hardware with phase-wise quantization and deferred dequantization raises goodput by up to 3.2 times.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

LLM inference splits into a compute-heavy prefill phase and a memory-heavy decode phase, but today's systems use expensive HBM everywhere even when its bandwidth sits idle during prefill. The paper proposes pairing lower-cost GDDR accelerators for prefill with HBM GPUs for decode across vendors. Three techniques address the resulting compatibility issues: phase-wise quantization, overlapping transfers with computation, and shipping quantized data for later reconstruction. Experiments on Qwen3 models and real traces show substantial gains in goodput and cost efficiency with no drop in output quality. This matters for building cheaper, more efficient inference clusters.

Core claim

HMA-Serve pairs GDDR-based accelerators for prefill with HBM-based GPUs for decode through phase-wise quantization that keeps decode in BF16, a compute-transfer pipeline that overlaps KV cache transfers, and deferred dequantization that reduces bandwidth by shipping raw quantized bytes.
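
The benefit of the compute-transfer pipeline can be illustrated with a toy timing model. This is a minimal sketch, not the paper's implementation: it assumes every layer has the same compute time and the same KV transfer time, and that a single link carries one layer's transfer at a time. The function names and the example numbers are hypothetical.

```python
def serial_kv_ready_ms(num_layers, compute_ms, transfer_ms):
    """Baseline: run prefill for all layers, then ship the whole KV cache."""
    return num_layers * compute_ms + num_layers * transfer_ms


def pipelined_kv_ready_ms(num_layers, compute_ms, transfer_ms):
    """Overlapped: a layer's KV transfer starts as soon as that layer has
    been computed and the link is free, while later layers keep computing."""
    compute_done = 0.0
    link_free = 0.0
    for _ in range(num_layers):
        compute_done += compute_ms
        start = max(compute_done, link_free)
        link_free = start + transfer_ms
    return link_free


if __name__ == "__main__":
    # Hypothetical numbers: 36 layers, 2.0 ms compute and 1.5 ms transfer each.
    print(serial_kv_ready_ms(36, 2.0, 1.5))     # 126.0
    print(pipelined_kv_ready_ms(36, 2.0, 1.5))  # 73.5
```

In this model, when transfer is faster than compute only the last layer's transfer remains exposed; when transfer is slower, the link becomes the bottleneck and overlap hides the compute instead.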

What carries the argument

HMA-Serve's phase-wise quantization, compute-transfer pipeline, and deferred dequantization that handle cross-vendor KV cache and software differences in memory-heterogeneous disaggregation.

If this is right

Up to 3.2× higher goodput than state-of-the-art memory-homogeneous methods
4.8× higher goodput-per-dollar
No measurable loss on generation-quality benchmarks
Effective across four Qwen3 models from 4B to 32B and three production traces

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Hardware providers may need to improve cross-vendor compatibility for KV caches to make such mixing routine.
Similar disaggregation could extend to other accelerator types beyond GDDR and HBM.
Datacenter operators could reduce capital costs by matching hardware more precisely to phase requirements.

Load-bearing premise

The latency and complexity added by cross-vendor KV cache format conversion, network transfers, and deferred dequantization stay low enough that overall goodput and quality stay superior.

What would settle it

A test where the heterogeneous system with the three techniques shows equal or lower goodput than homogeneous baselines or reduced generation quality on the same models and traces.

Figures

Figures reproduced from arXiv: 2606.29986 by James Yen, Mingyuan Xia, Yun Wang, Zhengwei Qi, Zhixiang Wei.

**Figure 2.** Architecture of HMA-Serve. A scheduler routes each request to a Tenstorrent prefill worker and an A100 decode worker over a 100 Gbps RoCE link. (1) Each phase runs at its native precision: BFP8 prefill, BF16 decode (§3.1). (2) A compute-transfer pipeline overlaps each layer’s compute, device-to-host push, and RDMA on the producer with RDMA receive, host-to-device copy, and dequantization on the consumer (§3…

**Figure 3.** HMA-Serve vs. the strongest disaggregation (DistServe-Homo) and colocation (Sarathi-Hetero) baselines…

**Figure 4.** Goodput per hardware budget, each system’s gp@90 divided by its cost in A100-equivalents (one A100 = 1.0, one p150 = 1/12). HMA-Serve and the oracle colocation share the same heterogeneous box (one A100 + the four-chip mesh, cost 1.33); DistServe-Homo pays for a second A100 (cost 2.0). Bars are normalized to DistServe-Homo (= 1), so each HMA-Serve bar’s height is its cost-efficiency advantage. The hatched …
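
The cost normalization described in the Figure 4 caption can be worked through as arithmetic. The sketch below uses only the cost figures stated in the caption (one A100 = 1.0, one p150 = 1/12, four p150 chips in the mesh); the function name is hypothetical. Plugging in the headline 3.2× goodput ratio is an assumption made here for illustration: the caption does not say that the 3.2× and 4.8× figures come from the same configuration, although the arithmetic happens to be consistent with that reading.

```python
from fractions import Fraction

A100_COST = Fraction(1)
P150_COST = Fraction(1, 12)

# One A100 plus the four-chip mesh, as in the caption (4/3, about 1.33).
hma_cost = A100_COST + 4 * P150_COST
# DistServe-Homo pays for a second A100.
homo_cost = 2 * A100_COST


def normalized_cost_efficiency(goodput_ratio, cost, baseline_cost):
    """Goodput-per-cost relative to a baseline whose bar is normalized to 1."""
    return Fraction(goodput_ratio) * Fraction(baseline_cost) / Fraction(cost)


if __name__ == "__main__":
    print(float(hma_cost))  # 1.3333333333333333
    ratio = normalized_cost_efficiency(Fraction("3.2"), hma_cost, homo_cost)
    print(float(ratio))     # 4.8
```

Under these costs, the heterogeneous box is 1.5× cheaper than the two-A100 baseline, so any goodput ratio is multiplied by 1.5 when converted to goodput-per-cost.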

Original abstract

LLM inference comprises a compute-bound prefill phase and a memory-bound decode phase, and recent systems disaggregate them onto separate hardware. Yet today's datacenter GPUs rely on costly HBM whose bandwidth sits almost entirely idle during prefill. LLM serving across memory-heterogeneous accelerators (MemHA) pairs GDDR-based accelerators for prefill with HBM-based GPUs for decode, promising lower cost without sacrificing performance. Pushed to its most economical form, MemHA serving is inherently cross-vendor, since the best-suited chip for each phase may come from a different vendor. This breaks two assumptions that single-vendor disaggregation takes for granted: a KV format both ends consume natively, and a shared software stack. We present HMA-Serve, a MemHA-centric disaggregated serving system pairing GDDR-based accelerators for prefill with HBM-based GPUs for decode efficiently. HMA-Serve achieves this through (1) phase-wise quantization, applying vendor-native low precision for high-throughput prefill while keeping decode in high-precision BF16, (2) a compute-transfer pipeline that overlaps each layer's KV cache transfer with later-layer prefill to reduce time-to-first-token (TTFT), and (3) deferred dequantization, shipping raw quantized bytes and reconstructing them lazily on the decode GPU to reduce network bandwidth and HBM usage. Across four Qwen3 models (4B–32B) and three production traces, HMA-Serve delivers up to 3.2× higher goodput than state-of-the-art memory-homogeneous methods and 4.8× higher goodput-per-dollar, with no measurable loss on generation-quality benchmarks.
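
The idea behind deferred dequantization (ship the quantized bytes, reconstruct on the decode side only when needed) can be sketched with a generic shared-exponent block format. This is an illustration under stated assumptions, not the vendor's BFP8 layout or the paper's kernels: the block size of 16, the int8 mantissa with one int8 exponent per block, and all function names are hypothetical, and float32 stands in for BF16 because NumPy has no native BF16 type.

```python
import numpy as np

BLOCK = 16  # assumed number of values sharing one exponent


def quantize_blocks(x):
    """Prefill side: int8 mantissas plus one power-of-two exponent per block."""
    blocks = x.astype(np.float32).reshape(-1, BLOCK)
    max_abs = np.abs(blocks).max(axis=1, keepdims=True)
    # Smallest exponent e with max_abs <= 2**e; all-zero blocks use e = 0.
    exp = np.where(max_abs > 0,
                   np.ceil(np.log2(np.maximum(max_abs, 1e-30))),
                   0).astype(np.int8)
    scale = np.exp2(exp.astype(np.float32)) / 127.0
    mant = np.clip(np.rint(blocks / scale), -127, 127).astype(np.int8)
    return mant, exp


def dequantize_blocks(mant, exp):
    """Decode side: rebuild float values from mantissas and block exponents."""
    scale = np.exp2(exp.astype(np.float32)) / 127.0
    return (mant.astype(np.float32) * scale).reshape(-1)


def pack(mant, exp):
    """Wire payload: raw mantissa bytes followed by raw exponent bytes."""
    return mant.tobytes() + exp.tobytes()


def unpack_and_dequantize(payload, n_values):
    """Called lazily on the receiver, only when the values are needed."""
    n_blocks = n_values // BLOCK
    mant = np.frombuffer(payload, dtype=np.int8, count=n_values)
    exp = np.frombuffer(payload, dtype=np.int8, count=n_blocks, offset=n_values)
    return dequantize_blocks(mant.reshape(n_blocks, BLOCK),
                             exp.reshape(n_blocks, 1))


if __name__ == "__main__":
    kv = np.random.default_rng(0).standard_normal(64).astype(np.float32)
    payload = pack(*quantize_blocks(kv))
    print(len(payload), "bytes on the wire vs", kv.size * 2, "bytes as BF16")
    print(np.abs(unpack_and_dequantize(payload, kv.size) - kv).max())
```

In this toy format the payload is 17 bytes per 16 values, versus 32 bytes for the same values in BF16, so the receiver holds roughly half the bytes until it chooses to reconstruct them. The exact ratio in a real system depends on the vendor format.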

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

HMA-Serve shows concrete goodput gains from GDDR-HBM disaggregation with three targeted techniques, but the cross-vendor overheads remain the unproven part.

The paper's core claim is that you can pair cheaper GDDR accelerators for the prefill phase with HBM GPUs for decode, even across vendors, and still beat memory-homogeneous baselines on goodput and cost. The three techniques—phase-wise quantization, compute-transfer overlap, and deferred dequantization—are presented as the fixes for KV format mismatches and extra network costs.

What stands out as new is the explicit handling of cross-vendor constraints that prior disaggregation work largely avoided. The evaluation covers four Qwen3 sizes and three real traces, which is a reasonable scope for a systems paper.

The results look promising on paper: up to 3.2× goodput and 4.8× goodput-per-dollar with no reported quality drop. That said, the concern raised under the load-bearing premise lands. The headline claims rest on the added latency and bandwidth from transfers, format conversion, and lazy dequantization staying small enough not to erase the savings. Without per-layer breakdowns, ablation numbers, or data on how those costs behave at 32B scale, it is difficult to judge whether the advantage holds under different traces or network conditions.

The work is aimed at people who run large-scale inference and care about hardware mix economics. It is worth sending to referees because the empirical setup is concrete and the problem is practical, even if the overhead accounting will probably need tightening in revision.

Referee Report

1 major / 1 minor

Summary. The manuscript presents HMA-Serve, a disaggregated LLM serving system that pairs GDDR-based accelerators for the compute-bound prefill phase with HBM-based GPUs for the memory-bound decode phase in a cross-vendor setting. It introduces three techniques—phase-wise quantization (vendor-native low precision for prefill, BF16 for decode), a compute-transfer pipeline overlapping KV cache transfers with later-layer prefill, and deferred dequantization (shipping quantized bytes for lazy reconstruction)—to handle KV cache format and software stack mismatches. Across four Qwen3 models (4B–32B) and three production traces, it reports up to 3.2× higher goodput and 4.8× higher goodput-per-dollar versus state-of-the-art memory-homogeneous methods, with no measurable loss on generation-quality benchmarks.

Significance. If the reported gains hold after accounting for cross-vendor overheads, the result would be significant for LLM serving systems by showing that costly HBM bandwidth is largely idle during prefill and that cheaper GDDR accelerators can be used effectively for that phase. The empirical evaluation across multiple model sizes and real traces, combined with explicit handling of cross-vendor KV cache issues, provides practical evidence for heterogeneous accelerator deployments. The work explicitly credits the three techniques for preserving TTFT and quality while improving cost-efficiency.

major comments (1)

[Evaluation / Results] The central claim that the three techniques keep cross-vendor overheads (KV cache format conversion, network transfers, deferred dequantization) low enough to deliver the 3.2× goodput and 4.8× goodput-per-dollar gains is load-bearing, yet the manuscript provides no quantitative breakdown (e.g., per-layer transfer time, dequantization compute cost, or bandwidth savings versus homogeneous baselines) in the evaluation. Without this, it is impossible to verify that the added latency and bandwidth costs do not scale with model size or trace characteristics and erode the net benefit.
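
The kind of per-layer breakdown requested here can be approximated on the back of an envelope. The sketch below is an estimate under stated assumptions, not a measurement from the paper: the model dimensions (8 KV heads, head dimension 128) and the 4096-token prompt are hypothetical, the link rate is the 100 Gbps quoted in the Figure 2 caption, and the estimate assumes ideal line rate with no protocol overhead or host copies.

```python
def kv_bytes_per_layer(num_tokens, num_kv_heads, head_dim, bytes_per_value):
    """KV cache size for one layer; the factor 2 covers keys and values."""
    return 2 * num_tokens * num_kv_heads * head_dim * bytes_per_value


def transfer_ms(num_bytes, link_gbps):
    """Ideal wire time in milliseconds at the given link rate."""
    return num_bytes * 8 / (link_gbps * 1e9) * 1e3


if __name__ == "__main__":
    # Hypothetical prompt and model shape; 2 bytes per value models BF16,
    # 1 byte per value roughly models an 8-bit format (ignoring exponents).
    bf16 = kv_bytes_per_layer(4096, 8, 128, 2)
    int8 = kv_bytes_per_layer(4096, 8, 128, 1)
    print(bf16, transfer_ms(bf16, 100))  # 16777216 bytes, about 1.34 ms
    print(int8, transfer_ms(int8, 100))  # 8388608 bytes, about 0.67 ms
```

Comparing this per-layer wire time against the per-layer prefill compute time indicates whether the transfer can hide behind later-layer compute, which is the quantity the referee asks the authors to report.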

minor comments (1)

[Abstract] The abstract and introduction would benefit from an explicit definition of 'goodput' (e.g., tokens per second under latency SLOs) and 'goodput-per-dollar' to allow direct comparison with prior disaggregation work.
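
One common reading of goodput in disaggregated-serving evaluations is the highest request rate at which a target fraction of requests meets its latency SLOs; the gp@90 label in Figure 4 suggests a 90% target, though the paper's exact definition is not given in the text here. The sketch below follows that reading as an assumption. The `Request` fields, the use of a time-per-output-token (TPOT) bound alongside TTFT, and the function names are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Request:
    ttft_ms: float  # time to first token
    tpot_ms: float  # time per output token (assumed second SLO)


def slo_attainment(requests, ttft_slo_ms, tpot_slo_ms):
    """Fraction of requests that meet both latency bounds."""
    if not requests:
        return 0.0
    ok = sum(1 for r in requests
             if r.ttft_ms <= ttft_slo_ms and r.tpot_ms <= tpot_slo_ms)
    return ok / len(requests)


def goodput(runs, ttft_slo_ms, tpot_slo_ms, target=0.9):
    """Highest tested request rate whose SLO attainment reaches the target.

    `runs` maps a request rate (req/s) to the requests measured at that rate.
    Returns 0.0 if no tested rate reaches the target.
    """
    passing = [rate for rate, reqs in runs.items()
               if slo_attainment(reqs, ttft_slo_ms, tpot_slo_ms) >= target]
    return max(passing, default=0.0)
```

Goodput-per-dollar then follows by dividing this rate by the hardware cost of the system under test, which is how the Figure 4 caption describes its normalization.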

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive feedback on our evaluation. We address the major comment below and will incorporate the requested breakdown in the revised manuscript.

Referee: The central claim that the three techniques keep cross-vendor overheads (KV cache format conversion, network transfers, deferred dequantization) low enough to deliver the 3.2× goodput and 4.8× goodput-per-dollar gains is load-bearing, yet the manuscript provides no quantitative breakdown (e.g., per-layer transfer time, dequantization compute cost, or bandwidth savings versus homogeneous baselines) in the evaluation. Without this, it is impossible to verify that the added latency and bandwidth costs do not scale with model size or trace characteristics and erode the net benefit.

Authors: We agree that an explicit quantitative breakdown of the overheads from phase-wise quantization, the compute-transfer pipeline, and deferred dequantization is needed to substantiate the net gains. In the revised manuscript we will add a dedicated subsection (and associated figures/tables) that reports per-layer KV cache transfer latency, dequantization compute cost on the decode GPU, and effective bandwidth savings relative to the memory-homogeneous baselines. These measurements will be shown for all four Qwen3 model sizes and across the three production traces to demonstrate that the overheads remain small and do not scale in a way that erodes the reported 3.2× goodput and 4.8× goodput-per-dollar improvements.

revision: yes

Circularity Check

0 steps flagged

No circularity; claims rest on empirical measurements

The paper describes an engineering system (HMA-Serve) with three concrete techniques and evaluates it through experiments on four models and three traces, reporting measured goodput and goodput-per-dollar gains. No equations, derivations, fitted parameters renamed as predictions, or load-bearing self-citations appear in the provided text. The central claims are directly tied to external benchmarks (production traces, generation-quality metrics) rather than reducing to inputs by construction, satisfying the self-contained criterion.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The contribution is an empirical systems design with no mathematical free parameters, axioms, or invented entities described in the abstract.
