pith. machine review for the scientific record.

arxiv: 2604.26103 · v2 · submitted 2026-04-28 · 💻 cs.AR · cs.AI · cs.DC · cs.LG

Recognition: unknown

AMMA: A Multi-Chiplet Memory-Centric Architecture for Low-Latency 1M Context Attention Serving

Authors on Pith: no claims yet

Pith reviewed 2026-05-07 14:11 UTC · model grok-4.3

classification 💻 cs.AR · cs.AI · cs.DC · cs.LG
keywords multi-chiplet architecture · memory-centric design · long-context attention · LLM serving · HBM-PNM · decode-phase optimization · hybrid parallelism

The pith

A memory-centric multi-chiplet architecture replaces GPU dies with HBM-PNM cubes to cut attention latency for million-token contexts.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper claims that GPU-centered systems waste power and die area on compute units that sit idle during memory-bound decode attention, especially as contexts reach one million tokens. AMMA instead builds around HBM-PNM cubes that roughly double the available memory bandwidth, and it adds a logic-die microarchitecture, a two-level hybrid parallelism scheme, and a reordered collective flow to convert that bandwidth into real gains. A sympathetic reader would see this as a direct attack on the primary latency bottleneck in long-context LLM serving. If correct, the design would let attention run at far lower latency and energy than current platforms without relying on compute-heavy GPUs.

Core claim

AMMA replaces GPU compute dies with HBM-PNM cubes to roughly double memory bandwidth for decode-phase attention. A logic-die microarchitecture exploits per-cube internal bandwidth under tight power and area limits, while a two-level hybrid parallelism scheme and a reordered collective flow cut intra-chip communication costs. A design-space exploration over per-cube compute power and die-to-die link bandwidth guides hardware choices. The result is 15.5 times lower attention latency and 6.9 times lower energy use than an NVIDIA H100 baseline on long-context workloads.
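As a sanity check on the headline mechanism, the sketch below estimates the memory-bound latency floor for one decode step by streaming an assumed GQA-style KV cache through two bandwidth figures. The model shape and the doubled-bandwidth figure are editorial assumptions, not numbers from the paper; only the H100-class HBM bandwidth is a published spec.

```python
# Back-of-the-envelope check of why decode attention is memory-bound and why
# doubling bandwidth should roughly halve its latency floor.
# All model-shape and bandwidth numbers below are illustrative assumptions,
# not figures taken from the paper.

def kv_cache_bytes(layers, kv_heads, head_dim, seq_len, bytes_per_elem=2):
    """Bytes of K and V that one decode step must stream from memory."""
    return 2 * layers * kv_heads * head_dim * seq_len * bytes_per_elem

def memory_bound_latency_ms(kv_bytes, bandwidth_gb_s):
    """Lower bound on per-token attention latency if it is purely bandwidth-limited."""
    return kv_bytes / (bandwidth_gb_s * 1e9) * 1e3

# Hypothetical GQA model roughly in the Qwen3-235B class (assumed shape).
kv = kv_cache_bytes(layers=94, kv_heads=4, head_dim=128, seq_len=1_000_000)
print(f"assumed KV cache at 1M tokens: {kv / 1e9:.0f} GB")

for label, bw in [("H100-class HBM, ~3350 GB/s", 3350),
                  ("doubled bandwidth (assumed), ~6700 GB/s", 6700)]:
    print(f"{label}: >= {memory_bound_latency_ms(kv, bw):.0f} ms per decoded token")
```

Under these assumptions, doubling bandwidth roughly halves the attention floor; the paper's larger 15.5x figure also folds in the logic-die design, the parallelism changes, and GPU-side inefficiencies.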

What carries the argument

HBM-PNM cubes paired with a custom logic-die microarchitecture that routes decode attention directly to internal memory bandwidth, supported by two-level hybrid parallelism and reordered collective flow to minimize die-to-die traffic.
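To see why a two-level schedule can matter at all, here is a toy collective-time model under assumed bandwidths: a flat 16-way ring all-reduce on die-to-die links versus a grouped reduce-scatter, inter-group all-reduce, and all-gather that keeps most traffic on faster local paths. It is an editorial sketch, not AMMA's actual parallelism or reordered flow.

```python
# Toy model of a flat vs. two-level collective when die-to-die (D2D) links are
# slower than links inside a group. Editorial sketch only: the payload size and
# bandwidths are assumptions, and this is not AMMA's actual schedule.

def flat_allreduce_s(n_bytes, p, bw):
    """Ring all-reduce over p devices on links of bandwidth bw (bytes/s)."""
    return 2 * (p - 1) / p * n_bytes / bw

def two_level_allreduce_s(n_bytes, groups, group_size, bw_local, bw_global):
    """Reduce-scatter inside each group, all-reduce across groups, all-gather back."""
    rs = (group_size - 1) / group_size * n_bytes / bw_local
    ar = 2 * (groups - 1) / groups * (n_bytes / group_size) / bw_global
    ag = (group_size - 1) / group_size * n_bytes / bw_local
    return rs + ar + ag

N = 32 * 1024 * 1024                      # 32 MB activation slice (assumed)
BW_LOCAL, BW_GLOBAL = 400e9, 100e9        # assumed fast local vs. slower D2D links

print(f"flat 16-way all-reduce : {flat_allreduce_s(N, 16, BW_GLOBAL) * 1e6:.0f} us")
print(f"4x4 two-level schedule : {two_level_allreduce_s(N, 4, 4, BW_LOCAL, BW_GLOBAL) * 1e6:.0f} us")
```

If local and die-to-die bandwidths were equal, the two schedules would take the same time in this model; the hierarchy only pays off because it moves the bulk of the traffic onto the faster links.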

If this is right

  • Decode attention for 1M-token contexts becomes feasible at interactive latencies on memory-centric hardware.
  • Power and die area currently spent on idle compute units in GPUs can be reallocated to more memory bandwidth.
  • Hardware designers gain concrete guidance on per-cube compute power and intra-chip link bandwidth trade-offs.
  • Serving systems can shift from GPU-centric to memory-centric designs without losing attention performance.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • The same memory-centric approach could apply to other memory-bound phases such as KV cache management or retrieval-augmented generation.
  • Future chiplet stacks might generalize the two-level parallelism pattern to additional AI primitives beyond attention.
  • If the bandwidth-to-performance translation holds, it suggests a broader shift away from compute-rich dies for inference workloads.

Load-bearing premise

The proposed logic-die microarchitecture, hybrid parallelism, and reordered flow can turn the extra per-cube bandwidth into proportional latency and energy reductions without large unmodeled overheads once built in silicon.

What would settle it

Fabricate a prototype of the AMMA chiplet stack, run 1M-token decode attention workloads on it, and compare measured latency and energy directly against an equivalent GPU baseline under identical conditions.

Figures

Figures reproduced from arXiv: 2604.26103 by Chenyang Zhou, Haotian Ye, Junsung Kim, Ohm Rishabh Venkatachalam, Po-An Tsai, Shuyi Pei, Won Woo Ro, Yangwook Kang, Yufei Ding, Zaifeng Pan, Zhengding Hu, Zhongkai Yu.

Figure 1: (a) Existing serving systems rely on GPUs for decode attention. AMMA replaces compute dies with HBM-PNM cubes.
Figure 2: PIM/PNM architecture overview. (a) Compute and HBM BW utilization. (b) Power breakdown (estimated).
Figure 3: Profiling of H100 (a) hardware utilization and (b)
Figure 4: A roofline analysis of Rubin and AMMA.
Figure 6: AMMA architecture hierarchy. (a) Package and cube level integration and (b) Core and SA level microarchitecture.
Figure 7: (a) Three conventional SA dataflows: WS, OS, IS.
Figure 8: Parallelism design for AMMA. (a) Naïve TP16 distributes every attention stage across all 16 cubes. (b) Our two-level hybrid parallelism.
Figure 9: Reordered collective operation flow for proj O. (a) Default two-level hybrid flow. (b) Our reordered flow.
Figure 10: Decode latency speedup, normalized to H100.
Figure 12: Ablation study.
Figure 13: Per-layer decode latency breakdown on Qwen3.
Figure 14: Batch size exploration for Qwen3-235B at seq=64K.
read the original abstract

All current LLM serving systems place the GPU at the center, from production-level attention-FFN disaggregation to NVIDIA's Rubin GPU-LPU heterogeneous platform. Even academic PIM/PNM proposals still treat the GPU as the central hub for cross-device communication. Yet the GPU's compute-rich architecture is fundamentally mismatched with the memory-bound nature of decode-phase attention, inflating serving latency while wasting power and die area on idle compute units. The problem is compounded as reasoning and agentic workloads push context lengths toward one million tokens, making attention latency the primary user-facing bottleneck. To address these inefficiencies, we present AMMA, a multi-chiplet, memory-centric architecture for low-latency long-context attention. AMMA replaces GPU compute dies with HBM-PNM cubes, roughly doubling the available memory bandwidth to better serve memory-bound attention workloads. To translate this bandwidth into proportional performance gains, we introduce (i) a logic-die microarchitecture that fully exploits per-cube internal bandwidth for decode attention under a minimal power and area budget, (ii) a two-level hybrid parallelism scheme, and (iii) a reordered collective flow that reduces intra-chip die-to-die communication overhead. We further conduct a design-space exploration over per-cube compute power and intra-chip D2D link bandwidth, providing actionable guidance for hardware designers. Evaluations show that AMMA achieves 15.5X lower attention latency and 6.9X lower energy consumption compared with the NVIDIA H100.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 1 minor

Summary. The paper proposes AMMA, a multi-chiplet memory-centric architecture for low-latency 1M-context attention serving in LLMs. It replaces GPU compute dies with HBM-PNM cubes to roughly double per-cube memory bandwidth, introduces a logic-die microarchitecture optimized for decode attention under tight power/area budgets, a two-level hybrid parallelism scheme, and a reordered collective flow to minimize intra-chip D2D communication. A design-space exploration over per-cube compute power and D2D link bandwidth is performed, with evaluations claiming 15.5X lower attention latency and 6.9X lower energy consumption versus the NVIDIA H100.

Significance. If the performance claims hold after validation, the work would be significant for computer architecture and LLM systems research by directly targeting the memory-bound nature of long-context decode attention rather than relying on GPU-centric designs. The explicit design-space exploration over compute power and intra-chip D2D bandwidth provides actionable guidance for hardware designers and is a clear strength. The approach of using HBM-PNM cubes with custom logic dies offers a concrete alternative to current disaggregated serving systems.

major comments (1)
  1. [Evaluation] Evaluation section: The headline claims of 15.5X latency reduction and 6.9X energy reduction versus H100 rest on the logic-die microarchitecture, two-level hybrid parallelism, and reordered collective flow converting doubled HBM bandwidth into proportional gains. However, the design-space exploration does not appear to include stress-testing against measured multi-chiplet D2D link latencies, control logic costs, or utilization losses at scale; without these, the proportional gains remain projections rather than demonstrated outcomes.
minor comments (1)
  1. [Abstract] Abstract: The quantitative claims are presented without any reference to workloads, baseline configurations, or simulation methodology; adding a single sentence summarizing these would improve clarity for readers.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive feedback and for recognizing the significance of targeting memory-bound decode attention with a memory-centric multi-chiplet design. We address the single major comment below with clarifications on our evaluation methodology and modeling choices.

read point-by-point responses
  1. Referee: [Evaluation] Evaluation section: The headline claims of 15.5X latency reduction and 6.9X energy reduction versus H100 rest on the logic-die microarchitecture, two-level hybrid parallelism, and reordered collective flow converting doubled HBM bandwidth into proportional gains. However, the design-space exploration does not appear to include stress-testing against measured multi-chiplet D2D link latencies, control logic costs, or utilization losses at scale; without these, the proportional gains remain projections rather than demonstrated outcomes.

    Authors: We thank the referee for this observation. Our design-space exploration (Section 5) systematically varies intra-chip D2D link bandwidth over a wide range of values while holding other parameters fixed; this directly exercises the sensitivity of end-to-end latency and energy to communication performance, serving as a proxy for link-latency stress testing. The logic-die microarchitecture is deliberately minimal, with control logic area and power explicitly budgeted and subtracted from the per-cube envelope so that overheads are not ignored. Utilization is evaluated at full 1M-token scale under the two-level hybrid parallelism and reordered collective schedule; the reported speedups already reflect the achieved utilization after all communication and synchronization costs. Because AMMA is a forward-looking proposal, we rely on validated cycle-accurate simulation rather than silicon measurements of future D2D links. We will add a short subsection in the revised manuscript that explicitly tabulates the control-logic overhead assumptions and includes an additional sensitivity sweep on utilization under pessimistic D2D latency assumptions. revision: partial
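For concreteness, a toy version of the promised pessimistic sweep might look like the sketch below: per-token attention time modeled as KV streaming plus one collective, with die-to-die bandwidth swept downward. Every value is an assumption for illustration and none of it comes from the authors' simulator.

```python
# Sketch of the kind of pessimistic D2D sensitivity sweep the rebuttal promises:
# a per-token attention time split into KV streaming plus one collective, with
# die-to-die bandwidth swept from optimistic to pessimistic. Every value here is
# an assumption for illustration, not the authors' simulator output.

def decode_step_ms(kv_gb, hbm_tb_s, comm_mb, d2d_gb_s):
    """Per-token attention time = memory streaming time + collective time."""
    mem_ms = kv_gb / hbm_tb_s        # GB / (TB/s) -> ms
    comm_ms = comm_mb / d2d_gb_s     # MB / (GB/s) -> ms
    return mem_ms + comm_ms

# GPU-like reference: same assumed KV cache, single-device bandwidth, no D2D hop.
reference_ms = decode_step_ms(kv_gb=192, hbm_tb_s=3.35, comm_mb=0, d2d_gb_s=1)

for d2d in (800, 400, 200, 100, 50):     # GB/s, optimistic -> pessimistic
    amma_ms = decode_step_ms(kv_gb=192, hbm_tb_s=6.7, comm_mb=256, d2d_gb_s=d2d)
    print(f"D2D {d2d:>3} GB/s: {reference_ms / amma_ms:.2f}x modeled speedup")
```

In this toy model the bandwidth-driven gain degrades only gradually as links slow down; the point of the real sweep would be to show where, if anywhere, it collapses for the actual schedule.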

Circularity Check

0 steps flagged

No circularity: performance claims arise from design-space exploration of the proposed architecture, not from self-referential definitions or fitted inputs.

full rationale

The manuscript presents AMMA as a hardware architecture proposal whose latency and energy improvements versus H100 are obtained via design-space exploration over compute power and D2D bandwidth. No equations, fitted parameters, or self-citations appear as load-bearing steps that reduce the headline claims to their own inputs by construction. The central results are simulation outcomes of the described logic-die microarchitecture, hybrid parallelism, and reordered collectives; they are not tautological renamings or predictions forced by prior self-citations. External comparison to a commercial GPU further keeps the evaluation independent of the paper's own definitions.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review yields no identifiable free parameters, axioms, or invented entities; the work rests on standard assumptions about HBM bandwidth scaling and D2D link costs that are not quantified here.

pith-pipeline@v0.9.0 · 5624 in / 1090 out tokens · 54009 ms · 2026-05-07T14:11:56.607029+00:00 · methodology

