pith. machine review for the scientific record.

arXiv: 2510.05497 · v5 · submitted 2025-10-07 · cs.DC · cs.AI · cs.AR · cs.LG

Patterns behind Chaos: Forecasting Data Movement for Efficient Large-Scale MoE LLM Inference

Pith reviewed 2026-05-18 09:49 UTC · model grok-4.3

classification cs.DC · cs.AI · cs.AR · cs.LG
keywords Mixture of Experts · LLM inference · data movement · expert placement · GPU serving · performance profiling · MoE models

The pith

Profiling data movement in large MoE models yields six insights that guide hardware and software designs, easing serving bottlenecks and delivering speedups of 6.6× on average on wafer-scale GPUs and up to 1.25× on existing GPUs.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper profiles expert selection and the resulting data movement across four 200B-to-1000B MoE LLMs using over 24,000 requests from varied workloads. Analysis from temporal and spatial angles yields six concrete insights about when and where data must move. These insights directly inform both hardware modifications and software placement strategies. A reader would care because the data movement caused by seemingly random expert selection, the mechanism that enables this model scale, is now the dominant bottleneck in multi-unit inference systems.

Core claim

Comprehensive data-movement-centric profiling of state-of-the-art large-scale MoE models reveals consistent patterns in expert activation. From temporal and spatial perspectives, the authors distill six key insights. These insights guide lightweight architectural changes on wafer-scale GPUs, which yield a 6.6× average speedup, and a prefill-aware expert placement algorithm on existing GPUs, which speeds up MoE computation by up to 1.25×.

What carries the argument

Six distilled insights from temporal and spatial analysis of expert selection traces that identify repeatable patterns in data movement and guide placement and architectural decisions.

If this is right

  • Lightweight architectural modifications on wafer-scale GPUs produce a 6.6 times average speedup across the four tested 200B-1000B models.
  • A prefill-aware expert placement algorithm derived from the insights achieves up to 1.25 times speedup on MoE computation on existing GPU systems.
  • The insights apply across the diverse workloads covered by the more than 24,000 profiled requests.
  • The same analysis framework can be reused on future model releases to generate new placement or hardware designs.
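
To make the idea of activation-driven expert placement concrete, here is a minimal sketch. It is not the paper's prefill-aware algorithm, whose details are not given on this page. It swaps in a generic greedy load-balancing heuristic (longest-processing-time-first) and assumes that per-expert activation counts from the prefill phase are already available. The names `place_experts` and `gpu_loads`, and the counts in the example, are hypothetical.

```python
import heapq
from typing import Dict, List


def place_experts(prefill_counts: Dict[int, int], num_gpus: int) -> List[List[int]]:
    """Assign experts to GPUs so that summed prefill activation counts are balanced.

    Assumption: a plain greedy heuristic, not the algorithm from the paper.
    Experts are visited from most to least activated, and each one is placed
    on the GPU that currently carries the smallest total count.
    """
    if num_gpus <= 0:
        raise ValueError("num_gpus must be positive")
    # Heap entries are (current_load, gpu_index); the least-loaded GPU pops first.
    heap = [(0, gpu) for gpu in range(num_gpus)]
    heapq.heapify(heap)
    placement: List[List[int]] = [[] for _ in range(num_gpus)]
    # Descending count, ties broken by expert id so the result is deterministic.
    ordered = sorted(prefill_counts.items(), key=lambda kv: (-kv[1], kv[0]))
    for expert, count in ordered:
        load, gpu = heapq.heappop(heap)
        placement[gpu].append(expert)
        heapq.heappush(heap, (load + count, gpu))
    return placement


def gpu_loads(placement: List[List[int]], prefill_counts: Dict[int, int]) -> List[int]:
    """Return the summed activation count carried by each GPU."""
    return [sum(prefill_counts[e] for e in experts) for experts in placement]


if __name__ == "__main__":
    # Made-up activation counts for six experts, purely for illustration.
    counts = {0: 90, 1: 60, 2: 50, 3: 40, 4: 30, 5: 30}
    layout = place_experts(counts, num_gpus=2)
    print(layout, gpu_loads(layout, counts))
```

This sketch balances load only. A real placement would also have to respect per-GPU memory capacity and the communication cost between experts that are frequently selected together, neither of which is modeled here.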

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • The patterns could be used to build predictive prefetching logic that anticipates which experts the current token batch will need before it needs them.
  • Combining the placement algorithm with existing expert parallelism techniques might further reduce inter-node traffic in multi-rack deployments.
  • If the temporal patterns prove stable, online monitoring of recent expert choices could dynamically adjust placement without full re-profiling.
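
The online-monitoring idea in the last bullet can be sketched as a sliding-window counter over recent expert choices. This is an editorial illustration under stated assumptions, not something the paper describes: the class name `RecentExpertMonitor`, the window size, and the notion of using the hottest experts as prefetch or re-placement candidates are all hypothetical.

```python
from collections import Counter, deque
from typing import Deque, Iterable, List, Tuple


class RecentExpertMonitor:
    """Track expert activations over the most recent `window` tokens.

    Assumption: one call to `observe` corresponds to one token in one MoE
    layer, and `experts` holds the ids the router selected for that token.
    """

    def __init__(self, window: int) -> None:
        if window <= 0:
            raise ValueError("window must be positive")
        self.window = window
        self._history: Deque[Tuple[int, ...]] = deque()
        self._counts: Counter = Counter()

    def observe(self, experts: Iterable[int]) -> None:
        """Record one token's experts and evict the oldest token once the window is full."""
        chosen = tuple(experts)
        self._history.append(chosen)
        self._counts.update(chosen)
        if len(self._history) > self.window:
            for expert in self._history.popleft():
                self._counts[expert] -= 1
                if self._counts[expert] == 0:
                    del self._counts[expert]

    def hottest(self, k: int) -> List[int]:
        """Return up to k expert ids, most frequently activated first."""
        ranked = sorted(self._counts.items(), key=lambda kv: (-kv[1], kv[0]))
        return [expert for expert, _ in ranked[:k]]


if __name__ == "__main__":
    monitor = RecentExpertMonitor(window=3)
    for token_experts in [(0, 1), (1, 2), (1, 3), (1, 3)]:
        monitor.observe(token_experts)
    print(monitor.hottest(2))  # [1, 3]
```

Whether such a window tracks real routing behavior well enough to drive placement depends on how stable the temporal patterns are, which is exactly the condition the bullet states.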

Load-bearing premise

The six patterns observed on the profiled workloads and current model scales will continue to hold and produce similar speedups on other workloads, larger future models, and real production serving conditions.

What would settle it

Run the same profiling on a new 500B+ MoE model with a different routing scheme or workload distribution and measure whether the reported speedups from the derived placement algorithm or modifications drop below 1.1 times.
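
The pass/fail criterion above amounts to a small calculation. The sketch below assumes latency measurements for a baseline and an optimized configuration on the same set of workloads. Aggregating with a geometric mean is our choice for averaging ratios; the page does not say how the paper averages its speedups. All function names and the numbers in the test data are hypothetical.

```python
import math
from typing import Sequence


def speedup(baseline_latency: float, optimized_latency: float) -> float:
    """Speedup as the ratio of baseline latency to optimized latency."""
    if baseline_latency <= 0 or optimized_latency <= 0:
        raise ValueError("latencies must be positive")
    return baseline_latency / optimized_latency


def geometric_mean(values: Sequence[float]) -> float:
    """Geometric mean of positive values, computed in log space."""
    if not values:
        raise ValueError("values must be non-empty")
    return math.exp(sum(math.log(v) for v in values) / len(values))


def speedup_holds(
    baseline: Sequence[float],
    optimized: Sequence[float],
    threshold: float = 1.1,
) -> bool:
    """Return True if the mean speedup across workloads stays at or above the threshold."""
    if len(baseline) != len(optimized):
        raise ValueError("baseline and optimized must have the same length")
    ratios = [speedup(b, o) for b, o in zip(baseline, optimized)]
    return geometric_mean(ratios) >= threshold
```

A result of `False` on a new model or workload would count as evidence against generalization in the sense described above.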

Figures

Figures reproduced from arXiv: 2510.05497 by Chenyang Zhou, Po-An Tsai, Shuyi Pei, Yangwook Kang, Yue Guan, Yufei Ding, Zhengding Hu, Zhongkai Yu, Zihao Yu.

Figure 1: MoE LLM models sizes and release dates. Bubble size indicates
Figure 2: Latency breakdown for different data movement in DeepSeekV3,
Figure 3: Inference process of MoE LLMs and the categorization method for our proposed data-centric profiling approach.
Figure 4: Layer-level temporal correlation heatmaps for (a) Deepseek and
Figure 5: Token-level temporal correlation heatmaps for (a) Deepseek, (b) Llama, and (c) Qwen, together with (d) statistical results across all layers,
Figure 6: Expert activation patterns remain consistent across prefill and
Figure 7: Single-expert spatial relation analysis of Llama4 layer 7 shows:
Figure 8: Co-activation probability heatmap of expert pair for (a) Deepseek
Figure 9: (a) Wafer-scale multi-chiplet GPU architecture with additional units highlighted in orange. (b) SoW (System-on-Wafer) technology structure. (c)
Figure 10: Proposed task allocation algorithm and data-driven predictor.
Figure 11: Throughput of MoE layers (Top) and hop number reduction ratio (Bottom). All figures are scaled to baseline.
Figure 12: Simulator validation with real data generated from 8xH100 DGX,
Figure 13: DRAM access breakdown for

Original abstract

Large-scale Mixture of Experts (MoE) Large Language Models (LLMs) have recently become the frontier open-weight models, achieving remarkable model capability similar to proprietary ones. But their random expert selection mechanism introduces significant data movement overhead that becomes the dominant bottleneck in multi-unit LLM serving systems. To understand the patterns underlying this data movement, we conduct comprehensive data-movement-centric profiling across four state-of-the-art large-scale MoE models released in 2025 (200B–1000B) using over 24,000 requests spanning diverse workloads. We perform systematic analysis from both temporal and spatial perspectives and distill six key insights to guide the design of diverse serving systems. We verify these insights on both future wafer-scale GPU architectures and existing GPU systems. On wafer-scale GPUs, lightweight architectural modifications guided by our insights yield a 6.6× average speedup across four 200B–1000B models. On existing GPU systems, our insights drive the design of a prefill-aware expert placement algorithm that achieves up to 1.25× speedup on MoE computation. Our work presents the first comprehensive data-centric analysis of large-scale MoE models together with a concrete design study applying the learned lessons. Our profiling traces are publicly available at https://huggingface.co/datasets/core12345/MoE_expert_selection_trace.
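
For readers who want a feel for the spatial side of this analysis, the following sketch estimates how often two experts are selected for the same token, the quantity a co-activation heatmap visualizes. It is a minimal illustration, not the authors' tooling. The input layout (one list of selected expert ids per token, for a single layer) is an assumption, since the schema of the published traces is not described on this page, and the function name is hypothetical.

```python
from collections import Counter
from itertools import combinations
from typing import Dict, Sequence, Tuple


def co_activation_probability(
    token_experts: Sequence[Sequence[int]],
) -> Dict[Tuple[int, int], float]:
    """Estimate P(experts i and j are both selected for a token) within one layer.

    Assumption: token_experts[t] lists the expert ids chosen for token t.
    Keys in the result are unordered pairs stored as (smaller_id, larger_id).
    """
    if not token_experts:
        return {}
    pair_counts: Counter = Counter()
    for experts in token_experts:
        # Deduplicate and sort so each unordered pair is counted once per token.
        for pair in combinations(sorted(set(experts)), 2):
            pair_counts[pair] += 1
    total_tokens = len(token_experts)
    return {pair: count / total_tokens for pair, count in pair_counts.items()}


if __name__ == "__main__":
    # Made-up selections for four tokens with top-2 routing.
    trace = [[0, 1], [0, 1], [0, 2], [1, 2]]
    print(co_activation_probability(trace))
```

Pairs with high co-activation probability are natural candidates for being placed close together, which is one way spatial patterns of this kind could feed a placement decision.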

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper conducts comprehensive profiling of data movement in four large-scale MoE LLMs (200B-1000B) using over 24,000 requests. It distills six key insights from temporal and spatial analysis to guide serving system design. These insights lead to lightweight modifications yielding a 6.6× average speedup on wafer-scale GPUs and a prefill-aware placement algorithm with up to 1.25× speedup on MoE computation on existing systems. Profiling traces are publicly released.

Significance. This work addresses a critical bottleneck in large MoE LLM inference by providing data-centric insights. The reported speedups demonstrate practical impact if the insights hold. Public traces enhance reproducibility. The significance hinges on whether the patterns generalize beyond the studied models and workloads.

major comments (2)
  1. §4: The six distilled insights are derived from the profiled 2025 models and workloads without reported cross-validation on alternate request distributions, routing strategies (e.g., top-k variants), or model scales; this is load-bearing for the claim that the insights guide design for diverse serving systems.
  2. §6.1, wafer-scale results: The 6.6× average speedup is obtained via simulated architectural modifications; the manuscript does not detail how the simulation captures realistic interconnect and memory hierarchy costs or whether the modifications introduce unaccounted overheads.
minor comments (2)
  1. The methods section should explicitly state the workload selection criteria and the exact composition (e.g., request length distributions, batch sizes) of the 24k requests to allow assessment of representativeness.
  2. Figure captions and axis labels in the spatial/temporal analysis sections would benefit from additional detail on the units and normalization used for data-movement volume.
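
One lightweight way to probe the representativeness concern raised above is to compare per-expert activation frequencies between two workloads. The sketch below computes those frequencies and their Pearson correlation. It is an illustration of the kind of check a reader could run, not an analysis from the paper. The input layout (one list of selected expert ids per token) and both function names are assumptions.

```python
import math
from collections import Counter
from typing import List, Sequence


def activation_frequency(
    token_experts: Sequence[Sequence[int]], num_experts: int
) -> List[float]:
    """Fraction of all expert selections that went to each expert id in [0, num_experts)."""
    counts = Counter(e for experts in token_experts for e in experts)
    total = sum(counts.values())
    if total == 0:
        return [0.0] * num_experts
    return [counts[e] / total for e in range(num_experts)]


def pearson(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation coefficient of two equally long sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


if __name__ == "__main__":
    # Made-up selections from two hypothetical workloads over three experts.
    workload_a = [[0, 1], [0, 1], [0, 2]]
    workload_b = [[0, 1], [0, 2], [0, 1], [0, 1]]
    freq_a = activation_frequency(workload_a, num_experts=3)
    freq_b = activation_frequency(workload_b, num_experts=3)
    print(pearson(freq_a, freq_b))
```

A correlation near 1 suggests the same experts stay hot across workloads, while a low or negative value would indicate that a placement tuned on one workload may not transfer to the other.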

Simulated Authors' Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We address each major comment below and describe the revisions we will make to strengthen the manuscript.

Point-by-point responses
  1. Referee: §4: The six distilled insights are derived from the profiled 2025 models and workloads without reported cross-validation on alternate request distributions, routing strategies (e.g., top-k variants), or model scales; this is load-bearing for the claim that the insights guide design for diverse serving systems.

    Authors: We acknowledge that the primary analysis is performed on the four 2025 models (200B–1000B) and the 24,000-request workloads described in the manuscript. These models already span a substantial range of scales and exhibit different expert-selection behaviors. The six insights are presented as observations from this data rather than universal claims. To address the concern directly, we will revise §4 to add an explicit discussion of the studied models’ diversity, note the absence of exhaustive cross-validation on alternate top-k variants or request distributions, and include a short analysis applying the insights to a small set of additional traces with varied request patterns drawn from the public dataset. We will also qualify the guidance for serving-system design accordingly.

    revision: partial

  2. Referee: §6.1, wafer-scale results: The 6.6× average speedup is obtained via simulated architectural modifications; the manuscript does not detail how the simulation captures realistic interconnect and memory hierarchy costs or whether the modifications introduce unaccounted overheads.

    Authors: We appreciate the referee highlighting the need for greater transparency in the simulation. The results in §6.1 are produced by a cycle-level simulator that incorporates published wafer-scale interconnect bandwidth figures and memory-access latencies. In the revised manuscript we will expand the methodology subsection to describe (1) how the interconnect topology and bandwidth are modeled, (2) how memory-hierarchy costs are accounted for, and (3) an overhead analysis of the proposed lightweight modifications together with sensitivity results showing that the reported speedups remain robust under reasonable variations in these parameters.

    revision: yes

Circularity Check

0 steps flagged

Empirical profiling study; speedups are direct measurements, not derived quantities

full rationale

The paper conducts data-movement profiling on four 2025 MoE models using 24k requests, distills six insights from observed temporal/spatial patterns, and reports measured speedups from architectural modifications and a placement algorithm. No equations, fitted parameters, or first-principles derivations are presented; claims rest on empirical observation and verification rather than any reduction to inputs by construction. No self-citation chains or ansatzes are load-bearing for the central results.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The work is empirical profiling and does not introduce new mathematical axioms, fitted parameters, or postulated entities; it relies on standard assumptions about GPU memory hierarchies and MoE routing behavior.

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Replication in Graph Partitioning and Scheduling Problems

    cs.DC · 2026-04 · unverdicted · novelty 5.0

    Replication reduces costs by 17-65% on average in hypergraph partitioning and 11-23% in DAG scheduling, sometimes eliminating communication needs entirely.

Reference graph

Works this paper leans on

82 extracted references · 82 canonical work pages · cited by 1 Pith paper · 8 internal anchors

  1. [1]

    Gpt-4 technical report,

    J. Achiam, S. Adler, S. Agarwal, L. Ahmad, I. Akkaya, F. L. Aleman, D. Almeida, J. Altenschmidt, S. Altman, S. Anadkatet al., “Gpt-4 technical report,”arXiv, 2023

  2. [2]

    Analyzing cuda workloads using a detailed gpu simulator,

    A. Bakhoda, G. L. Yuan, W. W. Fung, H. Wong, and T. M. Aamodt, “Analyzing cuda workloads using a detailed gpu simulator,” in2009 IEEE international symposium on performance analysis of systems and software, 2009

  3. [3]

    The gem5 simulator,

    N. Binkert, B. Beckmann, G. Black, S. K. Reinhardt, A. Saidi, A. Basu, J. Hestness, D. R. Hower, T. Krishna, S. Sardashtiet al., “The gem5 simulator,”ACM SIGARCH computer architecture news, 2011

  4. [4]

    Moe-lightning: High-throughput moe inference on memory-constrained gpus,

    S. Cao, S. Liu, T. Griggs, P. Schafhalter, X. Liu, Y . Sheng, J. E. Gon- zalez, M. Zaharia, and I. Stoica, “Moe-lightning: High-throughput moe inference on memory-constrained gpus,” inProceedings of the 30th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, 2025

  5. [5]

    Waferscale network switches,

    S. Chen, S. Pal, and R. Kumar, “Waferscale network switches,” in 2024 ACM/IEEE 51st Annual International Symposium on Computer Architecture (ISCA). IEEE, 2024, pp. 215–229

  6. [6]

    Chatbot arena: An open platform for evaluating llms by human preference,

    W.-L. Chiang, L. Zheng, Y . Sheng, A. N. Angelopoulos, T. Li, D. Li, B. Zhu, H. Zhang, M. Jordan, J. E. Gonzalezet al., “Chatbot arena: An open platform for evaluating llms by human preference,” inForty-first International Conference on Machine Learning, 2024

  7. [7]

    Lexi: Layer-adaptive active experts for efficient moe model inference,

    K. T. Chitty-Venkata, S. Madireddy, M. Emani, and V . Vishwanath, “Lexi: Layer-adaptive active experts for efficient moe model inference,” arXiv preprint arXiv:2509.02753, 2025

  8. [8]

    DeepSeekMoE: Towards Ultimate Expert Specialization in Mixture-of-Experts Language Models

    D. Dai, C. Deng, C. Zhao, R. Xu, H. Gao, D. Chen, J. Li, W. Zeng, X. Yu, Y . Wuet al., “Deepseekmoe: Towards ultimate expert spe- cialization in mixture-of-experts language models,”arXiv preprint arXiv:2401.06066, 2024

  9. [9]

    Cpelide: Efficient multi- chiplet gpu implicit synchronization,

    P. Dalmia, R. S. Kumar, and M. D. Sinclair, “Cpelide: Efficient multi- chiplet gpu implicit synchronization,” in2024 57th IEEE/ACM Interna- tional Symposium on Microarchitecture (MICRO), 2024

  10. [10]

    A complete survey on llm-based ai chatbots,

    S. K. Dam, C. S. Hong, Y . Qiao, and C. Zhang, “A complete survey on llm-based ai chatbots,”arXiv, 2024

  11. [11]

    Sida: Sparsity-inspired data-aware serving for efficient and scalable large mixture-of-experts models,

    Z. Du, S. Li, Y . Wu, X. Jiang, J. Sun, Q. Zheng, Y . Wu, A. Li, H. Li, and Y . Chen, “Sida: Sparsity-inspired data-aware serving for efficient and scalable large mixture-of-experts models,”Proceedings of Machine Learning and Systems, vol. 6, pp. 224–238, 2024

  12. [12]

    Fast inference of mixture-of-experts language models with offloading,

    A. Eliseev and D. Mazur, “Fast inference of mixture-of-experts language models with offloading,”arXiv preprint arXiv:2312.17238, 2023

  13. [13]

    Klotski: Efficient mixture-of-expert inference via expert- aware multi-batch pipeline,

    Z. Fang, Y . Huang, Z. Hong, Y . Lyu, W. Chen, Y . Yu, F. Yu, and Z. Zheng, “Klotski: Efficient mixture-of-expert inference via expert- aware multi-batch pipeline,” inProceedings of the 30th ACM Interna- tional Conference on Architectural Support for Programming Languages and Operating Systems, Volume 2, 2025

  14. [14]

    Heterogeneous die-to-die interfaces: Enabling more flexible chiplet interconnection systems,

    Y . Feng, D. Xiang, and K. Ma, “Heterogeneous die-to-die interfaces: Enabling more flexible chiplet interconnection systems,” inProceedings of the 56th Annual IEEE/ACM International Symposium on Microarchi- tecture, 2023, pp. 930–943

  15. [15]

    A scalable methodology for designing efficient interconnection network of chiplets,

    ——, “A scalable methodology for designing efficient interconnection network of chiplets,” in2023 IEEE International Symposium on High- Performance Computer Architecture (HPCA). IEEE, 2023, pp. 1059– 1071

  16. [16]

    S. T. from LMSYS Org. (2025) Deploying deepseek with pd disaggregation and large-scale expert parallelism on 96 h100 gpus. [Online]. Available: https://lmsys.org/blog/2025-05-05-large-scale-ep/

  17. [17]

    (2025) Together with sglang: Best practices for serving deepseek- r1 on h20-96g

    ——. (2025) Together with sglang: Best practices for serving deepseek- r1 on h20-96g. [Online]. Available: https://lmsys.org/blog/2025-09-26- sglang-ant-group/

  18. [18]

    Moetuner: Optimized mixture of expert serving with balanced expert placement and token routing.arXiv preprint arXiv:2502.06643,

    S. Go and D. Mahajan, “Moetuner: Optimized mixture of expert serving with balanced expert placement and token routing,”arXiv preprint arXiv:2502.06643, 2025

  19. [19]

    Lynx: Enabling efficient moe inference through dynamic batch-aware expert selection,

    V . Gupta, K. Sinha, A. Gavrilovska, and A. P. Iyer, “Lynx: Enabling efficient moe inference through dynamic batch-aware expert selection,” arXiv preprint arXiv:2411.08982, 2024

  20. [20]

    Lost in abstraction: Pitfalls of analyzing gpus at the intermediate language level,

    A. Gutierrez, B. M. Beckmann, A. Dutu, J. Gross, M. LeBeane, J. Kala- matianos, O. Kayiran, M. Poremba, B. Potter, S. Puthooret al., “Lost in abstraction: Pitfalls of analyzing gpus at the intermediate language level,” in2018 IEEE International Symposium on High Performance Computer Architecture (HPCA), 2018

  21. [21]

    Waferllm: Large language model inference at wafer scale,

    C. He, Y . Huang, P. Mu, Z. Miao, J. Xue, L. Ma, F. Yang, and L. Mai, “Waferllm: Large language model inference at wafer scale,” in19th USENIX Symposium on Operating Systems Design and Implementation (OSDI 25). USENIX Association, 2025

  22. [22]

    Chinese simpleqa: A chinese factuality evaluation for large language models,

    Y . He, S. Li, J. Liu, Y . Tan, W. Wang, H. Huang, X. Bu, H. Guo, C. Hu, B. Zhenget al., “Chinese simpleqa: A chinese factuality evaluation for large language models,”arXiv preprint arXiv:2411.07140, 2024

  23. [23]

    Measuring Massive Multitask Language Understanding

    D. Hendrycks, C. Burns, S. Basart, A. Zou, M. Mazeika, D. Song, and J. Steinhardt, “Measuring massive multitask language understanding,” arXiv preprint arXiv:2009.03300, 2020

  24. [24]

    Wafer-level integration of an advanced logic-memory system through the second-generation cowos technology,

    S. Hou, W. C. Chen, C. Hu, C. Chiu, K. Ting, T. Lin, W. Wei, W. Chiou, V . J. Lin, V . C. Changet al., “Wafer-level integration of an advanced logic-memory system through the second-generation cowos technology,” IEEE Transactions on Electron Devices, 2017

  25. [25]

    Integrated deep trench capacitor in si inter- poser for cowos heterogeneous integration,

    S. Hou, H. Hsia, C. Tsai, K. Ting, T. Yu, Y . Lee, F. Chen, W. Chiou, C. Wang, C. Wuet al., “Integrated deep trench capacitor in si inter- poser for cowos heterogeneous integration,” in2019 IEEE International Electron Devices Meeting (IEDM). IEEE, 2019, pp. 19–5

  26. [26]

    Cowos architecture evolution for next generation hpc on 2.5 d system in package,

    Y .-C. Hu, Y .-M. Liang, H.-P. Hu, C.-Y . Tan, C.-T. Shen, C.-H. Lee, and S. Hou, “Cowos architecture evolution for next generation hpc on 2.5 d system in package,” in2023 IEEE 73rd Electronic Components and Technology Conference (ECTC), 2023

  27. [27]

    Tutel: Adaptive mixture-of-experts at scale,

    C. Hwang, W. Cui, Y . Xiong, Z. Yang, Z. Liu, H. Hu, Z. Wang, R. Salas, J. Jose, P. Ramet al., “Tutel: Adaptive mixture-of-experts at scale,” Proceedings of Machine Learning and Systems, vol. 5, pp. 269–287, 2023

  28. [28]

    Pre-gated moe: An algorithm-system co-design for fast and scalable mixture-of-expert inference,

    R. Hwang, J. Wei, S. Cao, C. Hwang, X. Tang, T. Cao, and M. Yang, “Pre-gated moe: An algorithm-system co-design for fast and scalable mixture-of-expert inference,” in2024 ACM/IEEE 51st Annual Interna- tional Symposium on Computer Architecture (ISCA). IEEE, 2024, pp. 1018–1031

  29. [29]

    LiveCodeBench: Holistic and Contamination Free Evaluation of Large Language Models for Code

    N. Jain, K. Han, A. Gu, W.-D. Li, F. Yan, T. Zhang, S. Wang, A. Solar-Lezama, K. Sen, and I. Stoica, “Livecodebench: Holistic and contamination free evaluation of large language models for code,”arXiv preprint arXiv:2403.07974, 2024

  30. [30]

    Mixtral of Experts

    A. Q. Jiang, A. Sablayrolles, A. Roux, A. Mensch, B. Savary, C. Bam- ford, D. S. Chaplot, D. d. l. Casas, E. B. Hanna, F. Bressandet al., “Mixtral of experts,”arXiv preprint arXiv:2401.04088, 2024

  31. [31]

    Fiddler: CPU-GPU orchestration for fast inference of mixture-of-experts models,

    K. Kamahori, T. Tang, Y . Gu, K. Zhu, and B. Kasikci, “Fiddler: CPU-GPU orchestration for fast inference of mixture-of-experts models,” inThe Thirteenth International Conference on Learning Representations, 2025. [Online]. Available: https://openreview.net/ forum?id=N5fVv6PZGz

  32. [32]

    Comp-net: Command processor networking for efficient intra-kernel communications on gpus,

    M. LeBeane, K. Hamidouche, B. Benton, M. Breternitz, S. K. Reinhardt, and L. K. John, “Comp-net: Command processor networking for efficient intra-kernel communications on gpus,” inProceedings of the 27th International Conference on Parallel Architectures and Compilation Techniques, 2018

  33. [33]

    Optimizing mixture-of-experts inference time combining model deployment and communication scheduling,

    J. Li, S. Tripathi, L. Rastogi, Y . Lei, R. Pan, and Y . Xia, “Optimizing mixture-of-experts inference time combining model deployment and communication scheduling,”arXiv preprint arXiv:2410.17043, 2024. 12

  34. [34]

    Accelerating distributed moe training and inference with lina,

    J. Li, Y . Jiang, Y . Zhu, C. Wang, and H. Xu, “Accelerating distributed moe training and inference with lina,” in2023 USENIX Annual Technical Conference (USENIX ATC 23), 2023, pp. 945–959

  35. [35]

    Lucie: A universal chiplet-interposer design framework for plug-and-play integration,

    Z. Li and D. Wentzlaff, “Lucie: A universal chiplet-interposer design framework for plug-and-play integration,” in2024 57th IEEE/ACM International Symposium on Microarchitecture (MICRO). IEEE, 2024, pp. 423–436

  36. [36]

    DeepSeek-V3 Technical Report

    A. Liu, B. Feng, B. Xue, B. Wang, B. Wu, C. Lu, C. Zhao, C. Deng, C. Zhang, C. Ruanet al., “Deepseek-v3 technical report,”arXiv preprint arXiv:2412.19437, 2024

  37. [37]

    Llamax: Scaling linguis- tic horizons of llm by enhancing translation capabilities beyond 100 languages,

    Y . Lu, W. Zhu, L. Li, Y . Qiao, and F. Yuan, “Llamax: Scaling linguis- tic horizons of llm by enhancing translation capabilities beyond 100 languages,”arXiv, 2024

  38. [38]

    Moore’s law forever?

    M. Lundstrom, “Moore’s law forever?”Science, 2003

  39. [39]

    Embedded multi-die interconnect bridge (emib)–a high density, high bandwidth packaging interconnect,

    R. Mahajan, R. Sankman, N. Patel, D.-W. Kim, K. Aygun, Z. Qian, Y . Mekonnen, I. Salama, S. Sharan, D. Iyengaret al., “Embedded multi-die interconnect bridge (emib)–a high density, high bandwidth packaging interconnect,” in2016 IEEE 66th Electronic Components and Technology Conference (ECTC), 2016

  40. [40]

    Mixture of experts: a literature survey,

    S. Masoudnia and R. Ebrahimpour, “Mixture of experts: a literature survey,”Artificial Intelligence Review, pp. 275–293, 2014

  41. [41]

    (2025) Llama4 technical report

    Meta. (2025) Llama4 technical report. [Online]. Available: https: //ai.meta.com/blog/llama-4-multimodal-intelligence/

  42. [42]

    OLMoE: Open Mixture-of-Experts Language Models

    N. Muennighoff, L. Soldaini, D. Groeneveld, K. Lo, J. Morrison, S. Min, W. Shi, P. Walsh, O. Tafjord, N. Lambertet al., “Olmoe: Open mixture- of-experts language models,”arXiv preprint arXiv:2409.02060, 2024

  43. [43]

    Using an llm to help with code understanding,

    D. Nam, A. Macvean, V . Hellendoorn, B. Vasilescu, and B. Myers, “Using an llm to help with code understanding,” inProceedings of the IEEE/ACM 46th International Conference on Software Engineering, ser. ICSE ’24, 2024

  44. [44]

    Nvidia blackwell architecture overview,

    NVIDIA, “Nvidia blackwell architecture overview,” https://resources. nvidia.com/en-us-blackwell-architecture, 2025

  45. [45]

    Nvidia gtc 2025,

    ——, “Nvidia gtc 2025,” https://www.nvidia.com/gtc/, 2025

  46. [46]

    Scar: Schedul- ing multi-model ai workloads on heterogeneous multi-chiplet module accelerators,

    M. Odema, L. Chen, H. Kwon, and M. A. Al Faruque, “Scar: Schedul- ing multi-model ai workloads on heterogeneous multi-chiplet module accelerators,” in2024 57th IEEE/ACM International Symposium on Microarchitecture (MICRO). IEEE, 2024, pp. 565–579

  47. [47]

    Mooncake: Trading more storage for less computation—a {KVCache-centric}architecture for serving{LLM}chatbot,

    R. Qin, Z. Li, W. He, J. Cui, F. Ren, M. Zhang, Y . Wu, W. Zheng, and X. Xu, “Mooncake: Trading more storage for less computation—a {KVCache-centric}architecture for serving{LLM}chatbot,” in23rd USENIX Conference on File and Storage Technologies (FAST 25), 2025

  48. [48]

    Fred: A wafer-scale fabric for 3d parallel dnn training,

    S. Rashidi, W. Won, S. Srinivasan, P. Gupta, and T. Krishna, “Fred: A wafer-scale fabric for 3d parallel dnn training,” inProceedings of the 52nd Annual International Symposium on Computer Architecture, 2025, pp. 34–48

  49. [49]

    Correlation coefficients: appropriate use and interpretation,

    P. Schober, C. Boer, and L. A. Schwarte, “Correlation coefficients: appropriate use and interpretation,”Anesthesia & analgesia, 2018

  50. [50]

    Simba: Scaling deep-learning inference with multi-chip-module-based architecture,

    Y . S. Shao, J. Clemons, R. Venkatesan, B. Zimmer, M. Fojtik, N. Jiang, B. Keller, A. Klinefelter, N. Pinckney, P. Rainaet al., “Simba: Scaling deep-learning inference with multi-chip-module-based architecture,” in Proceedings of the 52nd annual IEEE/ACM international symposium on microarchitecture, 2019, pp. 14–27

  51. [51]

    Flexgen: High-throughput generative inference of large language models with a single gpu,

    Y . Sheng, L. Zheng, B. Yuan, Z. Li, M. Ryabinin, B. Chen, P. Liang, C. R ´e, I. Stoica, and C. Zhang, “Flexgen: High-throughput generative inference of large language models with a single gpu,” inInternational Conference on Machine Learning, 2023

  52. [52]

    Sow-x: A novel system-on-wafer technology for next generation ai server application,

    P.-C. Shih, A.-J. Su, K.-H. Tam, T.-C. Huang, K. Chuang, and J. Yeh, “Sow-x: A novel system-on-wafer technology for next generation ai server application,” in2025 IEEE 75th Electronic Components and Technology Conference (ECTC). IEEE, 2025

  53. [53]

    Signal integrity of die-to-die interface with advanced packages for co-packaged optics,

    J. Shin, H. Eslampour, S. Jeong, W. Kim, S. Yong, S.-O. Ahn, E. Park, and S. Song, “Signal integrity of die-to-die interface with advanced packages for co-packaged optics,” in2024 IEEE 33rd Conference on Electrical Performance of Electronic Packaging and Systems (EPEPS), 2024

  54. [54]

    Mixture of cache- conditional experts for efficient mobile device inference,

    A. Skliar, T. van Rozendaal, R. Lepert, T. Boinovski, M. Van Baalen, M. Nagel, P. Whatmough, and B. E. Bejnordi, “Mixture of cache- conditional experts for efficient mobile device inference,”arXiv preprint arXiv:2412.00099, 2024

  55. [55]

    Amd instinct™ mi300x accelerator: Packaging and architecture co-optimization,

    A. Smith, G. H. Loh, J. Wuu, S. Naffziger, T. Huang, H. McIntyre, R. Mangaser, W. Jung, and R. Swaminathan, “Amd instinct™ mi300x accelerator: Packaging and architecture co-optimization,” in2024 IEEE Symposium on VLSI Technology and Circuits (VLSI Technology and Circuits). IEEE, 2024

  56. [56]

    Hunyuan-large: An open-source moe model with 52 billion activated parameters by tencent,

    X. Sun, Y . Chen, Y . Huang, R. Xie, J. Zhu, K. Zhang, S. Li, Z. Yang, J. Han, X. Shuet al., “Hunyuan-large: An open-source moe model with 52 billion activated parameters by tencent,”arXiv preprint arXiv:2411.02265, 2024

  57. [57]

    Mgpusim: Enabling multi-gpu performance modeling and optimization,

    Y . Sun, T. Baruah, S. A. Mojumder, S. Dong, X. Gong, S. Treadway, Y . Bao, S. Hance, C. McCardwell, V . Zhaoet al., “Mgpusim: Enabling multi-gpu performance modeling and optimization,” inProceedings of the 46th International Symposium on Computer Architecture, 2019

  58. [58]

    Coserve: Efficient collaboration-of-experts (coe) model inference with limited memory,

    J. Suo, X. Liao, L. Xiao, L. Ruan, J. Wang, X. Su, and Z. Huo, “Coserve: Efficient collaboration-of-experts (coe) model inference with limited memory,” inProceedings of the 30th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, 2025

  59. [59]

    emoe: Task-aware memory efficient mixture-of-experts-based (moe) model inference,

    S. Tairin, S. Mahmud, H. Shen, and A. Iyer, “emoe: Task-aware memory efficient mixture-of-experts-based (moe) model inference,” arXiv preprint arXiv:2503.06823, 2025

  60. [60]

    The microarchitecture of dojo, tesla’s exa-scale computer,

    E. Talpes, D. D. Sarma, D. Williams, S. Arora, T. Kunjan, B. Floering, A. Jalote, C. Hsiong, C. Poorna, V. Samant et al., “The microarchitecture of dojo, tesla’s exa-scale computer,” IEEE Micro, 2023

  61. [61]

    Dojo: The microarchitecture of tesla’s exa-scale computer,

    E. Talpes, D. Williams, and D. D. Sarma, “Dojo: The microarchitecture of tesla’s exa-scale computer,” in 2022 IEEE Hot Chips 34 Symposium (HCS), 2022

  62. [62]

    Nn-baton: Dnn workload orchestration and chiplet granularity exploration for multichip accelerators,

    Z. Tan, H. Cai, R. Dong, and K. Ma, “Nn-baton: Dnn workload orchestration and chiplet granularity exploration for multichip accelerators,” in 2021 ACM/IEEE 48th Annual International Symposium on Computer Architecture (ISCA). IEEE, 2021, pp. 1013–1026

  63. [63]

    Kimi K2: Open Agentic Intelligence

    K. Team, Y. Bai, Y. Bao, G. Chen, J. Chen, N. Chen, R. Chen, Y. Chen, Y. Chen, Y. Chen et al., “Kimi k2: Open agentic intelligence,” arXiv preprint arXiv:2507.20534, 2025

  64. [64]

    (2025) Tsmc’s next generation of system-on-wafer package will make today’s cpus and gpus look pathetically feeble in comparison

    TSMC. (2025) Tsmc’s next generation of system-on-wafer package will make today’s cpus and gpus look pathetically feeble in comparison. [Online]. Available: https://www.pcgamer.com/hardware/processors/tsmcs-next-generation-of-system-on-wafer-packaging-will-make-todays-cpus-and-gpus-look-pathetically-feeble-in-comparison/

  65. [65]

    Rag-based llm chatbot using llama-2,

    S. Vakayil, D. S. Juliet, S. Vakayil et al., “Rag-based llm chatbot using llama-2,” in 2024 7th International Conference on Devices, Circuits and Systems (ICDCS), 2024

  66. [66]

    Attention is all you need,

    A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin, “Attention is all you need,” Advances in neural information processing systems, 2017

  67. [67]

    Mmlu-pro: A more robust and challenging multi-task language understanding benchmark,

    Y. Wang, X. Ma, G. Zhang, Y. Ni, A. Chandra, S. Guo, W. Ren, A. Arulraj, X. He, Z. Jiang et al., “Mmlu-pro: A more robust and challenging multi-task language understanding benchmark,” Advances in Neural Information Processing Systems, 2024

  68. [68]

    Codet5: Identifier-aware unified pre-trained encoder-decoder models for code understanding and generation,

    Y. Wang, W. Wang, S. Joty, and S. C. H. Hoi, “Codet5: Identifier-aware unified pre-trained encoder-decoder models for code understanding and generation,” arXiv, 2021

  69. [69]

    Astra-sim2.0: Modeling hierarchical networks and disaggregated systems for large-model training at scale,

    W. Won, T. Heo, S. Rashidi, S. Sridharan, S. Srinivasan, and T. Krishna, “Astra-sim2.0: Modeling hierarchical networks and disaggregated systems for large-model training at scale,” in 2023 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS), 2023

  70. [70]

    Loongserve: Efficiently serving long-context large language models with elastic sequence parallelism,

    B. Wu, S. Liu, Y. Zhong, P. Sun, X. Liu, and X. Jin, “Loongserve: Efficiently serving long-context large language models with elastic sequence parallelism,” in Proceedings of the ACM SIGOPS 30th Symposium on Operating Systems Principles, 2024, pp. 640–654

  71. [71]

    Wsc-llm: Efficient llm service and architecture co-exploration for wafer-scale chips,

    Z. Xu, D. Kong, J. Liu, J. Li, J. Hou, X. Dai, C. Li, S. Wei, Y. Hu, and S. Yin, “Wsc-llm: Efficient llm service and architecture co-exploration for wafer-scale chips,” in Proceedings of the 52nd Annual International Symposium on Computer Architecture, 2025, pp. 1–17

  72. [72]

    Qwen3 Technical Report

    A. Yang, A. Li, B. Yang, B. Zhang, B. Hui, B. Zheng, B. Yu, C. Gao, C. Huang, C. Lv et al., “Qwen3 technical report,” arXiv preprint arXiv:2505.09388, 2025

  73. [73]

    Pd constraint-aware physical/logical topology co-design for network on wafer,

    Q. Yang, T. Wei, S. Guan, C. Li, H. Shang, J. Deng, H. Wang, C. Li, L. Wang, Y. Zhang et al., “Pd constraint-aware physical/logical topology co-design for network on wafer,” in Proceedings of the 52nd Annual International Symposium on Computer Architecture, 2025, pp. 49–64

  74. [74]

    Exploiting inter-layer expert affinity for accelerating mixture-of-experts model inference,

    J. Yao, Q. Anthony, A. Shafi, H. Subramoni, and D. K. D. Panda, “Exploiting inter-layer expert affinity for accelerating mixture-of-experts model inference,” in 2024 IEEE International Parallel and Distributed Processing Symposium (IPDPS). IEEE, 2024

  75. [75]

    Cramming a data center into one cabinet, a co-exploration of computing and hardware architecture of waferscale chip,

    X. Yu, D. Jiang, J. Deng, J. Liu, C. Li, S. Yin, and Y. Hu, “Cramming a data center into one cabinet, a co-exploration of computing and hardware architecture of waferscale chip,” in Proceedings of the 52nd Annual International Symposium on Computer Architecture, 2025, pp. 631–645

  76. [76]

    Cambricon-llm: A chiplet-based hybrid architecture for on-device inference of 70b llm,

    Z. Yu, S. Liang, T. Ma, Y. Cai, Z. Nan, D. Huang, X. Song, Y. Hao, J. Zhang, T. Zhi et al., “Cambricon-llm: A chiplet-based hybrid architecture for on-device inference of 70b llm,” in 2024 57th IEEE/ACM International Symposium on Microarchitecture (MICRO), 2024

  77. [77]

    Duplex: A device for large language models with mixture of experts, grouped query attention, and continuous batching,

    S. Yun, K. Kyung, J. Cho, J. Choi, J. Kim, B. Kim, S. Lee, K. Sohn, and J. H. Ahn, “Duplex: A device for large language models with mixture of experts, grouped query attention, and continuous batching,” in 2024 57th IEEE/ACM International Symposium on Microarchitecture (MICRO), 2024

  78. [78]

    SmartMoE: Efficiently training sparsely-activated models through combining offline and online parallelization,

    M. Zhai, J. He, Z. Ma, Z. Zong, R. Zhang, and J. Zhai, “SmartMoE: Efficiently training sparsely-activated models through combining offline and online parallelization,” in 2023 USENIX Annual Technical Conference (USENIX ATC 23), 2023, pp. 961–975

  79. [79]

    COMET: Fine-grained computation-communication overlapping for mixture-of-experts,

    S. Zhang, N. Zheng, H. Lin, Z. Jiang, W. Bao, C. Jiang, Q. Hou, W. Cui, S. Zheng, L.-W. Chang, Q. Chen, and X. Liu, “COMET: Fine-grained computation-communication overlapping for mixture-of-experts,” in Eighth Conference on Machine Learning and Systems, 2025. [Online]. Available: https://openreview.net/forum?id=fGgQS5VW09

  80. [80]

    Sglang: Efficient execution of structured language model programs,

    L. Zheng, L. Yin, Z. Xie, C. L. Sun, J. Huang, C. H. Yu, S. Cao, C. Kozyrakis, I. Stoica, J. E. Gonzalez et al., “Sglang: Efficient execution of structured language model programs,” Advances in neural information processing systems, 2024

Showing first 80 references.