arXiv: 2510.05497 · v5 · submitted 2025-10-07 · cs.DC · cs.AI · cs.AR · cs.LG

Patterns behind Chaos: Forecasting Data Movement for Efficient Large-Scale MoE LLM Inference

Zhongkai Yu, Yue Guan, Zihao Yu, Chenyang Zhou, Zhengding Hu, Shuyi Pei, Yangwook Kang, Yufei Ding, Po-An Tsai

Pith reviewed 2026-05-18 09:49 UTC · model grok-4.3

keywords Mixture of Experts · LLM inference · data movement · expert placement · GPU serving · performance profiling · MoE models

The pith

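Profiling data movement in large MoE models yields six insights that guide hardware and placement designs for easing serving bottlenecks, with reported speedups of 6.6 times on average on wafer-scale GPUs and up to 1.25 times on existing GPUs.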

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper profiles expert selection and resulting data movement across four 200B-to-1000B MoE LLMs using over 24,000 requests from varied workloads. Analysis from temporal and spatial angles yields six concrete insights about when and where data must move. These insights directly inform both hardware modifications and software placement strategies. A reader would care because random expert routing, while enabling model scale, now dominates the cost of multi-unit inference systems.
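As a rough illustration of what a temporal look at expert selection could involve, the sketch below computes the Pearson correlation between per-expert activation counts in two consecutive token windows. The trace layout (a list of per-token expert-ID lists), the window split, and the function names are assumptions made for this example; they are not taken from the paper or from the format of its released traces.

```python
from math import sqrt

def activation_counts(window, num_experts):
    """Count how often each expert ID is selected within a window of tokens."""
    counts = [0] * num_experts
    for token_experts in window:
        for expert_id in token_experts:
            counts[expert_id] += 1
    return counts

def pearson(x, y):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / sqrt(var_x * var_y)

# Hypothetical toy trace: each token is routed to 2 of 4 experts.
trace = [[0, 1], [0, 1], [0, 2], [0, 1], [0, 1], [1, 3]]
first_window, second_window = trace[:3], trace[3:]

counts_first = activation_counts(first_window, num_experts=4)
counts_second = activation_counts(second_window, num_experts=4)
correlation = pearson(counts_first, counts_second)
print(counts_first, counts_second, round(correlation, 3))
```

A correlation near 1 would suggest that the experts that were hot in one window stay hot in the next; values near 0 would suggest little carry-over. How the paper actually defines its layer-level and token-level correlations is not stated on this page.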

Core claim

Comprehensive data-movement-centric profiling of state-of-the-art large-scale MoE models reveals consistent patterns in expert activation. From temporal and spatial perspectives the authors distill six key insights. These insights guide two designs: lightweight architectural changes on wafer-scale GPUs, which yield a 6.6 times average speedup across the four models, and a prefill-aware expert placement algorithm on existing GPUs, which achieves up to 1.25 times speedup on MoE computation.

What carries the argument

Six distilled insights from temporal and spatial analysis of expert selection traces that identify repeatable patterns in data movement and guide placement and architectural decisions.
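One spatial quantity that such traces could support is the co-activation probability of expert pairs, which the figure list mentions as a heatmap. The sketch below estimates it as the fraction of tokens that select both experts of a pair. This joint-probability definition, the trace layout, and the function name are assumptions for illustration, not the paper's stated method.

```python
from collections import Counter
from itertools import combinations

def coactivation_probability(trace):
    """Return {(i, j): fraction of tokens that select both expert i and expert j}, i < j."""
    pair_counts = Counter()
    for token_experts in trace:
        # De-duplicate and sort so each unordered pair is counted once per token.
        for pair in combinations(sorted(set(token_experts)), 2):
            pair_counts[pair] += 1
    total_tokens = len(trace)
    return {pair: count / total_tokens for pair, count in pair_counts.items()}

# Hypothetical toy trace: each token is routed to 2 of 4 experts.
toy_trace = [[0, 1], [0, 1], [0, 2], [1, 3]]
probs = coactivation_probability(toy_trace)
for pair, p in sorted(probs.items()):
    print(pair, p)
```

Pairs with a high estimate would be natural candidates for placement on the same or neighboring devices, since a token that needs one of them often needs the other.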

If this is right

Lightweight architectural modifications on wafer-scale GPUs produce a 6.6 times average speedup across the four tested 200B-1000B models.
A prefill-aware expert placement algorithm derived from the insights achieves up to 1.25 times speedup on MoE computation on existing GPU systems.
The insights apply across diverse workloads spanning the 24,000 profiled requests.
The same analysis framework can be reused on future model releases to generate new placement or hardware designs.
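To make the idea of a prefill-aware placement concrete, here is a minimal sketch that uses per-expert activation counts gathered during prefill to spread experts across devices. It is a generic greedy load-balancing heuristic (hottest expert first, onto the least-loaded device), not the algorithm from the paper; the function name, the count dictionary, and the two-device setup are all hypothetical. It also ignores memory capacity and interconnect distance, which a real placement would need to consider.

```python
import heapq

def place_experts(prefill_counts, num_devices):
    """Greedy placement: assign experts in descending order of prefill
    activation count, each to the device with the smallest load so far."""
    heap = [(0, device) for device in range(num_devices)]  # (load, device_id)
    heapq.heapify(heap)
    placement = {}
    # Sort by descending count; break ties by expert ID for determinism.
    ordered = sorted(prefill_counts.items(), key=lambda kv: (-kv[1], kv[0]))
    for expert_id, count in ordered:
        load, device = heapq.heappop(heap)
        placement[expert_id] = device
        heapq.heappush(heap, (load + count, device))
    return placement

# Hypothetical prefill activation counts for six experts.
prefill_counts = {0: 90, 1: 60, 2: 50, 3: 40, 4: 10, 5: 10}
placement = place_experts(prefill_counts, num_devices=2)

device_loads = [0, 0]
for expert_id, device in placement.items():
    device_loads[device] += prefill_counts[expert_id]
print(placement, device_loads)
```

The design choice here is simplicity: a heap keeps the least-loaded device available in logarithmic time, and sorting by count first tends to give a balanced result for skewed activation distributions.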

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The patterns could be used to build predictive prefetching logic that anticipates expert needs before the current token batch.
Combining the placement algorithm with existing expert parallelism techniques might further reduce inter-node traffic in multi-rack deployments.
If the temporal patterns prove stable, online monitoring of recent expert choices could dynamically adjust placement without full re-profiling.
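The online-monitoring idea above can be sketched as a sliding window over recent expert choices that reports the currently hottest experts, which a prefetcher or placement adjuster could then act on. This is an editorial illustration only; the class name, window size, and ranking rule are hypothetical and do not come from the paper.

```python
from collections import Counter, deque

class RecentExpertMonitor:
    """Track expert selections over the most recent `window_size` tokens."""

    def __init__(self, window_size):
        self.window_size = window_size
        self.window = deque()
        self.counts = Counter()

    def observe(self, token_experts):
        """Record one token's expert choices and evict the oldest token if needed."""
        self.window.append(tuple(token_experts))
        self.counts.update(token_experts)
        if len(self.window) > self.window_size:
            evicted = self.window.popleft()
            self.counts.subtract(evicted)

    def hot_experts(self, k):
        """Return up to k expert IDs with the highest counts in the window."""
        ranked = sorted(self.counts.items(), key=lambda kv: (-kv[1], kv[0]))
        return [expert_id for expert_id, count in ranked[:k] if count > 0]

# Hypothetical stream whose hot set shifts from expert 0 to experts 3 and 4.
monitor = RecentExpertMonitor(window_size=3)
for token_experts in [[0, 1], [0, 2], [0, 1], [3, 4], [3, 4]]:
    monitor.observe(token_experts)
print(monitor.hot_experts(2))
```

Whether such a monitor is useful depends on the temporal patterns being stable over the window length, which is exactly the condition the bullet above leaves open.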

Load-bearing premise

The six patterns observed on the profiled workloads and current model scales will continue to hold and produce similar speedups on other workloads, larger future models, and real production serving conditions.

What would settle it

Run the same profiling on a new 500B+ MoE model with a different routing scheme or workload distribution and measure whether the reported speedups from the derived placement algorithm or modifications drop below 1.1 times.
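The check described above reduces to simple arithmetic: divide baseline latency by optimized latency and compare against the 1.1 times threshold. The sketch below shows that calculation; the latency values are made up for illustration and the function names are hypothetical.

```python
def speedup(baseline_latency, optimized_latency):
    """Speedup factor: how many times faster the optimized run is."""
    if baseline_latency <= 0 or optimized_latency <= 0:
        raise ValueError("latencies must be positive")
    return baseline_latency / optimized_latency

def insights_transfer(latency_pairs, threshold=1.1):
    """True if every (baseline, optimized) pair reaches the threshold speedup."""
    return all(speedup(b, o) >= threshold for b, o in latency_pairs)

# Made-up latencies (arbitrary units) for two hypothetical new workloads.
pairs = [(125.0, 100.0), (105.0, 100.0)]
print([speedup(b, o) for b, o in pairs])
print(insights_transfer(pairs))
```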

Figures

Figures reproduced from arXiv: 2510.05497 by Chenyang Zhou, Po-An Tsai, Shuyi Pei, Yangwook Kang, Yue Guan, Yufei Ding, Zhengding Hu, Zhongkai Yu, Zihao Yu.

**Figure 2.** Latency breakdown for different data movement in DeepSeekV3,

**Figure 3.** Inference process of MoE LLMs and the categorization method for our proposed data-centric profiling approach.

**Figure 4.** Layer-level temporal correlation heatmaps for (a) Deepseek and

**Figure 5.** Token-level temporal correlation heatmaps for (a) Deepseek, (b) Llama, and (c) Qwen, together with (d) statistical results across all layers,

**Figure 6.** Expert activation patterns remain consistent across prefill and

**Figure 7.** Single-expert spatial relation analysis of Llama4 layer 7 shows:

**Figure 8.** Co-activation probability heatmap of expert pair for (a) Deepseek

**Figure 9.** (a) Wafer-scale multi-chiplet GPU architecture with additional units highlighted in orange. (b) SoW (System-on-Wafer) technology structure. (c)

**Figure 10.** Proposed task allocation algorithm and data-driven predictor.

**Figure 11.** Throughput of MoE layers (Top) and hop number reduction ratio (Bottom). All figures are scaled to baseline.

**Figure 12.** Simulator validation with real data generated from 8xH100 DGX,

**Figure 13.** DRAM access breakdown for

Original abstract

Large-scale Mixture of Experts (MoE) Large Language Models (LLMs) have recently become the frontier open-weight models, achieving remarkable model capability similar to proprietary ones. But their random expert selection mechanism introduces significant data movement overhead that becomes the dominant bottleneck in multi-unit LLM serving systems. To understand the patterns underlying this data movement, we conduct comprehensive data-movement-centric profiling across four state-of-the-art large-scale MoE models released in 2025 (200B-1000B) using over 24,000 requests spanning diverse workloads. We perform systematic analysis from both temporal and spatial perspectives and distill six key insights to guide the design of diverse serving systems. We verify these insights on both future wafer-scale GPU architectures and existing GPU systems. On wafer-scale GPUs, lightweight architectural modifications guided by our insights yield a 6.6× average speedup across four 200B-1000B models. On existing GPU systems, our insights drive the design of a prefill-aware expert placement algorithm that achieves up to 1.25× speedup on MoE computation. Our work presents the first comprehensive data-centric analysis of large-scale MoE models together with a concrete design study applying the learned lessons. Our profiling traces are publicly available at https://huggingface.co/datasets/core12345/MoE_expert_selection_trace.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

The paper delivers the first large-scale data-movement traces for 2025-era 200B-1000B MoE models plus concrete speedups from profiled insights, but those gains rest on untested transferability to other workloads and hardware.

The paper's real contribution is the public release of expert-selection traces from four 200B-1000B MoE models, collected over more than 24,000 requests, together with a temporal and spatial breakdown that yields six usable insights. That dataset alone is worth having for anyone working on MoE serving. The authors then apply the insights to show a 6.6x average speedup from lightweight changes in wafer-scale GPU simulations and up to 1.25x on MoE computation from a prefill-aware placement algorithm on existing hardware. The measurements are empirical rather than circular, and the design study gives a clear path from observation to modification.

The soft spot is transferability. The abstract gives no detail on how the request mix was chosen, whether alternate routing schemes or batch sizes were tried, or how performance holds when model scale or workload distribution shifts. If the observed patterns are specific to these four models and this request set, the architectural guidance and reported speedups will not travel as far as claimed.

This work is aimed at systems researchers and engineers who need real traces to test placement, scheduling, or hardware ideas for large MoE inference. It is not a theoretical advance but supplies the kind of grounded data that can move the field forward. The paper deserves peer review so referees can examine the workload selection, statistical reporting, and any cross-validation experiments that may exist in the full text.

Referee Report

2 major / 2 minor

Summary. The paper conducts comprehensive profiling of data movement in four large-scale MoE LLMs (200B-1000B) using over 24,000 requests. It distills six key insights from temporal and spatial analysis to guide serving system design. These insights lead to lightweight modifications yielding a 6.6× average speedup on wafer-scale GPUs and a prefill-aware placement algorithm with up to 1.25× speedup on MoE computation on existing systems. Profiling traces are publicly released.

Significance. This work addresses a critical bottleneck in large MoE LLM inference by providing data-centric insights. The reported speedups demonstrate practical impact if the insights hold. Public traces enhance reproducibility. The significance hinges on whether the patterns generalize beyond the studied models and workloads.

major comments (2)

§4: The six distilled insights are derived from the profiled 2025 models and workloads without reported cross-validation on alternate request distributions, routing strategies (e.g., top-k variants), or model scales; this is load-bearing for the claim that the insights guide design for diverse serving systems.
§6.1, wafer-scale results: The 6.6× average speedup is obtained via simulated architectural modifications; the manuscript does not detail how the simulation captures realistic interconnect and memory hierarchy costs or whether the modifications introduce unaccounted overheads.

minor comments (2)

The methods section should explicitly state the workload selection criteria and the exact composition (e.g., request length distributions, batch sizes) of the 24k requests to allow assessment of representativeness.
Figure captions and axis labels in the spatial/temporal analysis sections would benefit from additional detail on the units and normalization used for data-movement volume.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We address each major comment below and describe the revisions we will make to strengthen the manuscript.

Referee: §4: The six distilled insights are derived from the profiled 2025 models and workloads without reported cross-validation on alternate request distributions, routing strategies (e.g., top-k variants), or model scales; this is load-bearing for the claim that the insights guide design for diverse serving systems.

Authors: We acknowledge that the primary analysis is performed on the four 2025 models (200B–1000B) and the 24,000-request workloads described in the manuscript. These models already span a substantial range of scales and exhibit different expert-selection behaviors. The six insights are presented as observations from this data rather than universal claims. To address the concern directly, we will revise §4 to add an explicit discussion of the studied models’ diversity, note the absence of exhaustive cross-validation on alternate top-k variants or request distributions, and include a short analysis applying the insights to a small set of additional traces with varied request patterns drawn from the public dataset. We will also qualify the guidance for serving-system design accordingly. revision: partial
Referee: §6.1, wafer-scale results: The 6.6× average speedup is obtained via simulated architectural modifications; the manuscript does not detail how the simulation captures realistic interconnect and memory hierarchy costs or whether the modifications introduce unaccounted overheads.

Authors: We appreciate the referee highlighting the need for greater transparency in the simulation. The results in §6.1 are produced by a cycle-level simulator that incorporates published wafer-scale interconnect bandwidth figures and memory-access latencies. In the revised manuscript we will expand the methodology subsection to describe (1) how the interconnect topology and bandwidth are modeled, (2) how memory-hierarchy costs are accounted for, and (3) an overhead analysis of the proposed lightweight modifications together with sensitivity results showing that the reported speedups remain robust under reasonable variations in these parameters. revision: yes

Circularity Check

0 steps flagged

Empirical profiling study; speedups are direct measurements, not derived quantities

The paper conducts data-movement profiling on four 2025 MoE models using 24k requests, distills six insights from observed temporal/spatial patterns, and reports measured speedups from architectural modifications and a placement algorithm. No equations, fitted parameters, or first-principles derivations are presented; claims rest on empirical observation and verification rather than any reduction to inputs by construction. No self-citation chains or ansatzes are load-bearing for the central results.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The work is empirical profiling and does not introduce new mathematical axioms, fitted parameters, or postulated entities; it relies on standard assumptions about GPU memory hierarchies and MoE routing behavior.

pith-pipeline@v0.9.0 · 5835 in / 1131 out tokens · 28992 ms · 2026-05-18T09:49:47.241694+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear
Paper passage: "We perform systematic analysis from both temporal and spatial perspectives and distill six key insights... prefill-aware expert placement algorithm that achieves up to 1.25× speedup"

IndisputableMonolith/Foundation/ArithmeticFromLogic.lean · embed_strictMono_of_one_lt · unclear
Paper passage: "layer-level temporal correlation heatmaps... token-level... prefill-decode-level"

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Replication in Graph Partitioning and Scheduling Problems
cs.DC 2026-04 unverdicted novelty 5.0

Replication reduces costs by 17-65% on average in hypergraph partitioning and 11-23% in DAG scheduling, sometimes eliminating communication needs entirely.

Reference graph

Works this paper leans on

82 extracted references · 82 canonical work pages · cited by 1 Pith paper · 8 internal anchors

[1]

Gpt-4 technical report,

J. Achiam, S. Adler, S. Agarwal, L. Ahmad, I. Akkaya, F. L. Aleman, D. Almeida, J. Altenschmidt, S. Altman, S. Anadkatet al., “Gpt-4 technical report,”arXiv, 2023

work page 2023
[2]

Analyzing cuda workloads using a detailed gpu simulator,

A. Bakhoda, G. L. Yuan, W. W. Fung, H. Wong, and T. M. Aamodt, “Analyzing cuda workloads using a detailed gpu simulator,” in2009 IEEE international symposium on performance analysis of systems and software, 2009

work page 2009
[3]

The gem5 simulator,

N. Binkert, B. Beckmann, G. Black, S. K. Reinhardt, A. Saidi, A. Basu, J. Hestness, D. R. Hower, T. Krishna, S. Sardashtiet al., “The gem5 simulator,”ACM SIGARCH computer architecture news, 2011

work page 2011
[4]

Moe-lightning: High-throughput moe inference on memory-constrained gpus,

S. Cao, S. Liu, T. Griggs, P. Schafhalter, X. Liu, Y . Sheng, J. E. Gon- zalez, M. Zaharia, and I. Stoica, “Moe-lightning: High-throughput moe inference on memory-constrained gpus,” inProceedings of the 30th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, 2025

work page 2025
[5]

Waferscale network switches,

S. Chen, S. Pal, and R. Kumar, “Waferscale network switches,” in 2024 ACM/IEEE 51st Annual International Symposium on Computer Architecture (ISCA). IEEE, 2024, pp. 215–229

work page 2024
[6]

Chatbot arena: An open platform for evaluating llms by human preference,

W.-L. Chiang, L. Zheng, Y . Sheng, A. N. Angelopoulos, T. Li, D. Li, B. Zhu, H. Zhang, M. Jordan, J. E. Gonzalezet al., “Chatbot arena: An open platform for evaluating llms by human preference,” inForty-first International Conference on Machine Learning, 2024

work page 2024
[7]

Lexi: Layer-adaptive active experts for efficient moe model inference,

K. T. Chitty-Venkata, S. Madireddy, M. Emani, and V . Vishwanath, “Lexi: Layer-adaptive active experts for efficient moe model inference,” arXiv preprint arXiv:2509.02753, 2025

work page arXiv 2025
[8]

DeepSeekMoE: Towards Ultimate Expert Specialization in Mixture-of-Experts Language Models

D. Dai, C. Deng, C. Zhao, R. Xu, H. Gao, D. Chen, J. Li, W. Zeng, X. Yu, Y . Wuet al., “Deepseekmoe: Towards ultimate expert spe- cialization in mixture-of-experts language models,”arXiv preprint arXiv:2401.06066, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[9]

Cpelide: Efficient multi- chiplet gpu implicit synchronization,

P. Dalmia, R. S. Kumar, and M. D. Sinclair, “Cpelide: Efficient multi- chiplet gpu implicit synchronization,” in2024 57th IEEE/ACM Interna- tional Symposium on Microarchitecture (MICRO), 2024

work page 2024
[10]

A complete survey on llm-based ai chatbots,

S. K. Dam, C. S. Hong, Y . Qiao, and C. Zhang, “A complete survey on llm-based ai chatbots,”arXiv, 2024

work page 2024
[11]

Sida: Sparsity-inspired data-aware serving for efficient and scalable large mixture-of-experts models,

Z. Du, S. Li, Y . Wu, X. Jiang, J. Sun, Q. Zheng, Y . Wu, A. Li, H. Li, and Y . Chen, “Sida: Sparsity-inspired data-aware serving for efficient and scalable large mixture-of-experts models,”Proceedings of Machine Learning and Systems, vol. 6, pp. 224–238, 2024

work page 2024
[12]

Fast inference of mixture-of-experts language models with offloading,

A. Eliseev and D. Mazur, “Fast inference of mixture-of-experts language models with offloading,”arXiv preprint arXiv:2312.17238, 2023

work page arXiv 2023
[13]

Klotski: Efficient mixture-of-expert inference via expert- aware multi-batch pipeline,

Z. Fang, Y . Huang, Z. Hong, Y . Lyu, W. Chen, Y . Yu, F. Yu, and Z. Zheng, “Klotski: Efficient mixture-of-expert inference via expert- aware multi-batch pipeline,” inProceedings of the 30th ACM Interna- tional Conference on Architectural Support for Programming Languages and Operating Systems, Volume 2, 2025

work page 2025
[14]

Heterogeneous die-to-die interfaces: Enabling more flexible chiplet interconnection systems,

Y . Feng, D. Xiang, and K. Ma, “Heterogeneous die-to-die interfaces: Enabling more flexible chiplet interconnection systems,” inProceedings of the 56th Annual IEEE/ACM International Symposium on Microarchi- tecture, 2023, pp. 930–943

work page 2023
[15]

A scalable methodology for designing efficient interconnection network of chiplets,

——, “A scalable methodology for designing efficient interconnection network of chiplets,” in2023 IEEE International Symposium on High- Performance Computer Architecture (HPCA). IEEE, 2023, pp. 1059– 1071

work page 2023
[16]

S. T. from LMSYS Org. (2025) Deploying deepseek with pd disaggregation and large-scale expert parallelism on 96 h100 gpus. [Online]. Available: https://lmsys.org/blog/2025-05-05-large-scale-ep/

work page 2025
[17]

(2025) Together with sglang: Best practices for serving deepseek- r1 on h20-96g

——. (2025) Together with sglang: Best practices for serving deepseek- r1 on h20-96g. [Online]. Available: https://lmsys.org/blog/2025-09-26- sglang-ant-group/

work page 2025
[18]

Moetuner: Optimized mixture of expert serving with balanced expert placement and token routing.arXiv preprint arXiv:2502.06643,

S. Go and D. Mahajan, “Moetuner: Optimized mixture of expert serving with balanced expert placement and token routing,”arXiv preprint arXiv:2502.06643, 2025

work page arXiv 2025
[19]

Lynx: Enabling efficient moe inference through dynamic batch-aware expert selection,

V . Gupta, K. Sinha, A. Gavrilovska, and A. P. Iyer, “Lynx: Enabling efficient moe inference through dynamic batch-aware expert selection,” arXiv preprint arXiv:2411.08982, 2024

work page arXiv 2024
[20]

Lost in abstraction: Pitfalls of analyzing gpus at the intermediate language level,

A. Gutierrez, B. M. Beckmann, A. Dutu, J. Gross, M. LeBeane, J. Kala- matianos, O. Kayiran, M. Poremba, B. Potter, S. Puthooret al., “Lost in abstraction: Pitfalls of analyzing gpus at the intermediate language level,” in2018 IEEE International Symposium on High Performance Computer Architecture (HPCA), 2018

work page 2018
[21]

Waferllm: Large language model inference at wafer scale,

C. He, Y . Huang, P. Mu, Z. Miao, J. Xue, L. Ma, F. Yang, and L. Mai, “Waferllm: Large language model inference at wafer scale,” in19th USENIX Symposium on Operating Systems Design and Implementation (OSDI 25). USENIX Association, 2025

work page 2025
[22]

Chinese simpleqa: A chinese factuality evaluation for large language models,

Y . He, S. Li, J. Liu, Y . Tan, W. Wang, H. Huang, X. Bu, H. Guo, C. Hu, B. Zhenget al., “Chinese simpleqa: A chinese factuality evaluation for large language models,”arXiv preprint arXiv:2411.07140, 2024

work page arXiv 2024
[23]

Measuring Massive Multitask Language Understanding

D. Hendrycks, C. Burns, S. Basart, A. Zou, M. Mazeika, D. Song, and J. Steinhardt, “Measuring massive multitask language understanding,” arXiv preprint arXiv:2009.03300, 2020

work page internal anchor Pith review Pith/arXiv arXiv 2009
[24]

Wafer-level integration of an advanced logic-memory system through the second-generation cowos technology,

S. Hou, W. C. Chen, C. Hu, C. Chiu, K. Ting, T. Lin, W. Wei, W. Chiou, V . J. Lin, V . C. Changet al., “Wafer-level integration of an advanced logic-memory system through the second-generation cowos technology,” IEEE Transactions on Electron Devices, 2017

work page 2017
[25]

Integrated deep trench capacitor in si inter- poser for cowos heterogeneous integration,

S. Hou, H. Hsia, C. Tsai, K. Ting, T. Yu, Y . Lee, F. Chen, W. Chiou, C. Wang, C. Wuet al., “Integrated deep trench capacitor in si inter- poser for cowos heterogeneous integration,” in2019 IEEE International Electron Devices Meeting (IEDM). IEEE, 2019, pp. 19–5

work page 2019
[26]

Cowos architecture evolution for next generation hpc on 2.5 d system in package,

Y .-C. Hu, Y .-M. Liang, H.-P. Hu, C.-Y . Tan, C.-T. Shen, C.-H. Lee, and S. Hou, “Cowos architecture evolution for next generation hpc on 2.5 d system in package,” in2023 IEEE 73rd Electronic Components and Technology Conference (ECTC), 2023

work page 2023
[27]

Tutel: Adaptive mixture-of-experts at scale,

C. Hwang, W. Cui, Y . Xiong, Z. Yang, Z. Liu, H. Hu, Z. Wang, R. Salas, J. Jose, P. Ramet al., “Tutel: Adaptive mixture-of-experts at scale,” Proceedings of Machine Learning and Systems, vol. 5, pp. 269–287, 2023

work page 2023
[28]

Pre-gated moe: An algorithm-system co-design for fast and scalable mixture-of-expert inference,

R. Hwang, J. Wei, S. Cao, C. Hwang, X. Tang, T. Cao, and M. Yang, “Pre-gated moe: An algorithm-system co-design for fast and scalable mixture-of-expert inference,” in2024 ACM/IEEE 51st Annual Interna- tional Symposium on Computer Architecture (ISCA). IEEE, 2024, pp. 1018–1031

work page 2024
[29]

LiveCodeBench: Holistic and Contamination Free Evaluation of Large Language Models for Code

N. Jain, K. Han, A. Gu, W.-D. Li, F. Yan, T. Zhang, S. Wang, A. Solar-Lezama, K. Sen, and I. Stoica, “Livecodebench: Holistic and contamination free evaluation of large language models for code,”arXiv preprint arXiv:2403.07974, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[30]

Mixtral of Experts

A. Q. Jiang, A. Sablayrolles, A. Roux, A. Mensch, B. Savary, C. Bam- ford, D. S. Chaplot, D. d. l. Casas, E. B. Hanna, F. Bressandet al., “Mixtral of experts,”arXiv preprint arXiv:2401.04088, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[31]

Fiddler: CPU-GPU orchestration for fast inference of mixture-of-experts models,

K. Kamahori, T. Tang, Y . Gu, K. Zhu, and B. Kasikci, “Fiddler: CPU-GPU orchestration for fast inference of mixture-of-experts models,” inThe Thirteenth International Conference on Learning Representations, 2025. [Online]. Available: https://openreview.net/ forum?id=N5fVv6PZGz

work page 2025
[32]

Comp-net: Command processor networking for efficient intra-kernel communications on gpus,

M. LeBeane, K. Hamidouche, B. Benton, M. Breternitz, S. K. Reinhardt, and L. K. John, “Comp-net: Command processor networking for efficient intra-kernel communications on gpus,” inProceedings of the 27th International Conference on Parallel Architectures and Compilation Techniques, 2018

work page 2018
[33]

Optimizing mixture-of-experts inference time combining model deployment and communication scheduling,

J. Li, S. Tripathi, L. Rastogi, Y . Lei, R. Pan, and Y . Xia, “Optimizing mixture-of-experts inference time combining model deployment and communication scheduling,”arXiv preprint arXiv:2410.17043, 2024. 12

work page arXiv 2024
[34]

Accelerating distributed moe training and inference with lina,

J. Li, Y . Jiang, Y . Zhu, C. Wang, and H. Xu, “Accelerating distributed moe training and inference with lina,” in2023 USENIX Annual Technical Conference (USENIX ATC 23), 2023, pp. 945–959

work page 2023
[35]

Lucie: A universal chiplet-interposer design framework for plug-and-play integration,

Z. Li and D. Wentzlaff, “Lucie: A universal chiplet-interposer design framework for plug-and-play integration,” in2024 57th IEEE/ACM International Symposium on Microarchitecture (MICRO). IEEE, 2024, pp. 423–436

work page 2024
[36]

DeepSeek-V3 Technical Report

A. Liu, B. Feng, B. Xue, B. Wang, B. Wu, C. Lu, C. Zhao, C. Deng, C. Zhang, C. Ruanet al., “Deepseek-v3 technical report,”arXiv preprint arXiv:2412.19437, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[37]

Llamax: Scaling linguis- tic horizons of llm by enhancing translation capabilities beyond 100 languages,

Y . Lu, W. Zhu, L. Li, Y . Qiao, and F. Yuan, “Llamax: Scaling linguis- tic horizons of llm by enhancing translation capabilities beyond 100 languages,”arXiv, 2024

work page 2024
[38]

Moore’s law forever?

M. Lundstrom, “Moore’s law forever?”Science, 2003

work page 2003
[39]

Embedded multi-die interconnect bridge (emib)–a high density, high bandwidth packaging interconnect,

R. Mahajan, R. Sankman, N. Patel, D.-W. Kim, K. Aygun, Z. Qian, Y . Mekonnen, I. Salama, S. Sharan, D. Iyengaret al., “Embedded multi-die interconnect bridge (emib)–a high density, high bandwidth packaging interconnect,” in2016 IEEE 66th Electronic Components and Technology Conference (ECTC), 2016

work page 2016
[40]

Mixture of experts: a literature survey,

S. Masoudnia and R. Ebrahimpour, “Mixture of experts: a literature survey,”Artificial Intelligence Review, pp. 275–293, 2014

work page 2014
[41]

(2025) Llama4 technical report

Meta. (2025) Llama4 technical report. [Online]. Available: https: //ai.meta.com/blog/llama-4-multimodal-intelligence/

work page 2025
[42]

OLMoE: Open Mixture-of-Experts Language Models

N. Muennighoff, L. Soldaini, D. Groeneveld, K. Lo, J. Morrison, S. Min, W. Shi, P. Walsh, O. Tafjord, N. Lambertet al., “Olmoe: Open mixture- of-experts language models,”arXiv preprint arXiv:2409.02060, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[43]

Using an llm to help with code understanding,

D. Nam, A. Macvean, V . Hellendoorn, B. Vasilescu, and B. Myers, “Using an llm to help with code understanding,” inProceedings of the IEEE/ACM 46th International Conference on Software Engineering, ser. ICSE ’24, 2024

work page 2024
[44]

Nvidia blackwell architecture overview,

NVIDIA, “Nvidia blackwell architecture overview,” https://resources. nvidia.com/en-us-blackwell-architecture, 2025

work page 2025
[45]

Nvidia gtc 2025,

——, “Nvidia gtc 2025,” https://www.nvidia.com/gtc/, 2025

work page 2025
[46]

Scar: Schedul- ing multi-model ai workloads on heterogeneous multi-chiplet module accelerators,

M. Odema, L. Chen, H. Kwon, and M. A. Al Faruque, “Scar: Schedul- ing multi-model ai workloads on heterogeneous multi-chiplet module accelerators,” in2024 57th IEEE/ACM International Symposium on Microarchitecture (MICRO). IEEE, 2024, pp. 565–579

work page 2024
[47]

Mooncake: Trading more storage for less computation—a {KVCache-centric}architecture for serving{LLM}chatbot,

R. Qin, Z. Li, W. He, J. Cui, F. Ren, M. Zhang, Y . Wu, W. Zheng, and X. Xu, “Mooncake: Trading more storage for less computation—a {KVCache-centric}architecture for serving{LLM}chatbot,” in23rd USENIX Conference on File and Storage Technologies (FAST 25), 2025

work page 2025
[48]

Fred: A wafer-scale fabric for 3d parallel dnn training,

S. Rashidi, W. Won, S. Srinivasan, P. Gupta, and T. Krishna, “Fred: A wafer-scale fabric for 3d parallel dnn training,” inProceedings of the 52nd Annual International Symposium on Computer Architecture, 2025, pp. 34–48

work page 2025
[49]

Correlation coefficients: appropriate use and interpretation,

P. Schober, C. Boer, and L. A. Schwarte, “Correlation coefficients: appropriate use and interpretation,”Anesthesia & analgesia, 2018

work page 2018
[50]

Simba: Scaling deep-learning inference with multi-chip-module-based architecture,

Y . S. Shao, J. Clemons, R. Venkatesan, B. Zimmer, M. Fojtik, N. Jiang, B. Keller, A. Klinefelter, N. Pinckney, P. Rainaet al., “Simba: Scaling deep-learning inference with multi-chip-module-based architecture,” in Proceedings of the 52nd annual IEEE/ACM international symposium on microarchitecture, 2019, pp. 14–27

work page 2019
[51]

Flexgen: High-throughput generative inference of large language models with a single gpu,

Y . Sheng, L. Zheng, B. Yuan, Z. Li, M. Ryabinin, B. Chen, P. Liang, C. R ´e, I. Stoica, and C. Zhang, “Flexgen: High-throughput generative inference of large language models with a single gpu,” inInternational Conference on Machine Learning, 2023

work page 2023
[52]

P.-C. Shih, A.-J. Su, K.-H. Tam, T.-C. Huang, K. Chuang, and J. Yeh, “Sow-x: A novel system-on-wafer technology for next generation ai server application,” in 2025 IEEE 75th Electronic Components and Technology Conference (ECTC). IEEE, 2025
[53]

J. Shin, H. Eslampour, S. Jeong, W. Kim, S. Yong, S.-O. Ahn, E. Park, and S. Song, “Signal integrity of die-to-die interface with advanced packages for co-packaged optics,” in 2024 IEEE 33rd Conference on Electrical Performance of Electronic Packaging and Systems (EPEPS), 2024
[54]

A. Skliar, T. van Rozendaal, R. Lepert, T. Boinovski, M. Van Baalen, M. Nagel, P. Whatmough, and B. E. Bejnordi, “Mixture of cache-conditional experts for efficient mobile device inference,” arXiv preprint arXiv:2412.00099, 2024
[55]

A. Smith, G. H. Loh, J. Wuu, S. Naffziger, T. Huang, H. McIntyre, R. Mangaser, W. Jung, and R. Swaminathan, “Amd instinct™ mi300x accelerator: Packaging and architecture co-optimization,” in 2024 IEEE Symposium on VLSI Technology and Circuits (VLSI Technology and Circuits). IEEE, 2024
[56]

X. Sun, Y. Chen, Y. Huang, R. Xie, J. Zhu, K. Zhang, S. Li, Z. Yang, J. Han, X. Shu et al., “Hunyuan-large: An open-source moe model with 52 billion activated parameters by tencent,” arXiv preprint arXiv:2411.02265, 2024
[57]

Y. Sun, T. Baruah, S. A. Mojumder, S. Dong, X. Gong, S. Treadway, Y. Bao, S. Hance, C. McCardwell, V. Zhao et al., “Mgpusim: Enabling multi-gpu performance modeling and optimization,” in Proceedings of the 46th International Symposium on Computer Architecture, 2019
[58]

J. Suo, X. Liao, L. Xiao, L. Ruan, J. Wang, X. Su, and Z. Huo, “Coserve: Efficient collaboration-of-experts (coe) model inference with limited memory,” in Proceedings of the 30th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, 2025
[59]

S. Tairin, S. Mahmud, H. Shen, and A. Iyer, “emoe: Task-aware memory efficient mixture-of-experts-based (moe) model inference,” arXiv preprint arXiv:2503.06823, 2025
[60]

E. Talpes, D. D. Sarma, D. Williams, S. Arora, T. Kunjan, B. Floering, A. Jalote, C. Hsiong, C. Poorna, V. Samant et al., “The microarchitecture of dojo, tesla’s exa-scale computer,” IEEE Micro, 2023
[61]

E. Talpes, D. Williams, and D. D. Sarma, “Dojo: The microarchitecture of tesla’s exa-scale computer,” in 2022 IEEE Hot Chips 34 Symposium (HCS), 2022
[62]

Z. Tan, H. Cai, R. Dong, and K. Ma, “Nn-baton: Dnn workload orchestration and chiplet granularity exploration for multichip accelerators,” in 2021 ACM/IEEE 48th Annual International Symposium on Computer Architecture (ISCA). IEEE, 2021, pp. 1013–1026
[63]

K. Team, Y. Bai, Y. Bao, G. Chen, J. Chen, N. Chen, R. Chen, Y. Chen, Y. Chen, Y. Chen et al., “Kimi k2: Open agentic intelligence,” arXiv preprint arXiv:2507.20534, 2025
[64]

TSMC. (2025) Tsmc’s next generation of system-on-wafer package will make today’s cpus and gpus look pathetically feeble in comparison. [Online]. Available: https://www.pcgamer.com/hardware/processors/tsmcs-next-generation-of-system-on-wafer-packaging-will-make-todays-cpus-and-gpus-look-pathetically-feeble-in-comparison/
[65]

S. Vakayil, D. S. Juliet, S. Vakayil et al., “Rag-based llm chatbot using llama-2,” in 2024 7th International Conference on Devices, Circuits and Systems (ICDCS), 2024
[66]

A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin, “Attention is all you need,” Advances in neural information processing systems, 2017
[67]

Y. Wang, X. Ma, G. Zhang, Y. Ni, A. Chandra, S. Guo, W. Ren, A. Arulraj, X. He, Z. Jiang et al., “Mmlu-pro: A more robust and challenging multi-task language understanding benchmark,” Advances in Neural Information Processing Systems, 2024
[68]

Y. Wang, W. Wang, S. Joty, and S. C. H. Hoi, “Codet5: Identifier-aware unified pre-trained encoder-decoder models for code understanding and generation,” arXiv, 2021
[69]

W. Won, T. Heo, S. Rashidi, S. Sridharan, S. Srinivasan, and T. Krishna, “Astra-sim2.0: Modeling hierarchical networks and disaggregated systems for large-model training at scale,” in 2023 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS), 2023
[70]

B. Wu, S. Liu, Y. Zhong, P. Sun, X. Liu, and X. Jin, “Loongserve: Efficiently serving long-context large language models with elastic sequence parallelism,” in Proceedings of the ACM SIGOPS 30th Symposium on Operating Systems Principles, 2024, pp. 640–654
[71]

Z. Xu, D. Kong, J. Liu, J. Li, J. Hou, X. Dai, C. Li, S. Wei, Y. Hu, and S. Yin, “Wsc-llm: Efficient llm service and architecture co-exploration for wafer-scale chips,” in Proceedings of the 52nd Annual International Symposium on Computer Architecture, 2025, pp. 1–17
[72]

A. Yang, A. Li, B. Yang, B. Zhang, B. Hui, B. Zheng, B. Yu, C. Gao, C. Huang, C. Lv et al., “Qwen3 technical report,” arXiv preprint arXiv:2505.09388, 2025
[73]

Q. Yang, T. Wei, S. Guan, C. Li, H. Shang, J. Deng, H. Wang, C. Li, L. Wang, Y. Zhang et al., “Pd constraint-aware physical/logical topology co-design for network on wafer,” in Proceedings of the 52nd Annual International Symposium on Computer Architecture, 2025, pp. 49–64
[74]

J. Yao, Q. Anthony, A. Shafi, H. Subramoni, and D. K. D. Panda, “Exploiting inter-layer expert affinity for accelerating mixture-of-experts model inference,” in 2024 IEEE International Parallel and Distributed Processing Symposium (IPDPS). IEEE, 2024
[75]

X. Yu, D. Jiang, J. Deng, J. Liu, C. Li, S. Yin, and Y. Hu, “Cramming a data center into one cabinet, a co-exploration of computing and hardware architecture of waferscale chip,” in Proceedings of the 52nd Annual International Symposium on Computer Architecture, 2025, pp. 631–645
[76]

Z. Yu, S. Liang, T. Ma, Y. Cai, Z. Nan, D. Huang, X. Song, Y. Hao, J. Zhang, T. Zhi et al., “Cambricon-llm: A chiplet-based hybrid architecture for on-device inference of 70b llm,” in 2024 57th IEEE/ACM International Symposium on Microarchitecture (MICRO), 2024
[77]

S. Yun, K. Kyung, J. Cho, J. Choi, J. Kim, B. Kim, S. Lee, K. Sohn, and J. H. Ahn, “Duplex: A device for large language models with mixture of experts, grouped query attention, and continuous batching,” in 2024 57th IEEE/ACM International Symposium on Microarchitecture (MICRO), 2024
[78]

M. Zhai, J. He, Z. Ma, Z. Zong, R. Zhang, and J. Zhai, “SmartMoE: Efficiently training sparsely-activated models through combining offline and online parallelization,” in 2023 USENIX Annual Technical Conference (USENIX ATC 23), 2023, pp. 961–975
[79]

S. Zhang, N. Zheng, H. Lin, Z. Jiang, W. Bao, C. Jiang, Q. Hou, W. Cui, S. Zheng, L.-W. Chang, Q. Chen, and X. Liu, “COMET: Fine-grained computation-communication overlapping for mixture-of-experts,” in Eighth Conference on Machine Learning and Systems, 2025. [Online]. Available: https://openreview.net/forum?id=fGgQS5VW09
[80]

L. Zheng, L. Yin, Z. Xie, C. L. Sun, J. Huang, C. H. Yu, S. Cao, C. Kozyrakis, I. Stoica, J. E. Gonzalez et al., “Sglang: Efficient execution of structured language model programs,” Advances in neural information processing systems, 2024

Showing first 80 references.