pith. machine review for the scientific record.

arxiv: 2605.11277 · v1 · submitted 2026-05-11 · 💻 cs.AR

Recognition: 2 theorem links · Lean Theorem

Sieve: Dynamic Expert-Aware PIM Acceleration for Evolving Mixture-of-Experts Models

Christos Kozyrakis, Genghan Zhang, Gina Sohn, Jungwoo Kim, Kunle Olukotun, Qizheng Zhang, Rubens Lacouture, Swapnil Gandhi

Pith reviewed 2026-05-13 00:50 UTC · model grok-4.3

classification 💻 cs.AR
keywords Mixture-of-Experts · Processing-in-Memory · Dynamic Scheduling · LLM Inference · Expert Parallelism · Load Imbalance · Bimodal Distribution · GPU Acceleration

The pith

Sieve uses runtime monitoring of token-to-expert distributions to dynamically assign experts to GPU or attached PIM, cutting load imbalance in modern MoE models.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Recent Mixture-of-Experts models activate fewer experts but with a strong bias toward a small subset, producing very different arithmetic intensities across experts. Static PIM offload rules that worked for earlier balanced MoE setups therefore leave either GPUs or PIM under-utilized while communication costs remain. The paper introduces a scheduler that observes the current token distribution at runtime and jointly weighs interconnect overhead, memory bandwidth, GPU throughput, and PIM throughput to decide where each expert should run. The Sieve framework applies this scheduler across multiple GPUs and their HBM-PIM stacks while overlapping computation and communication and preserving expert-parallelism dependencies. On a cycle-accurate simulator the resulting system raises both throughput and interactivity by 1.3x to 1.6x over prior static PIM baselines for three large recent MoE models.
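
To make the intensity disparity concrete, here is a back-of-envelope calculation in the spirit of the paper's roofline argument; the FFN dimensions and fp16 precision below are our illustrative assumptions, not values from the paper.

  # Sketch: arithmetic intensity of one FFN expert as a function of its token count.
  # d_model, d_ff, and 2-byte (fp16) elements are illustrative assumptions.
  def expert_arithmetic_intensity(tokens, d_model=4096, d_ff=14336, elem_bytes=2):
      weight_bytes = 2 * d_model * d_ff * elem_bytes          # up + down projections
      act_bytes = tokens * (2 * d_model + d_ff) * elem_bytes  # inputs, outputs, hidden
      flops = 2 * 2 * tokens * d_model * d_ff                 # two GEMMs, 2 FLOPs per MAC
      return flops / (weight_bytes + act_bytes)

  print(expert_arithmetic_intensity(1))     # ~1 FLOP/byte: a GEMV, memory-bound
  print(expert_arithmetic_intensity(1024))  # hundreds of FLOPs/byte: compute-bound

A cold expert receiving one token streams its full weights for almost no compute, exactly the regime PIM serves well; a hot expert amortizes the same weights over many tokens and belongs on the GPU.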

Core claim

The central claim is that the emerging bimodal token-to-expert distribution in modern MoE models creates a disparity in arithmetic intensity that static PIM rules cannot exploit; a dynamic scheduler that partitions experts between GPU and PIM at runtime, while accounting for interconnect, bandwidth, and device throughputs, restores efficiency and enables overlapping of GPU work, PIM work, and cross-device communication.

What carries the argument

The runtime scheduler that partitions expert execution between GPU and PIM according to observed token-to-expert distributions while jointly considering interconnect overhead, memory bandwidth, GPU throughput, and PIM throughput.
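
A minimal sketch of what such a joint-cost rule could look like; the roofline-style timing estimates, the greedy loop, and all parameter names are our illustration, not the authors' algorithm.

  # Sketch: greedy expert placement under a joint cost model (illustrative only).
  def place_experts(token_counts, flops_per_token, expert_bytes, token_bytes,
                    gpu_flops, gpu_bw, pim_bw, link_bw):
      gpu_busy, pim_busy, placement = 0.0, 0.0, {}
      # Place the busiest experts first so the expensive decisions are settled early.
      for expert, n_tok in sorted(token_counts.items(), key=lambda kv: -kv[1]):
          flops = n_tok * flops_per_token
          # GPU time: compute-bound or bandwidth-bound, whichever dominates.
          t_gpu = max(flops / gpu_flops, expert_bytes / gpu_bw)
          # PIM reads weights at stack bandwidth, but activations cross the link twice.
          t_pim = expert_bytes / pim_bw + 2 * n_tok * token_bytes / link_bw
          if gpu_busy + t_gpu <= pim_busy + t_pim:
              gpu_busy += t_gpu; placement[expert] = "gpu"
          else:
              pim_busy += t_pim; placement[expert] = "pim"
      return placement, max(gpu_busy, pim_busy)

Because GPU and PIM execute concurrently, the objective is the makespan max(gpu_busy, pim_busy) rather than a sum, which is why the partition as a whole, not each expert's placement in isolation, determines the gain.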

If this is right

  • Throughput and interactivity gains scale with the degree of bimodality in token-to-expert distributions (one way to quantify this is sketched after this list).
  • Overlapping GPU computation, PIM computation, and intra- and inter-device communication remains feasible while respecting expert-parallelism ordering constraints.
  • The same monitoring-plus-joint-cost scheduler can be applied to other heterogeneous memory systems that combine high-bandwidth memory with attached compute.
  • Performance remains stable as the number of experts grows and activation sparsity increases.
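
The first bullet leans on the "degree of bimodality" being cheap to measure at serving time. One simple proxy, our choice rather than the paper's metric, is the share of tokens absorbed by the few busiest experts.

  # Sketch: top-k token share as a bimodality proxy (the metric and k are our choice).
  def top_share(token_counts, k=8):
      total = sum(token_counts)
      return sum(sorted(token_counts, reverse=True)[:k]) / total

  balanced = [32] * 128               # uniform load across 128 experts
  bimodal  = [1000] * 4 + [3] * 124   # a hot head and a long tail
  print(top_share(balanced))  # 0.0625, i.e. k/n: no skew
  print(top_share(bimodal))   # ~0.92: most tokens hit a handful of experts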

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • If monitoring latency proves higher on real silicon than in simulation, the framework may need to switch from per-batch to per-layer or per-request monitoring intervals (a fallback of this kind is sketched after this list).
  • The same cost model could be reused to decide when to migrate experts between devices in a multi-node setting rather than only within a single node.
  • Training pipelines could incorporate a lightweight version of the scheduler to produce MoE checkpoints that are already tuned for PIM-augmented inference hardware.
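
For the first extension, the fallback could be as simple as coarsening the monitoring interval whenever its measured latency eats into the step budget. Everything below, the level names, the 1% budget, and the 10x coarsening assumption, is hypothetical.

  # Sketch: degrade monitoring granularity when its measured cost grows too large.
  LEVELS = ["per_batch", "per_layer", "per_request"]  # finest to coarsest, per the bullet

  def pick_interval(monitor_us, step_us):
      level = 0
      # Coarsen while monitoring exceeds 1% of step time (the budget is our assumption).
      while level < len(LEVELS) - 1 and monitor_us > 0.01 * step_us:
          monitor_us /= 10  # assume each coarser level samples ~10x less often
          level += 1
      return LEVELS[level]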

Load-bearing premise

That the cycle-accurate simulator captures every relevant overhead of the dynamic scheduler, and that the bimodal expert distributions observed in the evaluated models will persist in future MoE deployments.

What would settle it

Running the full Sieve scheduler on physical multi-GPU hardware with HBM-PIM stacks and measuring end-to-end latency against a static PIM baseline on a new MoE model whose expert activations are nearly uniform.

Figures

Figures reproduced from arXiv: 2605.11277 by Christos Kozyrakis, Genghan Zhang, Gina Sohn, Jungwoo Kim, Kunle Olukotun, Qizheng Zhang, Rubens Lacouture, Swapnil Gandhi.

Figure 2. Comparison of the key differences between dense …
Figure 3. Benchmarking the sparsity–capability relationship.
Figure 4. Roofline models of the NVIDIA B200 GPU …
Figure 5. Proportion of GEMV, skinny GEMM, and GEMM experts in …
Figure 6. Execution flow examples in PIM-enabled systems …
Figure 7. An overview of Sieve. Hardware and model configurations are determined before serving MoE models. The Sieve scheduler leverages the runtime-determined distribution of tokens across experts to enable efficient co-execution of GPU and PIM computations.
Figure 8. An overview of the Sieve System and the dependencies across operations.
Figure 9. Evaluation of throughput and interactivity achieved by …
Figure 10. Utilization of PIM channels when running …
Figure 11. Evaluation of throughput and interactivity …

(Captions truncated at source; the figures themselves are available in the arXiv version.)
Original abstract

Mixture-of-Experts (MoE) has become a dominant architecture for scaling large language models (LLMs). However, the execution characteristics of MoE inference are changing rapidly and increasingly mismatch the assumptions underlying existing Processing-in-Memory (PIM) systems. Prior PIM systems for LLMs rely on static rules to offload memory-bound operations to PIM, without accounting for the combined effects of load imbalance and inter-GPU communication. Meanwhile, modern MoE models activate fewer experts out of increasingly many, creating a bimodal expert distribution: a small set of experts receives many tokens, while a long tail of experts receives only one or a few. We identify a trend in modern MoE models toward increasingly bimodal token-to-expert distributions, quantify the resulting disparity in arithmetic intensity across experts, and show that this disparity dramatically reduces the efficiency of state-of-the-art PIM systems for LLMs. To address this problem, we propose a scheduler for serving MoE models on multi-GPU systems with attached HBM-PIM stacks. Our scheduler partitions expert execution between GPU and PIM based on runtime token-to-expert distributions, while jointly considering interconnect overhead, memory bandwidth, GPU throughput, and PIM throughput. Moreover, we propose Sieve, a runtime framework that employs the scheduler to coordinate execution across GPUs and their attached HBM-PIM stacks. Sieve overlaps GPU computation, PIM computation, and intra- and inter-device communication while preserving cross-device dependencies induced by expert parallelism. Sieve is evaluated on our cycle-accurate simulator based on Ramulator 2.0. Compared to state-of-the-art PIM systems for MoE, Sieve improves both throughput and interactivity by 1.3x, 1.3x, and 1.6x on Qwen3.5-397B-A17B, GPT-OSS-120B, and Qwen3-30B-A3B, respectively.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

3 major / 2 minor

Summary. The manuscript presents Sieve, a runtime framework and dynamic scheduler for accelerating inference of evolving Mixture-of-Experts (MoE) large language models on multi-GPU systems equipped with HBM-based Processing-in-Memory (PIM) stacks. The authors observe that modern MoE models exhibit increasingly bimodal token-to-expert activation distributions, leading to varying arithmetic intensities that static PIM offloading strategies fail to handle efficiently due to load imbalance and inter-GPU communication. Sieve's scheduler dynamically partitions expert execution between GPUs and PIM devices based on runtime token counts, while jointly optimizing for interconnect overhead, bandwidth, and throughputs, and overlaps GPU and PIM computation with communication while respecting expert parallelism dependencies. Using a cycle-accurate simulator extending Ramulator 2.0, the paper reports that Sieve improves both throughput and interactivity by 1.3× on Qwen3.5-397B-A17B, 1.3× on GPT-OSS-120B, and 1.6× on Qwen3-30B-A3B compared to prior state-of-the-art PIM systems for MoE.

Significance. If the reported performance improvements prove robust once all runtime overheads are accurately modeled, this work would represent a meaningful advance in PIM acceleration for large-scale MoE inference. It directly targets the emerging mismatch between static PIM designs and the dynamic, bimodal activation patterns of contemporary MoE models, offering a practical scheduling approach that accounts for load imbalance and communication costs. The quantification of the bimodal distribution trend supplies useful empirical grounding for future hardware-software co-design efforts in LLM serving.

major comments (3)
  1. [§5] §5 (Evaluation and Simulator): The central speedups (1.3–1.6×) are obtained from a cycle-accurate simulator based on Ramulator 2.0, yet the manuscript supplies no description of how runtime token-to-expert distribution monitoring, scheduler decision latency, or the additional cross-device synchronization traffic generated by dynamic partitioning are modeled at cycle granularity. If these costs are omitted or underestimated, the net advantage over static PIM baselines cannot be substantiated.
  2. [§4] §4 (Scheduler Design): The scheduler is said to partition experts by jointly considering interconnect overhead, memory bandwidth, GPU throughput, and PIM throughput, but the text provides neither pseudocode, decision algorithm, nor cost-model equations. Without these, it is impossible to verify that the partitioning logic itself does not introduce overheads that offset the claimed gains.
  3. [§3] §3 (Motivation): The bimodal token-to-expert distribution is demonstrated on the three evaluated models, but no analysis or sensitivity study shows whether this distribution persists across future MoE scales or training regimes. The headline claims rest on the assumption that the observed pattern generalizes; absent such evidence, the broader applicability of the dynamic scheduler remains unproven.
minor comments (2)
  1. [§5] The term 'interactivity' is used in the abstract and results but is never explicitly defined (e.g., as 99th-percentile latency or token generation time); a clear metric definition should be added in §5.
  2. Figure captions and legends in the evaluation section would benefit from explicit listing of all compared systems and their configurations to improve readability.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the thoughtful and constructive comments. We address each major point below and indicate the revisions we will make to strengthen the manuscript.

Point-by-point responses
  1. Referee: [§5] §5 (Evaluation and Simulator): The central speedups (1.3–1.6×) are obtained from a cycle-accurate simulator based on Ramulator 2.0, yet the manuscript supplies no description of how runtime token-to-expert distribution monitoring, scheduler decision latency, or the additional cross-device synchronization traffic generated by dynamic partitioning are modeled at cycle granularity. If these costs are omitted or underestimated, the net advantage over static PIM baselines cannot be substantiated.

    Authors: We agree that the modeling of these runtime overheads requires more explicit description for full transparency. The simulator extends Ramulator 2.0 with support for dynamic expert partitioning, where token-to-expert monitoring is implemented via lightweight periodic sampling whose cycle cost is included in the execution trace, scheduler decisions incur a modeled fixed-latency overhead per invocation, and additional synchronization traffic is captured within the existing interconnect model. However, these details are only summarized rather than fully elaborated. We will expand Section 5 with a dedicated subsection on overhead modeling, including the exact cycle costs and assumptions used, to confirm that they do not negate the reported gains (an illustrative accounting of this kind is sketched after these responses). revision: yes

  2. Referee: [§4] §4 (Scheduler Design): The scheduler is said to partition experts by jointly considering interconnect overhead, memory bandwidth, GPU throughput, and PIM throughput, but the text provides neither pseudocode, decision algorithm, nor cost-model equations. Without these, it is impossible to verify that the partitioning logic itself does not introduce overheads that offset the claimed gains.

    Authors: We concur that the absence of pseudocode and explicit cost-model equations limits verifiability. Section 4 describes the joint optimization criteria and the high-level partitioning heuristic, but does not present the algorithmic steps or equations in a formal manner. We will add pseudocode for the decision procedure and the full set of cost-model equations (including how interconnect, bandwidth, and throughput terms are combined) to the revised manuscript so that readers can independently assess overheads. revision: yes

  3. Referee: [§3] §3 (Motivation): The bimodal token-to-expert distribution is demonstrated on the three evaluated models, but no analysis or sensitivity study shows whether this distribution persists across future MoE scales or training regimes. The headline claims rest on the assumption that the observed pattern generalizes; absent such evidence, the broader applicability of the dynamic scheduler remains unproven.

    Authors: The manuscript quantifies the bimodal pattern on three contemporary large-scale MoE models that already span a range of sizes and expert counts. We will add a short discussion in Section 3 explaining why the underlying design trends (increasing total experts while keeping per-token activation sparse) make continued bimodality likely. A full sensitivity study on hypothetical future scales or training regimes is not feasible within the current evaluation, as it would require access to models that do not yet exist. revision: partial
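
Returning to the first response: the accounting it describes could be folded into a simulated step as simply as the sketch below. The structure follows the rebuttal's wording; every constant is a placeholder of ours, not a number from the paper or simulator.

  # Sketch: charging monitoring, scheduling, and sync overheads to a simulated step.
  MONITOR_CYCLES_PER_SAMPLE = 200  # periodic token-count sampling (placeholder)
  SCHEDULER_FIXED_CYCLES = 5_000   # fixed latency per scheduler invocation (placeholder)
  SYNC_BYTES = 4_096               # placement metadata exchanged per repartition (placeholder)

  def step_cycles(compute_cycles, n_samples, repartitioned, link_bytes_per_cycle):
      cycles = compute_cycles + n_samples * MONITOR_CYCLES_PER_SAMPLE
      if repartitioned:  # only steps that change the partition pay scheduler + sync cost
          cycles += SCHEDULER_FIXED_CYCLES + SYNC_BYTES / link_bytes_per_cycle
      return cycles

Whether the net advantage survives then reduces to verifying that these added cycles stay small relative to the expert-FFN cycles they rebalance.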

Circularity Check

0 steps flagged

No significant circularity; performance claims are direct empirical measurements from simulator runs

Full rationale

The paper proposes a dynamic scheduler for partitioning experts between GPU and PIM based on observed token distributions, then evaluates the resulting Sieve framework via cycle-accurate simulation against static baselines. Reported speedups (1.3–1.6×) are measured outcomes on three concrete models; they do not reduce via any equations to fitted parameters, self-referential definitions, or load-bearing self-citations. The analysis of bimodal distributions is an empirical observation used to motivate the design, not a derivation that presupposes the result. The simulator (Ramulator 2.0) is an external tool, and no uniqueness theorems or ansatzes are imported from prior author work to force the outcome.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 1 invented entity

The central claim rests on the domain assumption that bimodal token distributions are now typical and that runtime monitoring can be performed with low enough overhead to enable beneficial partitioning decisions. The Sieve framework itself is an invented coordination layer. No explicit numerical free parameters are described in the abstract.

axioms (2)
  • domain assumption Modern MoE models exhibit increasingly bimodal token-to-expert distributions that create large disparities in arithmetic intensity across experts.
    Identified and quantified by the authors as the root cause of reduced PIM efficiency.
  • domain assumption A runtime scheduler can accurately weigh interconnect overhead, memory bandwidth, GPU throughput, and PIM throughput to make per-expert placement decisions.
    Core premise of the proposed partitioning logic.
invented entities (1)
  • Sieve runtime framework and scheduler · no independent evidence
    purpose: Coordinates dynamic partitioning of expert execution across GPUs and attached HBM-PIM stacks while overlapping computation and communication.
    New software layer introduced to address the static-rule limitations of prior PIM systems.

pith-pipeline@v0.9.0 · 5685 in / 1706 out tokens · 49641 ms · 2026-05-13T00:50:53.272479+00:00 · methodology

