
arXiv: 2605.20982 · v1 · pith:FXHQIZXMnew · submitted 2026-05-20 · 💻 cs.DC · cs.AI · cs.LG

Diagnosing Overhead in Dispatch Operations: Cross-architecture Observatory

Pith reviewed 2026-05-21 02:06 UTC · model grok-4.3

classification 💻 cs.DC · cs.AI · cs.LG
keywords mixture of experts · expert parallelism · AlltoAll dispatch · routing imbalance · MoE diagnostics · token distribution · Gini coefficient

The pith

MoE routing imbalance is intrinsic to model decisions, not fixed by expert placement or EP scaling.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tests two common assumptions behind efforts to speed up AlltoAll dispatch in mixture-of-experts models: that system-level fixes like expert relayout can correct routing imbalance, and that mock-token benchmarks accurately reflect real workloads. By running five different MoE architectures across real text and mock data plus an EP scan from 4 to 32 ranks, the measurements show that scaling expert parallelism changes the max-to-mean token ratio by at most 5 percent. Real text produces far lower and more stable imbalance than mock data, and the five models fall into two consistent groups based on how concentrated their routing stays.

Core claim

Scaling EP changes the per-expert max/mean token ratio by at most 5% within every architecture's measurable range: the straggler is intrinsic to the routing decision the model makes, not to how its experts land on ranks. Mock tokens overestimate routing Gini by up to a factor of 2.35 and fabricate a batch-size scaling trend that vanishes the moment real text replaces random IDs. The five architectures cleave into two stable bands: MHA and Mamba-2 drop to Gini 0.105 and 0.150 on wikitext while MLA and GDN stay above 0.24 on every real-text condition.
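
The two imbalance metrics named in the core claim can be computed from a list of per-expert token counts. The following is a minimal sketch, not the paper's implementation: the function names and the example counts are illustrative assumptions, and the Gini coefficient uses the standard sorted-form identity.

```python
def max_mean_ratio(counts):
    """Busiest expert's token count divided by the mean count (1.0 = balanced)."""
    mean = sum(counts) / len(counts)
    return max(counts) / mean


def gini(counts):
    """Gini coefficient of non-negative counts (0.0 = perfectly even)."""
    xs = sorted(counts)
    n = len(xs)
    total = sum(xs)
    if total == 0:
        return 0.0
    # Sorted-form identity: G = 2 * sum(i * x_i) / (n * total) - (n + 1) / n
    weighted = sum(i * x for i, x in enumerate(xs, start=1))
    return 2.0 * weighted / (n * total) - (n + 1) / n


# Hypothetical per-expert token counts, not data from the paper.
balanced = [100, 100, 100, 100]
skewed = [10, 20, 30, 340]

print(max_mean_ratio(balanced), gini(balanced))
print(max_mean_ratio(skewed), gini(skewed))
```

Both metrics take only the per-expert histogram as input, so neither depends on how experts are assigned to ranks.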

What carries the argument

DODOCO, an observatory that instruments multiple MoE checkpoints under a 5-by-6 grid of data conditions and an EP sweep to measure per-expert token ratios and Gini coefficients directly.
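
The paper's instrumentation is not described on this page, so the following is only a sketch of the kind of bookkeeping such an observatory needs: turning top-k routing decisions into a per-expert token histogram. The function name `expert_token_counts` and the toy routing table are hypothetical.

```python
from collections import Counter


def expert_token_counts(topk_assignments, num_experts):
    """Count how many token slots each expert receives.

    topk_assignments: one list of expert ids per token (its top-k choices).
    """
    counts = Counter()
    for experts in topk_assignments:
        counts.update(experts)
    # Include experts that received nothing so the histogram has fixed length.
    return [counts.get(e, 0) for e in range(num_experts)]


# Hypothetical routing for 4 tokens, top-2 selection over 4 experts.
routing = [[0, 1], [0, 2], [0, 1], [3, 0]]
counts = expert_token_counts(routing, num_experts=4)
print(counts)  # [4, 2, 1, 1]
```

A histogram of this shape, collected per MoE layer and per data condition, is the input that a max/mean ratio or Gini coefficient would be computed from.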

If this is right

  • AlltoAll-aware interconnect designs should treat routing bands as fixed workload inputs instead of tuning for EP degree.
  • Mock-token benchmarks are unsuitable for evaluating dispatch mitigations because they exaggerate imbalance and invent nonexistent scaling trends.
  • Adaptive expert relayout and predictive placement offer at most marginal relief once the model's routing decisions are fixed.
  • Dispatch optimizations should prioritize the persistently concentrated band (MLA, GDN) as the worst-case input.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Model training or fine-tuning could target routing algorithms directly to shrink the high-Gini band rather than relying on runtime fixes.
  • The two-band split may appear in larger models or different tasks, offering a quick way to classify new architectures before hardware tuning.
  • Interconnect vendors could build separate collective paths tuned to the resilient versus concentrated routing patterns observed here.

Load-bearing premise

The 5 by 6 grid of data conditions and real-text workloads used in DODOCO faithfully capture the routing behavior of production MoE deployments across the tested sequence-mixer designs.

What would settle it

A production MoE run on real user traffic that shows more than 5% change in per-expert max/mean token ratio when expert parallelism is scaled from 8 to 64 ranks, or whose routing Gini matches the mock-token values rather than the real-text band.
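
The falsification test above amounts to a threshold check on how far the max/mean ratio moves across EP degrees. The sketch below assumes "change" means relative deviation from the smallest-EP baseline; the page does not state the paper's exact definition, and the numbers are illustrative, not measurements.

```python
def max_relative_change(ratio_by_ep):
    """Largest relative deviation of the ratio from the smallest-EP baseline."""
    eps = sorted(ratio_by_ep)
    base = ratio_by_ep[eps[0]]
    return max(abs(ratio_by_ep[ep] - base) / base for ep in eps)


# Illustrative max/mean ratios keyed by EP degree (hypothetical values).
observed = {8: 2.00, 16: 2.04, 32: 1.98, 64: 2.06}
change = max_relative_change(observed)
verdict = "exceeds 5%" if change > 0.05 else "within 5%"
print(f"{change:.1%} {verdict}")
```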

Figures

Figures reproduced from arXiv: 2605.20982 by Bole Ma, Gerhard Wellein, Harald Koestler, Jan Eitzinger.

Figure 1: (a) Per-expert load imbalance versus EP degree, matched configura…
Figure 2: Routing Gini across five architectures and six data conditions, EP=16, …
Figure 3: MLA vs MHA, the architecture-controlled pair (same expert count, …
Figure 4: Routing Gini at each MoE layer under wikitext (solid) and mock (dotted) input. MHA stays below …
Figure 5: Random-init routing Gini under (a) mock tokens vs (b) wikitext …
Original abstract

AlltoAll dispatch is the dominant bottleneck of MoE expert parallelism, and the interconnect community has responded with four families of mitigations: predictive sample placement, adaptive expert relayout, hierarchical collectives, and EP-aware topology. All four rest on two assumptions about the workload. The first is that routing imbalance is correctable by the system layer. The second is that the mock-token benchmarks evaluating them faithfully represent production routing. We introduce DODOCO to test both assumptions. We instrument five MoE checkpoints spanning five sequence-mixer designs (DeepSeek-V2-Lite MLA, DeepSeek-MoE-16B MHA, Qwen3-30B GQA, Nemotron-30B Mamba-2, Qwen3.5-35B GDN) under a 5 by 6 grid of data conditions plus a matched EP scan from 4 to 32 ranks on H100s; both assumptions fail. Scaling EP changes the per-expert max/mean token ratio by at most 5% within every architecture's measurable range: the straggler is intrinsic to the routing decision the model makes, not to how its experts land on ranks. Mock tokens overestimate routing Gini by up to a factor of 2.35 and fabricate a batch-size scaling trend that vanishes the moment real text replaces random IDs. A third pattern, unexpected, emerges from the same matrix: the five architectures cleave into two stable bands. MHA and Mamba-2 (data-resilient) drop to Gini 0.105 and 0.150 on wikitext. MLA and GDN (persistently concentrated) stay above 0.24 on every real-text condition and reach 0.29 to 0.38 on mock. GQA is the intermediate case. These bands, not the EP degree or the mock-data profile, are the right workload input to AlltoAll-aware interconnect and dispatch design.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper introduces DODOCO to diagnose overhead in AlltoAll dispatch operations for MoE models. It instruments five MoE checkpoints spanning five sequence-mixer designs under a 5 by 6 grid of data conditions plus a matched EP scan from 4 to 32 ranks on H100s. Key findings are that EP scaling changes the per-expert max/mean token ratio by at most 5% (straggler intrinsic to routing not placement), mock tokens overestimate routing Gini by up to 2.35x and fabricate spurious scaling trends, and the architectures cleave into two stable bands (MHA/Mamba-2 data-resilient with low Gini on real text; MLA/GDN persistently concentrated).

Significance. If the measurements hold, the results would meaningfully inform AlltoAll and dispatch design for MoE systems by showing that common mitigations rest on flawed assumptions about correctability via placement and fidelity of mock benchmarks. The cross-architecture empirical matrix and real-text workloads constitute a useful observatory contribution.

major comments (2)
  1. [Abstract and EP scaling results] The stability of the per-expert max/mean token ratio (≤5% change) under EP scaling from 4 to 32 is expected by construction, since this ratio is computed solely from the model's token-to-expert routing assignments before any expert-to-rank placement occurs. This metric therefore does not provide evidence that routing imbalance cannot be mitigated by system-layer techniques such as adaptive expert relayout or predictive sample placement, weakening the claim that the first assumption fails.
  2. [DODOCO description and experimental protocol] The manuscript reports concrete quantitative outcomes (5% bound, 2.35 factor, Gini bands) but supplies no details on instrumentation implementation, statistical controls for token distributions, or selection of the 5 by 6 grid of data conditions and real-text workloads. This is load-bearing for verification of the central measurements.
minor comments (1)
  1. [Notation and metrics] The manuscript would benefit from an explicit definition of how the per-expert max/mean token ratio is calculated from routing decisions and how it maps to observed rank-level dispatch stragglers.
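
The distinction the referee draws between the expert-level metric and rank-level stragglers can be illustrated with a small sketch. It assumes experts are laid out on ranks in contiguous, equal-sized blocks; that layout, the function names, and the token counts are illustrative assumptions, not details from the paper.

```python
def max_mean(values):
    """Largest value divided by the mean value."""
    return max(values) / (sum(values) / len(values))


def rank_loads(expert_counts, ep):
    """Per-rank token load under contiguous equal-sized expert blocks.

    Assumes len(expert_counts) is divisible by ep.
    """
    per_rank = len(expert_counts) // ep
    return [sum(expert_counts[r * per_rank:(r + 1) * per_rank]) for r in range(ep)]


# Hypothetical per-expert token counts for 8 experts.
counts = [400, 300, 50, 50, 50, 50, 50, 50]
expert_ratio = max_mean(counts)  # computed before any placement; EP never enters

for ep in (2, 4, 8):
    print(ep, expert_ratio, max_mean(rank_loads(counts, ep)))
```

In this toy case the expert-level ratio is the same for every EP degree because EP is not an input to it, while the rank-level ratio varies with how experts are grouped. That is the sense in which stability of the expert-level metric is expected by construction.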

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed review. The comments raise valid points about the scope of our EP scaling claims and the need for greater methodological transparency. We address each major comment below and will revise the manuscript to strengthen clarity and reproducibility while preserving the core empirical contributions.

Point-by-point responses
  1. Referee: [Abstract and EP scaling results] The stability of the per-expert max/mean token ratio (≤5% change) under EP scaling from 4 to 32 is expected by construction, since this ratio is computed solely from the model's token-to-expert routing assignments before any expert-to-rank placement occurs. This metric therefore does not provide evidence that routing imbalance cannot be mitigated by system-layer techniques such as adaptive expert relayout or predictive sample placement, weakening the claim that the first assumption fails.

    Authors: We appreciate the referee's clarification on the pre-placement nature of the metric. The per-expert max/mean token ratio is computed from routing assignments alone, and its stability (≤5% change) across EP degrees from 4 to 32 was intended to show that the imbalance is a persistent property of the model's routing decisions rather than an artifact of a particular EP configuration or expert-to-rank mapping. This still informs the first assumption by demonstrating that simply scaling the degree of expert parallelism does not correct the observed straggler. That said, we agree the results do not directly test adaptive relayout or predictive placement, which could in principle rebalance rank loads even with fixed per-expert token counts. We will revise the abstract, introduction, and discussion to more precisely delimit the claim, stating that the EP scan rules out correction via EP scaling while leaving open other system-layer mitigations. revision: yes

  2. Referee: [DODOCO description and experimental protocol] The manuscript reports concrete quantitative outcomes (5% bound, 2.35 factor, Gini bands) but supplies no details on instrumentation implementation, statistical controls for token distributions, or selection of the 5 by 6 grid of data conditions and real-text workloads. This is load-bearing for verification of the central measurements.

    Authors: We agree that the current manuscript lacks sufficient detail on these aspects, which are essential for independent verification. In the revised version we will add a new subsection (and corresponding appendix) that describes: (1) the instrumentation hooks used to capture per-expert token counts during forward passes, (2) the statistical controls applied (including number of samples per condition, variance reporting, and any filtering of sequence lengths), and (3) the rationale and exact composition of the 5×6 data-condition grid together with the specific real-text corpora. This addition will make the quantitative outcomes reproducible without altering the reported results. revision: yes
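
The first response concedes that relayout could rebalance rank loads even when per-expert token counts are fixed. A minimal sketch of that idea follows, using the generic longest-processing-time (LPT) greedy heuristic. This is not the algorithm of any cited system. It also ignores the usual constraint that each rank hosts the same number of experts, and the token counts are hypothetical.

```python
import heapq


def greedy_placement(expert_counts, ep):
    """LPT heuristic: place the heaviest remaining expert on the lightest rank."""
    heap = [(0, r) for r in range(ep)]  # (current load, rank id)
    heapq.heapify(heap)
    placement = {}
    order = sorted(range(len(expert_counts)), key=lambda e: -expert_counts[e])
    for e in order:
        load, r = heapq.heappop(heap)
        placement[e] = r
        heapq.heappush(heap, (load + expert_counts[e], r))
    loads = [0] * ep
    for e, r in placement.items():
        loads[r] += expert_counts[e]
    return placement, loads


# Hypothetical per-expert token counts (expert-level max/mean = 3.2).
counts = [400, 300, 50, 50, 50, 50, 50, 50]
_, loads = greedy_placement(counts, ep=2)
print(loads, max(loads) / (sum(loads) / len(loads)))
```

With two ranks the heuristic evens out the rank loads in this toy case, although the per-expert histogram is unchanged. With one expert per rank (`ep=8` here) placement can only permute the loads, so the rank-level ratio equals the expert-level one unless experts are replicated or split.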

Circularity Check

0 steps flagged

No significant circularity; claims rest on direct empirical measurements

full rationale

The paper's central results derive from instrumenting real MoE checkpoints under a 5×6 grid of data conditions and an EP scan, directly computing token-to-expert assignments, per-expert max/mean ratios, and Gini coefficients from observed routing decisions. These quantities are measured prior to any expert-to-rank mapping, so the reported stability (≤5% change) and mock-token discrepancies are independent observations rather than quantities forced by definition, fitted parameters renamed as predictions, or self-citation chains. No equations, ansatzes, or uniqueness theorems appear that reduce the findings to their inputs; the architecture-band split likewise emerges from the data matrix without prior assumptions that presuppose the outcome. The work is therefore self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The central claims rest on standard definitions of routing imbalance and the representativeness of the chosen data conditions and architectures; no new physical entities or fitted constants are introduced beyond the measurement framework itself.

axioms (1)
  • domain assumption: Gini coefficient and max/mean token ratio are appropriate quantifiers of routing imbalance for AlltoAll dispatch analysis
    Invoked throughout the description of results on real and mock data.
invented entities (1)
  • DODOCO (no independent evidence)
    purpose: Cross-architecture observatory instrumenting MoE checkpoints for dispatch measurements
    New tool introduced to test the two workload assumptions.
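
For reference, the two quantifiers named in the axiom can be written out as follows. The page does not reproduce the paper's notation, so the symbols are assumptions: $n_e$ is the number of tokens routed to expert $e$, $E$ is the number of experts, $\rho$ is the max/mean ratio, and $G$ is the Gini coefficient in its standard mean-absolute-difference form.

```latex
\rho = \frac{\max_{e} n_e}{\frac{1}{E}\sum_{e=1}^{E} n_e},
\qquad
G = \frac{\sum_{i=1}^{E}\sum_{j=1}^{E} \lvert n_i - n_j \rvert}{2E \sum_{e=1}^{E} n_e},
\qquad
1 \le \rho \le E,
\quad
0 \le G \le \frac{E-1}{E}.
```

Perfectly uniform routing gives $\rho = 1$ and $G = 0$. Routing every token to a single expert gives $\rho = E$ and $G = (E-1)/E$.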



