Diagnosing Overhead in Dispatch Operations: Cross-architecture Observatory

Bole Ma; Gerhard Wellein; Harald Koestler; Jan Eitzinger

arXiv: 2605.20982 · v1 · pith:FXHQIZXMnew · submitted 2026-05-20 · 💻 cs.DC · cs.AI · cs.LG

Pith reviewed 2026-05-21 02:06 UTC · model grok-4.3

keywords: mixture of experts · expert parallelism · AlltoAll dispatch · routing imbalance · MoE diagnostics · token distribution · Gini coefficient

The pith

MoE routing imbalance is intrinsic to model decisions, not fixed by expert placement or EP scaling.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tests two common assumptions behind efforts to speed up AlltoAll dispatch in mixture-of-experts models: that system-level fixes like expert relayout can correct routing imbalance, and that mock-token benchmarks accurately reflect real workloads. By running five different MoE architectures across real text and mock data plus an EP scan from 4 to 32 ranks, the measurements show that scaling expert parallelism changes the max-to-mean token ratio by at most 5 percent. Real text produces far lower and more stable imbalance than mock data, and the five models fall into two consistent groups based on how concentrated their routing stays.

Core claim

Scaling EP changes the per-expert max/mean token ratio by at most 5% within every architecture's measurable range: the straggler is intrinsic to the routing decision the model makes, not to how its experts land on ranks. Mock tokens overestimate routing Gini by up to a factor of 2.35 and fabricate a batch-size scaling trend that vanishes the moment real text replaces random IDs. The five architectures cleave into two stable bands: MHA and Mamba-2 drop to Gini 0.105 and 0.150 on wikitext while MLA and GDN stay above 0.24 on every real-text condition.

What carries the argument

DODOCO, an observatory that instruments multiple MoE checkpoints under a 5-by-6 grid of data conditions and an EP sweep to measure per-expert token ratios and Gini coefficients directly.
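The two imbalance metrics named above can be sketched in a few lines. This is a minimal illustration, not DODOCO's implementation: the function names are hypothetical, and the Gini formula used here is the standard sorted-rank form, which is an assumption because the page does not state which Gini variant the paper uses.

```python
from typing import Sequence


def max_mean_ratio(counts: Sequence[int]) -> float:
    """Per-expert max/mean token ratio; 1.0 means perfectly balanced routing."""
    total = sum(counts)
    if not counts or total == 0:
        raise ValueError("need at least one expert and one routed token")
    mean = total / len(counts)
    return max(counts) / mean


def gini(counts: Sequence[int]) -> float:
    """Gini coefficient of per-expert token counts (assumed standard form).

    Uses G = 2 * sum_i(i * x_i) / (n * sum(x)) - (n + 1) / n
    with x sorted ascending and i running from 1 to n.
    """
    n = len(counts)
    total = sum(counts)
    if n == 0 or total == 0:
        raise ValueError("need at least one expert and one routed token")
    ordered = sorted(counts)
    weighted = sum(i * x for i, x in enumerate(ordered, start=1))
    return 2 * weighted / (n * total) - (n + 1) / n
```

With this form, a uniform count vector gives a Gini of 0 and a ratio of 1, while routing every token to one of n experts gives a Gini of (n - 1) / n and a ratio of n.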

If this is right

AlltoAll-aware interconnect designs should treat routing bands as fixed workload inputs instead of tuning for EP degree.
Mock-token benchmarks are unsuitable for evaluating dispatch mitigations because they exaggerate imbalance and invent nonexistent scaling trends.
Adaptive expert relayout and predictive placement offer at most marginal relief once the model's routing decisions are fixed.
Dispatch optimizations should prioritize the persistently concentrated band (MLA, GDN) as the worst-case input.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Model training or fine-tuning could target routing algorithms directly to shrink the high-Gini band rather than relying on runtime fixes.
The two-band split may appear in larger models or different tasks, offering a quick way to classify new architectures before hardware tuning.
Interconnect vendors could build separate collective paths tuned to the resilient versus concentrated routing patterns observed here.

Load-bearing premise

The 5 by 6 grid of data conditions and real-text workloads used in DODOCO faithfully capture the routing behavior of production MoE deployments across the tested sequence-mixer designs.

What would settle it

A production MoE run on real user traffic that shows more than 5% change in per-expert max/mean token ratio when expert parallelism is scaled from 8 to 64 ranks, or whose routing Gini matches the mock-token values rather than the real-text band.
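The first half of this test can be phrased as a small check. The sketch below assumes that "change" means the largest relative deviation from the ratio measured at the smallest EP degree; the paper's exact definition is not given on this page, and the function names are hypothetical.

```python
from typing import Mapping


def max_relative_change(ratio_by_ep: Mapping[int, float]) -> float:
    """Largest relative deviation from the ratio at the smallest EP degree."""
    if not ratio_by_ep:
        raise ValueError("need at least one EP measurement")
    baseline = ratio_by_ep[min(ratio_by_ep)]
    if baseline <= 0:
        raise ValueError("baseline ratio must be positive")
    return max(abs(value - baseline) / baseline for value in ratio_by_ep.values())


def exceeds_bound(ratio_by_ep: Mapping[int, float], bound: float = 0.05) -> bool:
    """True if the measured ratios move by more than `bound` (5% by default)."""
    return max_relative_change(ratio_by_ep) > bound
```

For example, a made-up measurement set such as `{8: 2.0, 64: 2.05}` stays inside the bound, whereas `{8: 2.0, 64: 2.2}` would count as the kind of evidence described above. These numbers are illustrative only and do not come from the paper.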

Figures

Figures reproduced from arXiv: 2605.20982 by Bole Ma, Gerhard Wellein, Harald Koestler, Jan Eitzinger.

**Figure 2.** Routing Gini across five architectures and six data conditions, EP=16, …

**Figure 3.** MLA vs MHA, the architecture-controlled pair (same expert count, …

**Figure 4.** Routing Gini at each MoE layer under wikitext (solid) and mock (dotted) input. MHA stays below …

**Figure 5.** Random-init routing Gini under (a) mock tokens vs (b) wikitext …

Original abstract

AlltoAll dispatch is the dominant bottleneck of MoE expert parallelism, and the interconnect community has responded with four families of mitigations: predictive sample placement, adaptive expert relayout, hierarchical collectives, and EP-aware topology. All four rest on two assumptions about the workload. The first is that routing imbalance is correctable by the system layer. The second is that the mock-token benchmarks evaluating them faithfully represent production routing. We introduce DODOCO to test both assumptions. We instrument five MoE checkpoints spanning five sequence-mixer designs (DeepSeek-V2-Lite MLA, DeepSeek-MoE-16B MHA, Qwen3-30B GQA, Nemotron-30B Mamba-2, Qwen3.5-35B GDN) under a 5 by 6 grid of data conditions plus a matched EP scan from 4 to 32 ranks on H100s; both assumptions fail. Scaling EP changes the per-expert max/mean token ratio by at most 5% within every architecture's measurable range: the straggler is intrinsic to the routing decision the model makes, not to how its experts land on ranks. Mock tokens overestimate routing Gini by up to a factor of 2.35 and fabricate a batch-size scaling trend that vanishes the moment real text replaces random IDs. A third pattern, unexpected, emerges from the same matrix: the five architectures cleave into two stable bands. MHA and Mamba-2 (data-resilient) drop to Gini 0.105 and 0.150 on wikitext. MLA and GDN (persistently concentrated) stay above 0.24 on every real-text condition and reach 0.29 to 0.38 on mock. GQA is the intermediate case. These bands, not the EP degree or the mock-data profile, are the right workload input to AlltoAll-aware interconnect and dispatch design.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

MoE routing imbalance stays stable with EP scaling and mock tokens badly overestimate it, with architectures splitting into two clear workload bands.

The punchline from this paper is that scaling expert parallelism from 4 to 32 ranks barely moves the per-expert max/mean token ratio (by 5% at most) across the architectures they tested. That means the straggler effect is baked into how the model routes tokens, not into how you map experts to ranks. They also show that mock tokens overestimate the Gini coefficient by up to 2.35 times and create scaling trends that disappear with real text.

What stands out as new is the cross-architecture view under matched conditions. They looked at five MoE models with different sequence mixers: MLA, MHA, GQA, Mamba-2, and GDN. The emergence of two stable bands, one where Gini drops low on real data for MHA and Mamba-2 and another where MLA and GDN stay concentrated, is a pattern that could help set better inputs for interconnect design.

The paper does a solid job providing quantitative outcomes from real checkpoints and a 5 by 6 grid of data conditions. The measurements appear to come directly from the routing decisions, with no obvious circularity. The results are credited to the full instrumentation described in the manuscript, which includes the EP scan on H100s.

One softer area is that the stability under EP scaling is what you would predict if the ratio is calculated before any rank assignment happens. It is confirmatory rather than a big twist, though it still usefully questions the assumptions behind predictive placement and adaptive relayout. The choice of workloads seems reasonable, but the paper's claim that they capture production behavior could use more justification in review.

This is aimed at people working on dispatch and AlltoAll for large AI models, especially those dealing with MoE parallelism. A reader interested in practical efficiency gains at scale would get value from the workload characterization and the warning about mock benchmarks. It has enough empirical grounding to deserve a serious referee.

I'd recommend sending this to peer review. The data challenges some standard practices in a measurable way, and the architecture bands are worth discussing even if the EP part is expected.

Referee Report

2 major / 1 minor

Summary. The paper introduces DODOCO to diagnose overhead in AlltoAll dispatch operations for MoE models. It instruments five MoE checkpoints spanning five sequence-mixer designs under a 5 by 6 grid of data conditions plus a matched EP scan from 4 to 32 ranks on H100s. Key findings are that EP scaling changes the per-expert max/mean token ratio by at most 5% (straggler intrinsic to routing not placement), mock tokens overestimate routing Gini by up to 2.35x and fabricate spurious scaling trends, and the architectures cleave into two stable bands (MHA/Mamba-2 data-resilient with low Gini on real text; MLA/GDN persistently concentrated).

Significance. If the measurements hold, the results would meaningfully inform AlltoAll and dispatch design for MoE systems by showing that common mitigations rest on flawed assumptions about correctability via placement and fidelity of mock benchmarks. The cross-architecture empirical matrix and real-text workloads constitute a useful observatory contribution.

major comments (2)

[Abstract and EP scaling results] The stability of the per-expert max/mean token ratio (≤5% change) under EP scaling from 4 to 32 is expected by construction, since this ratio is computed solely from the model's token-to-expert routing assignments before any expert-to-rank placement occurs. This metric therefore does not provide evidence that routing imbalance cannot be mitigated by system-layer techniques such as adaptive expert relayout or predictive sample placement, weakening the claim that the first assumption fails.
[DODOCO description and experimental protocol] The manuscript reports concrete quantitative outcomes (5% bound, 2.35 factor, Gini bands) but supplies no details on instrumentation implementation, statistical controls for token distributions, or selection of the 5 by 6 grid of data conditions and real-text workloads. This is load-bearing for verification of the central measurements.

minor comments (1)

[Notation and metrics] The manuscript would benefit from an explicit definition of how the per-expert max/mean token ratio is calculated from routing decisions and how it maps to observed rank-level dispatch stragglers.
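The distinction the referee draws between the per-expert ratio and rank-level stragglers can be illustrated with a toy model. The sketch below is an assumption-laden illustration, not the paper's method and not the algorithm of any cited system: it maps fixed per-expert token counts onto ranks with a contiguous layout and with a simple greedy heuristic (heaviest expert first, onto the least-loaded rank that still has a free slot). All function names are hypothetical.

```python
from typing import List, Sequence


def contiguous_placement(num_experts: int, ep: int) -> List[int]:
    """Rank index for each expert when experts are split into contiguous blocks."""
    if num_experts % ep:
        raise ValueError("expert count must be divisible by the EP degree")
    per_rank = num_experts // ep
    return [expert // per_rank for expert in range(num_experts)]


def greedy_placement(counts: Sequence[int], ep: int) -> List[int]:
    """Heaviest-first greedy placement with an equal number of experts per rank."""
    n = len(counts)
    if n % ep:
        raise ValueError("expert count must be divisible by the EP degree")
    capacity = n // ep
    loads = [0] * ep
    slots = [0] * ep
    placement = [0] * n
    for expert in sorted(range(n), key=lambda e: counts[e], reverse=True):
        open_ranks = [r for r in range(ep) if slots[r] < capacity]
        rank = min(open_ranks, key=lambda r: loads[r])
        placement[expert] = rank
        loads[rank] += counts[expert]
        slots[rank] += 1
    return placement


def rank_loads(counts: Sequence[int], placement: Sequence[int], ep: int) -> List[int]:
    """Tokens received by each rank under a given expert-to-rank placement."""
    loads = [0] * ep
    for expert, rank in enumerate(placement):
        loads[rank] += counts[expert]
    return loads


def max_mean(values: Sequence[int]) -> float:
    """Max/mean ratio of any load vector (experts or ranks)."""
    return max(values) / (sum(values) / len(values))
```

For the made-up counts `[40, 30, 5, 5, 5, 5, 5, 5]` at EP=4, the per-expert ratio is 3.2 regardless of placement, the contiguous layout gives a per-rank ratio of 2.8, and the greedy layout gives 1.8. In this toy model no placement without expert replication can go below 1.6, because the rank hosting the 40-token expert receives at least 40 of the 100 tokens against a per-rank mean of 25.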

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed review. The comments raise valid points about the scope of our EP scaling claims and the need for greater methodological transparency. We address each major comment below and will revise the manuscript to strengthen clarity and reproducibility while preserving the core empirical contributions.

Referee: [Abstract and EP scaling results] The stability of the per-expert max/mean token ratio (≤5% change) under EP scaling from 4 to 32 is expected by construction, since this ratio is computed solely from the model's token-to-expert routing assignments before any expert-to-rank placement occurs. This metric therefore does not provide evidence that routing imbalance cannot be mitigated by system-layer techniques such as adaptive expert relayout or predictive sample placement, weakening the claim that the first assumption fails.

Authors: We appreciate the referee's clarification on the pre-placement nature of the metric. The per-expert max/mean token ratio is computed from routing assignments alone, and its stability (≤5% change) across EP degrees from 4 to 32 was intended to show that the imbalance is a persistent property of the model's routing decisions rather than an artifact of a particular EP configuration or expert-to-rank mapping. This still informs the first assumption by demonstrating that simply scaling the degree of expert parallelism does not correct the observed straggler. That said, we agree the results do not directly test adaptive relayout or predictive placement, which could in principle rebalance rank loads even with fixed per-expert token counts. We will revise the abstract, introduction, and discussion to more precisely delimit the claim, stating that the EP scan rules out correction via EP scaling while leaving open other system-layer mitigations.
Revision: yes

Referee: [DODOCO description and experimental protocol] The manuscript reports concrete quantitative outcomes (5% bound, 2.35 factor, Gini bands) but supplies no details on instrumentation implementation, statistical controls for token distributions, or selection of the 5 by 6 grid of data conditions and real-text workloads. This is load-bearing for verification of the central measurements.

Authors: We agree that the current manuscript lacks sufficient detail on these aspects, which are essential for independent verification. In the revised version we will add a new subsection (and corresponding appendix) that describes: (1) the instrumentation hooks used to capture per-expert token counts during forward passes, (2) the statistical controls applied (including number of samples per condition, variance reporting, and any filtering of sequence lengths), and (3) the rationale and exact composition of the 5×6 data-condition grid together with the specific real-text corpora. This addition will make the quantitative outcomes reproducible without altering the reported results.
Revision: yes
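As a rough picture of what item (1) might involve, the sketch below counts per-expert tokens from router scores under top-k routing. It is purely illustrative: the manuscript's actual hooks are not described on this page, the function name is hypothetical, and a real hook would read router outputs from the model's MoE layers during the forward pass rather than from plain Python lists.

```python
import heapq
from typing import List, Sequence


def count_expert_tokens(
    router_logits: Sequence[Sequence[float]], num_experts: int, top_k: int
) -> List[int]:
    """Count how many tokens each expert receives under top-k routing.

    `router_logits` holds one row of per-expert scores for each token.
    """
    if not 0 < top_k <= num_experts:
        raise ValueError("top_k must be between 1 and num_experts")
    counts = [0] * num_experts
    for token_logits in router_logits:
        if len(token_logits) != num_experts:
            raise ValueError("each token needs one score per expert")
        # Indices of the top_k highest-scoring experts for this token.
        chosen = heapq.nlargest(top_k, range(num_experts), key=token_logits.__getitem__)
        for expert in chosen:
            counts[expert] += 1
    return counts
```

The resulting count vector, collected per MoE layer, is the kind of input from which a max/mean ratio or a Gini coefficient can then be computed.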

Circularity Check

0 steps flagged

No significant circularity; claims rest on direct empirical measurements

The paper's central results derive from instrumenting real MoE checkpoints under a 5×6 grid of data conditions and an EP scan, directly computing token-to-expert assignments, per-expert max/mean ratios, and Gini coefficients from observed routing decisions. These quantities are measured prior to any expert-to-rank mapping, so the reported stability (≤5% change) and mock-token discrepancies are independent observations rather than quantities forced by definition, fitted parameters renamed as predictions, or self-citation chains. No equations, ansatzes, or uniqueness theorems appear that reduce the findings to their inputs; the architecture-band split likewise emerges from the data matrix without prior assumptions that presuppose the outcome. The work is therefore self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The central claims rest on standard definitions of routing imbalance and the representativeness of the chosen data conditions and architectures; no new physical entities or fitted constants are introduced beyond the measurement framework itself.

axioms (1)

Domain assumption: Gini coefficient and max/mean token ratio are appropriate quantifiers of routing imbalance for AlltoAll dispatch analysis.
Invoked throughout the description of results on real and mock data.

invented entities (1)

DODOCO (no independent evidence)
Purpose: cross-architecture observatory instrumenting MoE checkpoints for dispatch measurements.
New tool introduced to test the two workload assumptions.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear

Relation between the paper passage and the cited Recognition theorem: unclear.

Paper passage: "Scaling EP changes the per-expert max/mean token ratio by at most 5% within every architecture's measurable range: the straggler is intrinsic to the routing decision the model makes, not to how its experts land on ranks."

IndisputableMonolith/Foundation/ArithmeticFromLogic.lean · embed_strictMono_of_one_lt · unclear

Relation between the paper passage and the cited Recognition theorem: unclear.

Paper passage: "the five architectures cleave into two stable bands. MHA and Mamba-2 (data-resilient) drop to Gini 0.105 and 0.150 on wikitext. MLA and GDN (persistently concentrated) stay above 0.24"

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.


work page arXiv 2022

[9] [9]

DeepSeek-V2: A strong, economical, and efficient mixture-of-experts language model,

DeepSeek-AI, “DeepSeek-V2: A strong, economical, and efficient mixture-of-experts language model,” 2024

work page 2024

[10] [11]

DeepSeekMoE: Towards Ultimate Expert Specialization in Mixture-of-Experts Language Models

[Online]. Available: https://arxiv.org/abs/2401.06066

work page internal anchor Pith review Pith/arXiv arXiv

[11] [12]

Qwen3 technical report,

A. Yang, A. Li, B. Yang, B. Zhang, B. Hui, B. Zhenget al., “Qwen3 technical report,” 2025. [Online]. Available: https://arxiv.org/abs/2505. 09388

work page 2025

[12] [13]

Nemotron 3 nano: Open, efficient mixture-of- experts hybrid mamba-transformer model for agentic reasoning,

NVIDIAet al., “Nemotron 3 nano: Open, efficient mixture-of- experts hybrid mamba-transformer model for agentic reasoning,” 2025. [Online]. Available: https://arxiv.org/abs/2512.20848

work page arXiv 2025

[13] [14]

Demystifying the communication characteristics for distributed transformer models,

Q. Anthony, B. Michalowicz, J. Hatef, L. Xu, M. Abduljabbar, A. Shafi, H. Subramoni, and D. Panda, “Demystifying the communication characteristics for distributed transformer models,” 2024. [Online]. Available: https://arxiv.org/abs/2408.10197

work page arXiv 2024

[14] [15]

From score distributions to balance: Plug-and-play mixture-of-experts routing,

R. Shahout, C. Cai, Y . Du, M. Yu, and M. Mitzenmacher, “From score distributions to balance: Plug-and-play mixture-of-experts routing,”

work page

[15] [16]

Available: https://arxiv.org/abs/2510.03293

[Online]. Available: https://arxiv.org/abs/2510.03293

work page arXiv

[16] [17]

Latent prototype routing: Achieving near-perfect load balancing in mixture-of-experts,

J. Yang, “Latent prototype routing: Achieving near-perfect load balancing in mixture-of-experts,” 2025. [Online]. Available: https: //arxiv.org/abs/2506.21328

work page arXiv 2025

[17] [18]

Moe-inference-bench: Performance evaluation of mixture of expert large language and vision models,

K. T. Chitty-Venkata, S. Howland, G. Azar, D. Soboleva, N. Vassilieva, S. Raskar, M. Emani, and V . Vishwanath, “Moe-inference-bench: Performance evaluation of mixture of expert large language and vision models,” inProceedings of the SC ’25 Workshops of the International Conference for High Performance Computing, Networking, Storage and Analysis, ser. SC ...

work page doi:10.1145/3731599.3767706 2025

[18] [19]

Switch transformers: scaling to trillion parameter models with simple and efficient sparsity,

W. Fedus, B. Zoph, and N. Shazeer, “Switch transformers: scaling to trillion parameter models with simple and efficient sparsity,”J. Mach. Learn. Res., vol. 23, no. 1, Jan. 2022

work page 2022

[19] [20]

GShard: Scaling giant models with conditional computation and automatic sharding,

D. Lepikhin, H. Lee, Y . Xu, D. Chen, O. Firat, Y . Huang, M. Krikun, N. Shazeer, and Z. Chen, “GShard: Scaling giant models with conditional computation and automatic sharding,” in9th International Conference on Learning Representations, ICLR 2021, Virtual Event, Austria, May 3-7, 2021. OpenReview.net, 2021. [Online]. Available: https://openreview.net/fo...

work page 2021

[20] [21]

FasterMoE: modeling and optimizing training of large- scale dynamic pre-trained models,

J. He, J. Zhai, T. Antunes, H. Wang, F. Luo, S. Shi, and Q. Li, “FasterMoE: modeling and optimizing training of large- scale dynamic pre-trained models,” inProceedings of the 27th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, ser. PPoPP ’22. New York, NY , USA: Association for Computing Machinery, 2022, p. 120–134. [Online]. Av...

work page doi:10.1145/3503221.3508418 2022

[21] [22]

Three Phases of Expert Routing: How Load Balance Evolves During Mixture-of-Experts Training

C. Mouzouni, “Three phases of expert routing: How load balance evolves during mixture-of-experts training,” 2026. [Online]. Available: https://arxiv.org/abs/2604.04230

work page internal anchor Pith review Pith/arXiv arXiv 2026

[22] [23]

Apertus: Democratizing open and compliant llms for global language environments

Project Apertuset al., “Apertus: Democratizing open and compliant llms for global language environments,” 2025. [Online]. Available: https://arxiv.org/abs/2509.14233

work page arXiv 2025

[23] [24]

Pointer sentinel mixture models,

S. Merity, C. Xiong, J. Bradbury, and R. Socher, “Pointer sentinel mixture models,” 2016. [Online]. Available: https://arxiv.org/abs/1609. 07843

work page 2016

[24] [25]

Least-loaded expert parallelism: Load balancing an imbalanced mixture-of-experts,

X.-P. Nguyen, S. Pandit, A. Xu, C. Xiong, and S. Joty, “Least-loaded expert parallelism: Load balancing an imbalanced mixture-of-experts,”

work page

[25] [26]

Available: https://arxiv.org/abs/2601.17111

[Online]. Available: https://arxiv.org/abs/2601.17111

work page arXiv

[26] [27]

MoETuner: Optimized mixture of expert serving with balanced expert placement and token routing,

S. Go and D. Mahajan, “MoETuner: Optimized mixture of expert serving with balanced expert placement and token routing,” 2025. [Online]. Available: https://arxiv.org/abs/2502.06643

work page arXiv 2025

[27] [28]

On the spatial structure of mixture-of-experts in transformers,

D. Bershatsky and I. Oseledets, “On the spatial structure of mixture-of-experts in transformers,” 2025. [Online]. Available: https: //arxiv.org/abs/2504.04444

work page arXiv 2025

[28] [29]

Deepseek-v4: Towards highly efficient million- token context intelligence,

DeepSeek-AI, “Deepseek-v4: Towards highly efficient million- token context intelligence,” DeepSeek-AI, Tech. Rep., 2026, technical report accompanying DeepSeek-V4-Pro model. [Online]. Available: https://huggingface.co/deepseek-ai/DeepSeek-V4-Pro/blob/ main/DeepSeek V4.pdf

work page 2026

[29] [30]

Geometric Routing Enables Causal Expert Control in Mixture of Experts

I. Ternovtsii and Y . Bilak, “Geometric routing enables causal expert control in mixture of experts,” 2026. [Online]. Available: https://arxiv.org/abs/2604.14434

work page internal anchor Pith review Pith/arXiv arXiv 2026

[30] [31]

Hash layers for large sparse models,

S. Roller, S. Sukhbaatar, A. Szlam, and J. Weston, “Hash layers for large sparse models,” inProceedings of the 35th International Conference on Neural Information Processing Systems, ser. NIPS ’21. Red Hook, NY , USA: Curran Associates Inc., 2021

work page 2021

[31] [32]

Patterns behind Chaos: Forecasting Data Movement for Efficient Large-Scale MoE LLM Inference

Z. Yu, Y . Guan, Z. Yu, C. Zhou, Z. Hu, S. Pei, Y . Kang, Y . Ding, and P.-A. Tsai, “Patterns behind chaos: Forecasting data movement for efficient large-scale moe llm inference,” 2026. [Online]. Available: https://arxiv.org/abs/2510.05497

work page internal anchor Pith review Pith/arXiv arXiv 2026

[32] [33]

Advancing moe efficiency: A collaboration-constrained routing (c2r) strategy for better expert parallelism design,

M. Zhang, P. Li, J. Peng, M. Qiu, and T. Chen, “Advancing moe efficiency: A collaboration-constrained routing (c2r) strategy for better expert parallelism design,” 2025. [Online]. Available: https://arxiv.org/abs/2504.01337

work page arXiv 2025

[33] [34]

Every activation boosted: Scaling general reasoner to 1 trillion open language foundation,

Ling Teamet al., “Every activation boosted: Scaling general reasoner to 1 trillion open language foundation,” Tech. Rep., 2025. [Online]. Available: https://arxiv.org/abs/2510.22115

work page arXiv 2025

[34] [35]

Accelerating distributed moe training and inference with lina,

J. Li, Y . Jiang, Y . Zhu, C. Wang, and H. Xu, “Accelerating distributed moe training and inference with lina,” 2024. [Online]. Available: https://arxiv.org/abs/2210.17223

work page arXiv 2024

[35] [36]

Janus: A unified distributed training framework for sparse mixture-of-experts models,

J. Liu, J. H. Wang, and Y . Jiang, “Janus: A unified distributed training framework for sparse mixture-of-experts models,” inProceedings of the ACM SIGCOMM 2023 Conference, ser. ACM SIGCOMM ’23. New York, NY , USA: Association for Computing Machinery, 2023, p. 486–498. [Online]. Available: https://doi.org/10.1145/3603269.3604869

work page doi:10.1145/3603269.3604869 2023

[36] [37]

Towards a standardized representation for deep learning collective algorithms,

J. Yoo, W. Won, M. Cowan, N. Jiang, B. Klenk, S. Sridharan, and T. Krishna, “Towards a standardized representation for deep learning collective algorithms,” in2024 IEEE Symposium on High-Performance Interconnects (HOTI). IEEE, Aug. 2024, p. 33–36. [Online]. Available: http://dx.doi.org/10.1109/HOTI63208.2024.00017

work page doi:10.1109/hoti63208.2024.00017 2024

[37] [38]

Characterizing communication in distributed parameter-efficient fine- tuning for large language models,

N. Alnaasan, H.-R. Huang, A. Shafi, H. Subramoni, and D. K. Panda, “Characterizing communication in distributed parameter-efficient fine- tuning for large language models,” in2024 IEEE Symposium on High- Performance Interconnects (HOTI), 2024, pp. 11–19

work page 2024

[38] [39]

Multilingual routing in mixture-of-experts,

L. Bandarkar, C. Yang, M. Fayyaz, J. Hu, and N. Peng, “Multilingual routing in mixture-of-experts,” 2026. [Online]. Available: https: //arxiv.org/abs/2510.04694

work page arXiv 2026

[39] [40]

ST-MoE: Designing Stable and Transferable Sparse Expert Models

B. Zoph, I. Bello, S. Kumar, N. Du, Y . Huang, J. Dean, N. Shazeer, and W. Fedus, “ST-MoE: Designing stable and transferable sparse expert models,” 2022. [Online]. Available: https://arxiv.org/abs/2202.08906

work page internal anchor Pith review Pith/arXiv arXiv 2022

[40] [41]

Mixture-of-experts with expert choice routing,

Y . Zhou, T. Lei, H. Liu, N. Du, Y . Huang, V . Y . Zhao, A. Dai, Z. Chen, Q. Le, and J. Laudon, “Mixture-of-experts with expert choice routing,” in Proceedings of the 36th International Conference on Neural Information Processing Systems, ser. NIPS ’22. Red Hook, NY , USA: Curran Associates Inc., 2022

work page 2022

[41] [42]

EPLB: Expert parallelism load balancer,

DeepSeek-AI, “EPLB: Expert parallelism load balancer,” Tech. Rep., 2025, GitHub repository. [Online]. Available: https://github.com/ deepseek-ai/eplb

work page 2025