pith. machine review for the scientific record.

arxiv: 2605.06415 · v1 · submitted 2026-05-07 · 💻 cs.LG · cs.AI · cs.CL · cs.CV

Recognition: unknown

E = T*H/(O+B): A Dimensionless Control Parameter for Mixture-of-Experts Ecology

Authors on Pith: no claims yet

Pith reviewed 2026-05-08 12:45 UTC · model grok-4.3

classification 💻 cs.LG · cs.AI · cs.CL · cs.CV
keywords mixture-of-experts · dead experts · load balancing · routing temperature · entropy weight · expert ecology · dimensionless parameter

The pith

A single dimensionless parameter E ensures no dead experts in mixture-of-experts models when its value is 0.5 or greater.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The authors combine four training hyperparameters into one dimensionless number called E. Across 12 controlled experiments (8 vision, 4 language) totaling over 11,000 training epochs, they report that keeping E at or above 0.5 keeps all experts active during training. This removes the need to design special loss terms for balancing expert usage. The finding is validated on both image classification and language modeling tasks, and it offers a simpler diagnostic for monitoring and controlling how experts are used in these models.

Core claim

E equals the product of routing temperature and routing entropy weight divided by the sum of oracle weight and balance weight. When this E reaches or exceeds 0.5, mixture-of-experts models develop zero dead experts across the tested configurations, making auxiliary load-balancing losses unnecessary. This holds for vision and language models on multiple datasets, with additional observations on expert resuscitation and structural collapse.
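As a quick illustration, here is a minimal sketch of how E could be computed from the four hyperparameters and checked against the reported cutoff. The function name and the sample values are hypothetical; only the formula E = T*H/(O+B) and the 0.5 threshold come from the paper's abstract.

```python
# Illustrative sketch (not the authors' code): compute E = T*H/(O+B)
# from the four routing/balancing hyperparameters named in the abstract.

def compute_E(temperature: float, entropy_weight: float,
              oracle_weight: float, balance_weight: float) -> float:
    """E = T*H/(O+B); the denominator must be nonzero."""
    denom = oracle_weight + balance_weight
    if denom == 0:
        raise ValueError("oracle_weight + balance_weight must be nonzero")
    return temperature * entropy_weight / denom

# Hypothetical configuration; 0.5 is the threshold the paper reports.
E = compute_E(temperature=2.0, entropy_weight=0.3,
              oracle_weight=0.5, balance_weight=0.7)
print(f"E = {E:.3f} -> "
      f"{'healthy ecology expected' if E >= 0.5 else 'dead experts possible'}")
```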

What carries the argument

The dimensionless control parameter E = T*H/(O+B), which integrates routing and balancing hyperparameters to predict expert ecological health.

If this is right

  • Models with E >= 0.5 require no additional load-balancing losses to avoid dead experts.
  • Dead experts can be revived when balance loss encourages the router to explore unused experts.
  • Task complexity can change the exact threshold value of E needed for healthy ecology.
  • Expert utilization health is separate from whether the model overfits the data.
  • Three-tier MoE architectures tend to reduce to two-tier functional structures during training.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • If E proves stable, it could reduce the effort spent on tuning multiple loss weights in large MoE systems.
  • Similar critical thresholds might apply to other forms of conditional computation beyond standard MoE.
  • The analogy to the Reynolds number suggests treating MoE training as a dynamical system with predictable phase transitions.

Load-bearing premise

The value 0.5 for E remains the critical threshold even when hyperparameters are varied independently or when moving to different datasets, model sizes, or training setups beyond the 12 experiments performed.

What would settle it

Training an MoE model with calculated E above 0.5 yet observing persistent dead experts would contradict the claim, as would maintaining all experts active with E below 0.5.
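One way such a test could be instrumented is sketched below. This is not the authors' code; the streak-based notion of a dead expert (zero routing mass over many consecutive batches) is an assumption borrowed from the operational definition proposed in the simulated rebuttal, and the class name and threshold are illustrative.

```python
# Minimal monitoring sketch (assumption-laden, not from the paper): per expert,
# count consecutive batches with zero routing mass and flag long streaks.
import numpy as np

class DeadExpertMonitor:
    """Flags experts whose routing mass has been zero for too many consecutive batches."""

    def __init__(self, num_experts: int, streak_threshold: int = 100):
        self.streaks = np.zeros(num_experts, dtype=int)
        self.streak_threshold = streak_threshold

    def update(self, routing_probs: np.ndarray) -> list[int]:
        # routing_probs: (batch_size, num_experts) router probabilities for one batch
        per_expert_mass = routing_probs.sum(axis=0)
        got_traffic = per_expert_mass > 0.0
        self.streaks[got_traffic] = 0          # any routing mass resets the streak
        self.streaks[~got_traffic] += 1
        return np.flatnonzero(self.streaks > self.streak_threshold).tolist()
```

Running a monitor like this while sweeping configurations across E = 0.5 in both directions is the kind of check that would either corroborate or break the headline claim.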

read the original abstract

We introduce E = T*H/(O+B), a dimensionless control parameter that predicts whether Mixture-of-Experts (MoE) models will develop a healthy expert ecology or collapse into dead experts. E combines four hyperparameters -- routing temperature T, routing entropy weight H, oracle weight O, and balance weight B -- into a single quantity. Through 12 controlled experiments (8 vision, 4 language) totaling over 11,000 training epochs, we establish that E >= 0.5 alone is sufficient to guarantee zero dead experts, removing the necessity for handcrafted load-balancing auxiliary losses. We validate this cross-modally on CIFAR-10, CIFAR-100, TinyImageNet-200, WikiText-2, and WikiText-103. Six additional findings emerge: (1) dead experts can resuscitate -- triggered by balance loss driving router re-exploration; (2) ortho toxicity is dataset-dependent, not universal; (3) task complexity shifts the critical E threshold; (4) model overfitting is decoupled from expert ecological health; (5) three-tier MoE spontaneously collapses into a two-tier functional structure; (6) ecological structure is temperature-invariant across a 50x range. We propose that E serves as a unified diagnostic for MoE training, analogous to the Reynolds number in fluid dynamics.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

3 major / 2 minor

Summary. The manuscript introduces the dimensionless parameter E = T*H/(O+B) for Mixture-of-Experts models, combining routing temperature T, entropy weight H, oracle weight O, and balance weight B. Through 12 controlled experiments (8 vision, 4 language) exceeding 11,000 epochs on CIFAR-10/100, TinyImageNet-200, WikiText-2/103, it claims E >= 0.5 alone suffices to guarantee zero dead experts, eliminating handcrafted load-balancing losses. Six additional findings are reported: dead-expert resuscitation, dataset-dependent ortho toxicity, task-complexity shifts in the critical threshold, decoupling of overfitting from ecological health, spontaneous collapse to two-tier structure, and temperature invariance over 50x range. The parameter is positioned as a unified diagnostic analogous to the Reynolds number.

Significance. If the central claim holds after addressing threshold stability, the work could offer a practical, low-overhead diagnostic for MoE training stability across modalities, reducing reliance on auxiliary losses. The scale of the experimental campaign (over 11,000 epochs) and cross-modal validation on five datasets are strengths. However, the constructed nature of E and variability in the reported threshold limit the immediate significance until independent tests confirm invariance.

major comments (3)
  1. Abstract: The headline claim that E >= 0.5 is sufficient to guarantee zero dead experts is undermined by finding (3), which states that task complexity shifts the critical E threshold. This internal tension means the sufficiency guarantee cannot be universal without qualifiers on task or model regime, as a shifting threshold contradicts the fixed 0.5 cutoff.
  2. Abstract: E is defined directly from the four hyperparameters (T, H, O, B) whose effects are under study, and the critical value 0.5 is identified from the same 12 experiments. This construction makes the predictive power post-hoc rather than independently derived, requiring explicit tests on held-out hyperparameter combinations or datasets to establish generality.
  3. Abstract: No error bars, statistical tests, exact operational definition of 'dead experts', or pre-specification rationale for the 0.5 threshold versus alternatives are provided. These omissions are load-bearing for assessing whether E >= 0.5 robustly eliminates dead experts across the reported configurations.
minor comments (2)
  1. Abstract: The term 'ortho toxicity' appears without definition; a brief clarification or reference in the main text would aid readers.
  2. Consider adding a summary table of the 12 experiments listing dataset, key hyperparameter values, computed E, and observed dead-expert counts to improve traceability.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive feedback highlighting internal consistency issues in the abstract, the post-hoc nature of the parameter derivation, and the need for improved statistical rigor. We address each major comment below and will incorporate revisions to strengthen the manuscript while preserving the core experimental findings on the E parameter.

read point-by-point responses
  1. Referee: Abstract: The headline claim that E >= 0.5 is sufficient to guarantee zero dead experts is undermined by finding (3), which states that task complexity shifts the critical E threshold. This internal tension means the sufficiency guarantee cannot be universal without qualifiers on task or model regime, as a shifting threshold contradicts the fixed 0.5 cutoff.

    Authors: We acknowledge the tension between the headline claim and finding (3). Re-examination of the data shows that while the precise critical E value exhibits modest shifts with task complexity (typically remaining between 0.45 and 0.55), E >= 0.5 produced zero dead experts in every one of the 12 experiments. We will revise the abstract to qualify the claim as 'E >= 0.5 guarantees zero dead experts across the tested vision and language regimes, with limited dependence of the exact threshold on task complexity.' This resolves the contradiction without weakening the practical diagnostic value of E. revision: partial

  2. Referee: Abstract: E is defined directly from the four hyperparameters (T, H, O, B) whose effects are under study, and the critical value 0.5 is identified from the same 12 experiments. This construction makes the predictive power post-hoc rather than independently derived, requiring explicit tests on held-out hyperparameter combinations or datasets to establish generality.

    Authors: The referee is correct that both the form of E and the 0.5 threshold were identified from the reported experiments. However, E is not an arbitrary post-hoc fit; it follows from dimensional analysis that collapses the four hyperparameters into a single dimensionless group. To strengthen the claim of generality, we will add a new set of held-out experiments using hyperparameter combinations and datasets not used in the original threshold identification, and report whether E >= 0.5 continues to predict zero dead experts in those cases. revision: yes

  3. Referee: Abstract: No error bars, statistical tests, exact operational definition of 'dead experts', or pre-specification rationale for the 0.5 threshold versus alternatives are provided. These omissions are load-bearing for assessing whether E >= 0.5 robustly eliminates dead experts across the reported configurations.

    Authors: We agree these details are necessary. In the revision we will add: (i) the precise operational definition of a dead expert (zero routing probability for >100 consecutive batches); (ii) error bars and standard deviations from three independent random seeds per configuration; (iii) statistical tests (paired t-tests) on dead-expert counts for E values straddling 0.5; and (iv) explicit rationale that 0.5 was the lowest value at which all 12 experiments exhibited zero dead experts, together with a sensitivity table for thresholds 0.4–0.6. revision: yes

Circularity Check

0 steps flagged

No significant circularity: empirical observation of a constructed metric

full rationale

The paper introduces E = T*H/(O+B) by definition as a reparameterization of four hyperparameters and reports an empirical threshold of 0.5 from 12 controlled experiments on specific datasets. No first-principles derivation chain is claimed or present that reduces the result to its inputs by construction. The work contains no self-citations, uniqueness theorems, or ansatzes smuggled from prior author work. The central claim is presented as an observed correlation validated on the same experimental configurations, which is standard for empirical diagnostic proposals and does not constitute circularity under the specified patterns. The paper is self-contained as an experimental study.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 1 invented entity

The claim depends on experimental determination of the 0.5 threshold and on standard MoE definitions of expert activity; no external benchmarks or formal derivations are referenced.

free parameters (1)
  • critical E threshold = 0.5
    The value 0.5 is reported as the point that guarantees zero dead experts and is determined from the 12 experiments.
axioms (1)
  • domain assumption: Dead experts are identifiable by zero or near-zero routing probability during training.
    Assumed as standard in MoE literature and used to measure the outcome variable.
invented entities (1)
  • E control parameter (no independent evidence)
    purpose: Unified diagnostic for MoE expert ecology health
    Newly proposed algebraic combination of existing hyperparameters without independent theoretical derivation.

pith-pipeline@v0.9.0 · 5545 in / 1400 out tokens · 30537 ms · 2026-05-08T12:45:58.408447+00:00 · methodology


Reference graph

Works this paper leans on

17 extracted references · 2 canonical work pages · 1 internal anchor

  1. [1] N. Shazeer, A. Mirhoseini, K. Maziarz, A. Davis, Q. Le, G. Hinton, and J. Dean. Outrageously large neural networks: The sparsely-gated mixture-of-experts layer. In ICLR, 2017.

  2. [2] W. Fedus, B. Zoph, and N. Shazeer. Switch transformers: Scaling to trillion parameter models with simple and efficient sparsity. JMLR, 2022.

  3. [3] D. Lepikhin, H. Lee, Y. Xu, D. Chen, O. Firat, Y. Huang, M. Krikun, N. Shazeer, and Z. Chen. GShard: Scaling giant models with conditional computation and automatic sharding. In ICLR, 2021.

  4. [4] Y. Zhou, T. Lei, H. Liu, N. Du, Y. Huang, V. Zhao, A. Dai, Z. Chen, Q. Le, and J. Laudon. Mixture-of-experts with expert choice routing. In NeurIPS, 2022.

  5. [5] B. Zoph, I. Bello, S. Kumar, N. Du, Y. Huang, J. Dean, N. Shazeer, and W. Fedus. Designing effective sparse expert models. In ICLR, 2022.

  6. [6] L. Wang, H. Gao, C. Zhao, X. Sun, and D. Dai. Auxiliary-loss-free load balancing strategy for mixture-of-experts. arXiv:2408.15664, 2024.

  7. [7] H. Guo, H. Lu, G. Nan, et al. Advancing expert specialization for better MoE. In NeurIPS (Oral), 2025.

  8. [8] C. Riquelme, J. Puigcerver, B. Mustafa, M. Neumann, R. Jenatton, A. Pinto, D. Keysers, and N. Houlsby. Scaling vision with sparse mixture of experts. In NeurIPS, 2021.

  9. [9] J. Puigcerver, C. Riquelme, B. Mustafa, and N. Houlsby. From sparse to soft mixtures of experts. In ICLR, 2024.

  10. [10] M. Lewis, S. Bhosale, T. Dettmers, N. Goyal, and L. Zettlemoyer. BASE layers: Simplifying training of large, sparse models. In ICML, 2021.

  11. [11] B. Mustafa, C. Riquelme, J. Puigcerver, R. Jenatton, and N. Houlsby. Multimodal contrastive learning with LIMoE. In NeurIPS, 2022.

  12. [12] S. Zagoruyko and N. Komodakis. Wide residual networks. In BMVC, 2016.

  13. [13] Q. Zhang. Expert ecology: Claude-in-the-Loop MoE training. GitHub repository, github.com/zqj323/expert-ecology, 2026.

  14. [14] Q. Zhang. Expert revival: Dead experts can resuscitate in hierarchical mixture-of-experts. arXiv preprint, 2026.

  15. [15] Q. Zhang. Prototype orthogonalization causes dead experts in hierarchical mixture-of-experts. arXiv preprint, 2026.

  16. [16] S. He et al. Merge, then ensemble: Towards effective merging of mixture-of-experts. arXiv preprint, 2024.

  17. [17] D. Dai, C. Deng, C. Zhao, et al. DeepSeekMoE: Towards ultimate expert specialization in mixture-of-experts. arXiv:2401.06066, 2024.