Expert Routing for Communication-Efficient MoE via Finite Expert Banks
1 Pith paper cites this work.
abstract
Resource-efficient machine learning increasingly uses sparse Mixture-of-Experts (MoE) architectures, where the gate acts as both a learning component and a routing interface controlling computation, communication, and accuracy. Motivated by finite-rate interpretations of MoE gating, we treat the gate as a stochastic channel and use $I(X;T)$ to quantify the routing information available to the selected expert. To make the associated information quantities tractable beyond synthetic examples, we develop a finite-bank MNIST construction using pretrained CNN experts and a discrete, data-dependent selection rule. Since the selected model belongs to a finite candidate set, the algorithmic mutual information $I(S;W)$ admits a closed-form discrete-entropy estimator from the empirical posterior $q(W|S)$. Sweeping a data-dependence parameter $\alpha$, we observe that $\widehat I(S;W)$ monotonically tracks the generalization gap, while the Xu-Raginsky bound exhibits the expected looseness. We also compare with a uniform union-bound baseline and introduce an empirical estimator of $I(X;T)$ together with a Blahut-Arimoto procedure for tracing an accuracy-rate curve over the expert bank. The proposed framework provides a practical tool for analyzing resource-aware MoE inference systems and for interpreting $I(X;T)$ and $D(R_g)$ as design proxies for efficient expert routing.
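As an illustration of why a finite candidate set makes $I(S;W)$ tractable, the following is a minimal sketch of a plug-in estimator based on the identity $I(S;W) = H(W) - H(W|S)$. It is not the authors' implementation: the function names, the array layout (one row of $q(W|S)$ per sampled training set), and the uniform weighting of training sets are all assumptions made for this example.

```python
import numpy as np

def entropy_bits(p):
    """Shannon entropy in bits of a probability vector, with 0 log 0 := 0."""
    p = np.asarray(p, dtype=float)
    nz = p > 0
    return float(-np.sum(p[nz] * np.log2(p[nz])))

def mi_from_posteriors(q_w_given_s):
    """Plug-in estimate of I(S;W) = H(W) - H(W|S) in bits.

    q_w_given_s: array of shape (n_datasets, n_experts); row i is the
    empirical posterior q(W | S = s_i) and must sum to 1.
    Assumption: the sampled datasets s_i are weighted uniformly.
    """
    q = np.asarray(q_w_given_s, dtype=float)
    marginal = q.mean(axis=0)  # q(W), averaged over datasets
    conditional = np.mean([entropy_bits(row) for row in q])  # H(W|S)
    return entropy_bits(marginal) - conditional

# Fully data-dependent selection: each dataset picks its own expert.
print(mi_from_posteriors(np.eye(4)))             # 2.0
# Data-independent selection: every dataset yields the same uniform posterior.
print(mi_from_posteriors(np.full((4, 4), 0.25)))  # 0.0
```

The two toy inputs bracket the range for a bank of four experts: deterministic, dataset-specific selection gives $\log_2 4 = 2$ bits, while selection that ignores the data gives zero. A data-dependence parameter such as $\alpha$ would presumably interpolate between these extremes.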
fields: cs.IT (1)
years: 2026 (1)
verdicts: UNVERDICTED (1)
representative citing papers
Sparse In-Network Learning via Shortest-Path Backpropagation and Finite-Rate Gating
D-INL reduces training exchange by 70.4% while keeping accuracy within standard deviation of dense INL, with finite-rate regularization cutting estimated latent rate by 45.7% in a distributed classification experiment.