Sparsity is Combinatorial Depth: Quantifying MoE Expressivity via Tropical Geometry

Huayi Tang; Ye Su; Yong Liu; Zixuan Gong

arxiv: 2602.03204 · v2 · submitted 2026-02-03 · 💻 cs.LG

Sparsity is Combinatorial Depth: Quantifying MoE Expressivity via Tropical Geometry

Ye Su , Huayi Tang , Zixuan Gong , Yong Liu This is my paper

Pith reviewed 2026-05-16 07:44 UTC · model grok-4.3

classification 💻 cs.LG

keywords Mixture-of-ExpertsTropical GeometryTop-k RoutingCombinatorial DepthHypersimplexManifold HypothesisSparse NetworksEffective Capacity

0 comments

The pith

Top-k routing in Mixture-of-Experts is algebraically identical to the k-th elementary symmetric tropical polynomial, turning sparsity into combinatorial depth that multiplies geometric capacity by the binomial coefficient binom(N,k).

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper applies tropical geometry to Mixture-of-Experts networks and shows that the Top-k routing function is exactly the same as the k-th elementary symmetric tropical polynomial. This equivalence carves the input space into the normal fan of a hypersimplex, so the number of active experts directly sets a combinatorial multiplier on the model's capacity. Under the manifold hypothesis the authors prove that MoE keeps high expressivity on low-dimensional data where dense networks lose capacity, a property they term combinatorial resilience that rests on the transversality of routing cones. They further derive asymptotic limits on how fine expert granularity can be and show that shared experts are required to stop routing collapse.

Core claim

The Top-k routing mechanism is algebraically isomorphic to the k-th elementary symmetric tropical polynomial. This isomorphism partitions the input space into the Normal Fan of a Hypersimplex, revealing that sparsity is combinatorial depth which scales geometric capacity by the binomial coefficient binom(N,k). Moving beyond ambient bounds, the authors introduce Effective Capacity under the Manifold Hypothesis and prove that MoE architectures exhibit Combinatorial Resilience on low-dimensional data while dense networks suffer capacity collapse; they also show that shared experts are geometrically necessary to prevent routing collapse.

What carries the argument

The algebraic isomorphism between Top-k routing and the k-th elementary symmetric tropical polynomial, which partitions the input space according to the normal fan of a hypersimplex.

If this is right

MoE networks achieve strictly higher effective capacity than dense networks of equal size when data lie on low-dimensional manifolds.
There exist asymptotic upper bounds on useful expert granularity that follow directly from the binomial scaling of combinatorial depth.
Shared experts must be included in the architecture to avoid geometric routing collapse.
Combinatorial resilience allows sparse MoE models to retain expressivity that dense models lose on manifold data.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Routing policies could be designed by choosing different symmetric tropical polynomials to target specific capacity multipliers.
The same tropical isomorphism may apply to other forms of conditional computation beyond Top-k gating.
If real high-dimensional datasets violate the transversality condition, the predicted resilience could degrade sharply and would need empirical checks on actual training manifolds.

Load-bearing premise

The training data lies on a low-dimensional manifold and the routing cones remain transverse so that the combinatorial capacity scaling is actually realized.

What would settle it

Train MoE and dense networks of matched parameter count on points sampled from an explicitly known low-dimensional manifold, then measure whether the observed effective capacity ratio between the two architectures matches the predicted binomial factor binom(N,k).

Figures

Figures reproduced from arXiv: 2602.03204 by Huayi Tang, Ye Su, Yong Liu, Zixuan Gong.

**Figure 1.** Figure 1: Recursive spatial partitioning in a 2-layer MLP (d = 2). (Left) Layer 1 (n = 3 neurons) induces Nmlp1 = 7 linear regions via Definition 3.2. (Middle) Structural space-folding: the ReLU activation enables subsequent hyperplanes to intersect the pre-activated regions. (Right) Layer 2 (n = 3) achieves a multiplicative expansion, yielding a theoretical maximum of Nmlp2 = 49 regions. Definition 3.5 (Tropical Po… view at source ↗

**Figure 2.** Figure 2: Schematic of a Top-k Mixture-of-Experts Layer. The architecture separates routing logic from data processing. The router selects a sparse subset of experts (Active Set Ix), ensuring that inactive experts (grey, dashed) consume no computational resources. The final output is the weighted sum of the active experts. as the sparse weighted sum of the expert outputs [Su and Liu, 2026]: y = X N i=1 G(x)iEi(x), w… view at source ↗

**Figure 3.** Figure 3: Geometric interpretation of the Top-1 MoE router. (Left) The router partitions the input space R 2 into a Voronoi diagram, where each convex polyhedral cell Ωi corresponds to the active domain of Expert i. (Right) The maximum logit score function Stop1(x) = maxk zk(x) constitutes a piecewise-linear tropical hypersurface. The singular locus of this hypersurface (its non-differentiable creases) orthogonally … view at source ↗

**Figure 4.** Figure 4: Geometric Interpretation of Top-k Routing. Visualized for an MoE layer with N = 3 experts and k = 2 active selections. (a) The activation region for a specific expert coalition (e.g., Ω{1,2}) is a convex polyhedral cone, defined by the intersection of half-spaces where z1 > z3 and z2 > z3. (b) The total support region for a single expert (e.g., Expert 1) spans multiple adjacent cones (Ω{1,2} ∪ Ω{1,3}). Thi… view at source ↗

**Figure 5.** Figure 5: Geometric impact of General Position on Capacity. (a) In general position (din = 2, n = 3), hyperplanes form a closed cell (red shadow, Ω1), maximizing the number of regions (Φ(3, 2) = 7). (b & c) Degenerate arrangements fail to enclose this maximal cell structure, reducing the combinatorial capacity. In MoE routers, these degeneracies cause a direct collapse of the Tropical Hypersurface. Example 4.6 (Top-… view at source ↗

**Figure 6.** Figure 6: Geometric Regularization via Affine Carrier Isolation. (a) Angular Collapse: When the input distribution is uncentered (∥µ∥ ≫ Var(x)), the manifold M subtends a vanishingly small solid angle. This leads to Tropical Saturation, a failure mode where the manifold is contained within the interior of a single Weyl chamber, rendering routing decisions locally constant. (b) Affine-Residual Reparameterization: The… view at source ↗

**Figure 7.** Figure 7: Geometric Capacity via Duality and Activation Patterns. (Left) The Primal input space R d is partitioned into convex polyhedral cells by a hyperplane arrangement A = {H1, H2, H3}, where each hyperplane corresponds to the decision boundary of a ReLU neuron. A specific linear region Rv is defined by its activation pattern s ∈ {0, 1} 3 , representing the binary state of all neurons. For the shaded region Rv, … view at source ↗

read the original abstract

While Mixture-of-Experts (MoE) architectures define the state-of-the-art, their theoretical success is often attributed to heuristic efficiency rather than geometric expressivity. In this work, we present the first analysis of MoE through the lens of tropical geometry, establishing that the Top-$k$ routing mechanism is algebraically isomorphic to the $k$-th elementary symmetric tropical polynomial. This isomorphism partitions the input space into the Normal Fan of a Hypersimplex, revealing that \textbf{sparsity is combinatorial depth} which scales geometric capacity by the binomial coefficient $\binom{N}{k}$. Moving beyond ambient bounds, we introduce the concept of \textit{Effective Capacity} under the Manifold Hypothesis. We prove that while dense networks suffer from capacity collapse on low-dimensional data, MoE architectures exhibit \textit{Combinatorial Resilience}, maintaining high expressivity via the transversality of routing cones. Translating these theoretical bounds into architectural principles, we derive asymptotic capacity limits for optimal expert granularity and prove that shared experts are geometrically necessary to prevent routing collapse.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

The paper maps Top-k MoE routing to the k-th tropical elementary symmetric polynomial and claims this gives binomial capacity scaling, but the manifold resilience part rests on unverified transversality.

read the letter

The central claim is that Top-k routing is algebraically the same as the k-th elementary symmetric tropical polynomial, so the input space gets partitioned by the normal fan of a hypersimplex and the number of distinct linear regions scales with binom(N,k) rather than just N. That combinatorial interpretation of sparsity is the clearest new piece. It also introduces Effective Capacity and Combinatorial Resilience to argue that MoE keeps expressivity on low-dimensional manifolds where dense layers lose capacity. The geometric framing is clean and could explain why sparse routing has been empirically strong. The derivations for the isomorphism and the asymptotic limits on expert granularity look like they could be useful if they hold up. The main weakness is that Combinatorial Resilience is asserted under the Manifold Hypothesis plus transversality of the routing cones, yet the text gives no argument or check that real data manifolds (token embeddings, image patches) actually meet the general-position conditions needed to avoid cone overlaps or degeneracies. Without that, the binomial scaling may not survive in practice. There are also no visible error bounds, toy examples, or empirical checks in the provided material, so the architectural conclusions about shared experts and optimal granularity remain provisional. This is for readers who already work on geometric or algebraic views of neural nets and want a new angle on MoE design. It is worth sending to peer review because the tropical mapping itself is a concrete, checkable claim even if the manifold part needs tightening.

Referee Report

3 major / 2 minor

Summary. The manuscript analyzes Mixture-of-Experts (MoE) models via tropical geometry, claiming that Top-k routing is algebraically isomorphic to the k-th elementary symmetric tropical polynomial. This isomorphism partitions the input space according to the normal fan of a hypersimplex, so that sparsity functions as combinatorial depth and scales geometric capacity by the binomial coefficient binom(N,k). The paper introduces Effective Capacity under the Manifold Hypothesis, proves that MoE exhibits Combinatorial Resilience (maintaining expressivity via transverse routing cones) while dense networks suffer capacity collapse on low-dimensional data, and derives asymptotic limits on expert granularity together with the geometric necessity of shared experts.

Significance. If the central isomorphism and transversality arguments hold, the work supplies a geometric explanation for the empirical advantages of sparse MoE architectures and yields concrete architectural prescriptions. The tropical framing is novel for this setting and the distinction between ambient and effective capacity directly addresses a limitation of prior expressivity bounds.

major comments (3)

[Manifold Hypothesis and Combinatorial Resilience section] The binom(N,k) scaling of distinct linear regions is asserted to follow from the hypersimplex normal fan, yet this count requires the routing cones to remain transverse when restricted to the data manifold. The manuscript invokes the Manifold Hypothesis to claim Combinatorial Resilience but supplies no argument or empirical check that typical high-dimensional training manifolds (e.g., image patches or token embeddings) satisfy the general-position conditions needed to avoid degeneracies that would reduce the number of regions below the combinatorial maximum.
[Tropical isomorphism derivation] The algebraic isomorphism between Top-k routing and the k-th elementary symmetric tropical polynomial is load-bearing for the entire capacity claim. The derivation must be shown to preserve the fan structure under the specific piecewise-linear routing function used in practice; any deviation in the tropical semiring operations or in the handling of ties would invalidate the hypersimplex identification.
[Architectural principles and shared-expert necessity] The necessity of shared experts is derived from a routing-collapse argument. The proof should quantify the measure of the degenerate set on which collapse occurs and demonstrate that this set has positive measure under standard initialization and training dynamics; otherwise the architectural recommendation rests on a measure-zero phenomenon.

minor comments (2)

[Preliminaries] Notation for the tropical polynomial and the hypersimplex fan should be introduced with explicit reference to standard tropical geometry texts to aid readers unfamiliar with the semiring.
[Effective Capacity definition] The abstract states that dense networks suffer capacity collapse on low-dimensional data; a brief comparison table contrasting the effective-capacity formulas for dense versus MoE networks would clarify the quantitative difference.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the insightful and constructive feedback, which highlights important points for strengthening the rigor of our tropical analysis of MoE expressivity. We address each major comment below, indicating the revisions planned for the next version of the manuscript.

read point-by-point responses

Referee: [Manifold Hypothesis and Combinatorial Resilience section] The binom(N,k) scaling of distinct linear regions is asserted to follow from the hypersimplex normal fan, yet this count requires the routing cones to remain transverse when restricted to the data manifold. The manuscript invokes the Manifold Hypothesis to claim Combinatorial Resilience but supplies no argument or empirical check that typical high-dimensional training manifolds (e.g., image patches or token embeddings) satisfy the general-position conditions needed to avoid degeneracies that would reduce the number of regions below the combinatorial maximum.

Authors: We agree that an explicit justification for transversality on the data manifold is required to support the full combinatorial scaling. In the revised manuscript we will insert a new lemma (Lemma 4.2) proving that, for a d-dimensional manifold embedded in general position within R^n, the intersection with the hypersimplex normal fan realizes the complete binom(N,k) regions almost everywhere; the degeneracy locus is a lower-dimensional algebraic variety of measure zero. We will also add a short empirical illustration on synthetic low-dimensional manifolds to confirm that the observed number of regions matches the combinatorial prediction. revision: yes
Referee: [Tropical isomorphism derivation] The algebraic isomorphism between Top-k routing and the k-th elementary symmetric tropical polynomial is load-bearing for the entire capacity claim. The derivation must be shown to preserve the fan structure under the specific piecewise-linear routing function used in practice; any deviation in the tropical semiring operations or in the handling of ties would invalidate the hypersimplex identification.

Authors: The isomorphism in Theorem 3.1 is obtained by direct substitution of the top-k selection into the tropical elementary symmetric polynomial, where the max-plus operations correspond exactly to the routing logits. We will expand the proof to include an explicit verification that the piecewise-linear routing map (argmax over expert scores) induces the same normal fan as the hypersimplex; ties form a measure-zero set and therefore do not alter the fan partition. A supplementary paragraph clarifying the semiring operations under the practical routing function will be added. revision: yes
Referee: [Architectural principles and shared-expert necessity] The necessity of shared experts is derived from a routing-collapse argument. The proof should quantify the measure of the degenerate set on which collapse occurs and demonstrate that this set has positive measure under standard initialization and training dynamics; otherwise the architectural recommendation rests on a measure-zero phenomenon.

Authors: We acknowledge that the current argument identifies collapse on a lower-codimension set but does not supply an explicit measure. In the revision we will add a proposition bounding the Lebesgue measure of the degenerate routing set under standard Gaussian initialization, showing that this set has positive (though exponentially small) measure. We will also discuss how the presence of shared experts perturbs the dynamics away from this set during training, thereby converting the geometric necessity into a practical architectural guideline. revision: partial

Circularity Check

0 steps flagged

No significant circularity in tropical isomorphism or capacity claims

full rationale

The paper's core chain begins with an algebraic isomorphism between Top-k routing and the k-th elementary symmetric tropical polynomial, which partitions space according to the hypersimplex normal fan and yields the binomial(N,k) count of linear regions as a direct combinatorial consequence. This is a standard result in tropical geometry and does not reduce to any fitted parameter or self-referential definition. Effective Capacity is introduced explicitly under the Manifold Hypothesis as an assumption, and Combinatorial Resilience is derived from the geometric transversality of the resulting cones rather than being presupposed by the result itself. No self-citation chains, ansatz smuggling, or renaming of known empirical patterns occur; the derivation remains independent of the target claims and is grounded in external tropical geometry.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 2 invented entities

The central claims rest on the Manifold Hypothesis plus an unstated assumption that routing cones remain transverse on real data; no free parameters or invented entities are quantified in the abstract.

axioms (1)

domain assumption Manifold Hypothesis: real data lies on a low-dimensional manifold embedded in high-dimensional space
Invoked to contrast dense-network capacity collapse with MoE Combinatorial Resilience

invented entities (2)

Effective Capacity no independent evidence
purpose: Capacity measure that accounts for manifold dimension rather than ambient dimension
Introduced to replace ambient bounds; no independent evidence supplied in abstract
Combinatorial Resilience no independent evidence
purpose: Property that MoE maintains expressivity via transversality of routing cones
New term defined from the tropical partition; no falsifiable handle given

pith-pipeline@v0.9.0 · 5488 in / 1281 out tokens · 21429 ms · 2026-05-16T07:44:26.862791+00:00 · methodology

discussion (0)

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Expressivity of Transformers: A Tropical Geometry Perspective
cs.LG 2026-04 unverdicted novelty 6.0

Self-attention in transformers corresponds exactly to Power Voronoi diagrams under tropical geometry, yielding tight bounds of Theta(N to the power of d_model times L) linear regions.

Reference graph

Works this paper leans on

23 extracted references · 23 canonical work pages · cited by 1 Pith paper · 8 internal anchors

[1]

Quadratic gating functions in mixture of experts: A statistical insight.arXiv preprint arXiv:2410.11222,

Pedram Akbarian, Huy Nguyen, Xing Han, and Nhat Ho. Quadratic gating functions in mixture of experts: A statistical insight.arXiv preprint arXiv:2410.11222,

work page arXiv
[2]

The real tropical geometry of neural networks.arXiv preprint arXiv:2403.11871,

Marie-Charlotte Brandenburg, Georg Loho, and Guido Montúfar. The real tropical geometry of neural networks.arXiv preprint arXiv:2403.11871,

work page arXiv
[3]

A Tropical Approach to Neural Networks with Piecewise Linear Activations

Vasileios Charisopoulos and Petros Maragos. A tropical approach to neural networks with piecewise linear activations.arXiv preprint arXiv:1805.08749,

work page internal anchor Pith review Pith/arXiv arXiv
[4]

Damai Dai, Chengqi Deng, Chenggang Zhao, RX Xu, Huazuo Gao, Deli Chen, Jiashi Li, Wangding Zeng, Xingkai Yu, Yu Wu, et al

doi: 10.1109/PGEC.1965.264137. Damai Dai, Chengqi Deng, Chenggang Zhao, RX Xu, Huazuo Gao, Deli Chen, Jiashi Li, Wangding Zeng, Xingkai Yu, Yu Wu, et al. Deepseekmoe: Towards ultimate expert specialization in mixture- of-experts language models.arXiv preprint arXiv:2401.06066,

work page doi:10.1109/pgec.1965.264137 1965
[5]

Tropnnc: Structured neural network compression using tropical geometry.arXiv preprint arXiv:2409.03945,

Konstantinos Fotopoulos, Petros Maragos, and Panagiotis Misiakos. Tropnnc: Structured neural network compression using tropical geometry.arXiv preprint arXiv:2409.03945,

work page arXiv
[6]

Mixtral of Experts

Albert Q Jiang, Alexandre Sablayrolles, Antoine Roux, Arthur Mensch, Blanche Savary, Chris Bamford, Devendra Singh Chaplot, Diego de las Casas, Emma Bou Hanna, Florian Bressand, et al. Mixtral of experts.arXiv preprint arXiv:2401.04088,

work page internal anchor Pith review Pith/arXiv arXiv
[7]

Scaling Laws for Neural Language Models

Jared Kaplan, Sam McCandlish, Tom Henighan, Tom B Brown, Benjamin Chess, Rewon Child, Scott Gray, Alec Radford, Jeffrey Wu, and Dario Amodei. Scaling laws for neural language models. arXiv preprint arXiv:2001.08361,

work page internal anchor Pith review Pith/arXiv arXiv 2001
[8]

GShard: Scaling Giant Models with Conditional Computation and Automatic Sharding

Dmitry Lepikhin, HyoukJoong Lee, Yuanzhong Xu, Dehao Chen, Orhan Firat, Yanping Huang, Maxim Krikun, Noam Shazeer, and Zhifeng Chen. Gshard: Scaling giant models with conditional computation and automatic sharding.arXiv preprint arXiv:2006.16668,

work page internal anchor Pith review Pith/arXiv arXiv 2006
[9]

doi:10.48550/arxiv.2109.02008 , arxivId =

Yuxuan Lou, Fuzhao Xue, Zangwei Zheng, and Yang You. Cross-token modeling with conditional computation.arXiv preprint arXiv:2109.02008,

work page arXiv
[10]

A comprehensive survey of mixture-of-experts: Algorithms, theory, and applications,

21 Siyuan Mu and Sen Lin. A comprehensive survey of mixture-of-experts: Algorithms, theory, and applications.arXiv preprint arXiv:2503.07137,

work page arXiv
[11]

A general theory for softmax gating multinomial logistic mixture of experts.arXiv preprint arXiv:2310.14188,

Huy Nguyen, Pedram Akbarian, TrungTin Nguyen, and Nhat Ho. A general theory for softmax gating multinomial logistic mixture of experts.arXiv preprint arXiv:2310.14188,

work page arXiv
[12]

Statistical advantages of perturbing cosine router in mixture of experts.arXiv preprint arXiv:2405.14131, 2024a

Huy Nguyen, Pedram Akbarian, Trang Pham, Trang Nguyen, Shujian Zhang, and Nhat Ho. Statistical advantages of perturbing cosine router in mixture of experts.arXiv preprint arXiv:2405.14131, 2024a. Huy Nguyen, TrungTin Nguyen, Khai Nguyen, and Nhat Ho. Towards convergence rates for parameter estimation in gaussian-gated mixture of experts. InInternational C...

work page arXiv
[13]

On the number of response regions of deep feed forward networks with piece-wise linear activations

Razvan Pascanu, Guido Montufar, and Yoshua Bengio. On the number of response regions of deep feed forward networks with piece-wise linear activations.arXiv preprint arXiv:1312.6098,

work page internal anchor Pith review Pith/arXiv arXiv
[14]

The intrinsic dimension of images and its impact on learning.arXiv preprint arXiv:2104.08894,

Phillip Pope, Chen Zhu, Ahmed Abdelkader, Micah Goldblum, and Tom Goldstein. The intrinsic dimension of images and its impact on learning.arXiv preprint arXiv:2104.08894,

work page arXiv
[15]

Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer

Noam Shazeer, Azalia Mirhoseini, Krzysztof Maziarz, Andy Davis, Quoc Le, Geoffrey Hinton, and Jeff Dean. Outrageously large neural networks: The sparsely-gated mixture-of-experts layer.arXiv preprint arXiv:1701.06538,

work page internal anchor Pith review Pith/arXiv arXiv
[16]

Congruences on tropical rational function semifields and tropical curves.arXiv preprint arXiv:2305.04204,

JuAe Song. Congruences on tropical rational function semifields and tropical curves.arXiv preprint arXiv:2305.04204,

work page arXiv
[17]

The Tropical Grassmannian

David Speyer and Bernd Sturmfels. The tropical grassmannian.arXiv preprint math/0304218,

work page internal anchor Pith review Pith/arXiv arXiv
[18]

Variational inference, entropy, and orthogonality: A unified theory of mixture- of-experts.arXiv preprint arXiv:2601.03577,

Ye Su and Yong Liu. Variational inference, entropy, and orthogonality: A unified theory of mixture- of-experts.arXiv preprint arXiv:2601.03577,

work page arXiv
[19]

Minimum volumes of tropical rational functions.arXiv preprint arXiv:2409.18348,

Masayuki Sukenaga. Minimum volumes of tropical rational functions.arXiv preprint arXiv:2409.18348,

work page arXiv
[20]

Symmetric and r-Symmetric Tropical Polynomials and Rational Functions

22 Sara Kalisnik Verovsek and Gunnar Carlsson. Symmetric and r-symmetric tropical polynomials and rational functions.arXiv preprint arXiv:1405.2268,

work page internal anchor Pith review Pith/arXiv arXiv
[21]

The inequality Eq

LetI\J={u 1, u2}andJ\I={v 1, v2}. The inequality Eq. (4) becomes: zu1(x) +z u2(x)≥z v1(x) +z v2(x). However, ifx∈ V(I) , then coalition I must also defeat the intermediate coalitionsL1 = (I\ {u1})∪ {v1} and L2 = (I\ {u 2})∪ {v 2}. Note that L1, L2 ∈ N k are valid coalitions that differ from I by onlym= 1element. The constraints imposed byL 1 andL 2 are: S...

work page 2014
[22]

The Top-1 router assigns each input x∈R din to exactly one expert based on the logit scores z1(x),

While the exact counts differ between the affine arrangement and the dual zonotope, this framework demonstrates that the total geometric capacity is governed bythe vertex complexity of the zonotope, sharing the same asymptotic growth rate Θ(H d). The Top-1 router assigns each input x∈R din to exactly one expert based on the logit scores z1(x), . . . , zN(...

work page 1975
[23]

= (kH) din − din(din −1) 2 (kH) din−1 +O (kH) din−2

= (kH) din − din−1X m=0 m ! (kH) din−1 +. . . = (kH) din − din(din −1) 2 (kH) din−1 +O (kH) din−2 . Substituting this back into the capacity bound: Φ(kH, din) = (kH) din din! +O((kH) din−1). Next, we expand the combinatorial multiplier N k forN≫k: N k = N! k!(N−k)! = 1 k! k−1Y m=0 (N−m) = 1 k! N k − k(k−1) 2 N k−1 +O(N k−2) . Combining the asymptotic expa...

work page 1975

[1] [1]

Quadratic gating functions in mixture of experts: A statistical insight.arXiv preprint arXiv:2410.11222,

Pedram Akbarian, Huy Nguyen, Xing Han, and Nhat Ho. Quadratic gating functions in mixture of experts: A statistical insight.arXiv preprint arXiv:2410.11222,

work page arXiv

[2] [2]

The real tropical geometry of neural networks.arXiv preprint arXiv:2403.11871,

Marie-Charlotte Brandenburg, Georg Loho, and Guido Montúfar. The real tropical geometry of neural networks.arXiv preprint arXiv:2403.11871,

work page arXiv

[3] [3]

A Tropical Approach to Neural Networks with Piecewise Linear Activations

Vasileios Charisopoulos and Petros Maragos. A tropical approach to neural networks with piecewise linear activations.arXiv preprint arXiv:1805.08749,

work page internal anchor Pith review Pith/arXiv arXiv

[4] [4]

Damai Dai, Chengqi Deng, Chenggang Zhao, RX Xu, Huazuo Gao, Deli Chen, Jiashi Li, Wangding Zeng, Xingkai Yu, Yu Wu, et al

doi: 10.1109/PGEC.1965.264137. Damai Dai, Chengqi Deng, Chenggang Zhao, RX Xu, Huazuo Gao, Deli Chen, Jiashi Li, Wangding Zeng, Xingkai Yu, Yu Wu, et al. Deepseekmoe: Towards ultimate expert specialization in mixture- of-experts language models.arXiv preprint arXiv:2401.06066,

work page doi:10.1109/pgec.1965.264137 1965

[5] [5]

Tropnnc: Structured neural network compression using tropical geometry.arXiv preprint arXiv:2409.03945,

Konstantinos Fotopoulos, Petros Maragos, and Panagiotis Misiakos. Tropnnc: Structured neural network compression using tropical geometry.arXiv preprint arXiv:2409.03945,

work page arXiv

[6] [6]

Mixtral of Experts

Albert Q Jiang, Alexandre Sablayrolles, Antoine Roux, Arthur Mensch, Blanche Savary, Chris Bamford, Devendra Singh Chaplot, Diego de las Casas, Emma Bou Hanna, Florian Bressand, et al. Mixtral of experts.arXiv preprint arXiv:2401.04088,

work page internal anchor Pith review Pith/arXiv arXiv

[7] [7]

Scaling Laws for Neural Language Models

Jared Kaplan, Sam McCandlish, Tom Henighan, Tom B Brown, Benjamin Chess, Rewon Child, Scott Gray, Alec Radford, Jeffrey Wu, and Dario Amodei. Scaling laws for neural language models. arXiv preprint arXiv:2001.08361,

work page internal anchor Pith review Pith/arXiv arXiv 2001

[8] [8]

GShard: Scaling Giant Models with Conditional Computation and Automatic Sharding

Dmitry Lepikhin, HyoukJoong Lee, Yuanzhong Xu, Dehao Chen, Orhan Firat, Yanping Huang, Maxim Krikun, Noam Shazeer, and Zhifeng Chen. Gshard: Scaling giant models with conditional computation and automatic sharding.arXiv preprint arXiv:2006.16668,

work page internal anchor Pith review Pith/arXiv arXiv 2006

[9] [9]

doi:10.48550/arxiv.2109.02008 , arxivId =

Yuxuan Lou, Fuzhao Xue, Zangwei Zheng, and Yang You. Cross-token modeling with conditional computation.arXiv preprint arXiv:2109.02008,

work page arXiv

[10] [10]

A comprehensive survey of mixture-of-experts: Algorithms, theory, and applications,

21 Siyuan Mu and Sen Lin. A comprehensive survey of mixture-of-experts: Algorithms, theory, and applications.arXiv preprint arXiv:2503.07137,

work page arXiv

[11] [11]

A general theory for softmax gating multinomial logistic mixture of experts.arXiv preprint arXiv:2310.14188,

Huy Nguyen, Pedram Akbarian, TrungTin Nguyen, and Nhat Ho. A general theory for softmax gating multinomial logistic mixture of experts.arXiv preprint arXiv:2310.14188,

work page arXiv

[12] [12]

Statistical advantages of perturbing cosine router in mixture of experts.arXiv preprint arXiv:2405.14131, 2024a

Huy Nguyen, Pedram Akbarian, Trang Pham, Trang Nguyen, Shujian Zhang, and Nhat Ho. Statistical advantages of perturbing cosine router in mixture of experts.arXiv preprint arXiv:2405.14131, 2024a. Huy Nguyen, TrungTin Nguyen, Khai Nguyen, and Nhat Ho. Towards convergence rates for parameter estimation in gaussian-gated mixture of experts. InInternational C...

work page arXiv

[13] [13]

On the number of response regions of deep feed forward networks with piece-wise linear activations

Razvan Pascanu, Guido Montufar, and Yoshua Bengio. On the number of response regions of deep feed forward networks with piece-wise linear activations.arXiv preprint arXiv:1312.6098,

work page internal anchor Pith review Pith/arXiv arXiv

[14] [14]

The intrinsic dimension of images and its impact on learning.arXiv preprint arXiv:2104.08894,

Phillip Pope, Chen Zhu, Ahmed Abdelkader, Micah Goldblum, and Tom Goldstein. The intrinsic dimension of images and its impact on learning.arXiv preprint arXiv:2104.08894,

work page arXiv

[15] [15]

Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer

Noam Shazeer, Azalia Mirhoseini, Krzysztof Maziarz, Andy Davis, Quoc Le, Geoffrey Hinton, and Jeff Dean. Outrageously large neural networks: The sparsely-gated mixture-of-experts layer.arXiv preprint arXiv:1701.06538,

work page internal anchor Pith review Pith/arXiv arXiv

[16] [16]

Congruences on tropical rational function semifields and tropical curves.arXiv preprint arXiv:2305.04204,

JuAe Song. Congruences on tropical rational function semifields and tropical curves.arXiv preprint arXiv:2305.04204,

work page arXiv

[17] [17]

The Tropical Grassmannian

David Speyer and Bernd Sturmfels. The tropical grassmannian.arXiv preprint math/0304218,

work page internal anchor Pith review Pith/arXiv arXiv

[18] [18]

Variational inference, entropy, and orthogonality: A unified theory of mixture- of-experts.arXiv preprint arXiv:2601.03577,

Ye Su and Yong Liu. Variational inference, entropy, and orthogonality: A unified theory of mixture- of-experts.arXiv preprint arXiv:2601.03577,

work page arXiv

[19] [19]

Minimum volumes of tropical rational functions.arXiv preprint arXiv:2409.18348,

Masayuki Sukenaga. Minimum volumes of tropical rational functions.arXiv preprint arXiv:2409.18348,

work page arXiv

[20] [20]

Symmetric and r-Symmetric Tropical Polynomials and Rational Functions

22 Sara Kalisnik Verovsek and Gunnar Carlsson. Symmetric and r-symmetric tropical polynomials and rational functions.arXiv preprint arXiv:1405.2268,

work page internal anchor Pith review Pith/arXiv arXiv

[21] [21]

The inequality Eq

LetI\J={u 1, u2}andJ\I={v 1, v2}. The inequality Eq. (4) becomes: zu1(x) +z u2(x)≥z v1(x) +z v2(x). However, ifx∈ V(I) , then coalition I must also defeat the intermediate coalitionsL1 = (I\ {u1})∪ {v1} and L2 = (I\ {u 2})∪ {v 2}. Note that L1, L2 ∈ N k are valid coalitions that differ from I by onlym= 1element. The constraints imposed byL 1 andL 2 are: S...

work page 2014

[22] [22]

The Top-1 router assigns each input x∈R din to exactly one expert based on the logit scores z1(x),

While the exact counts differ between the affine arrangement and the dual zonotope, this framework demonstrates that the total geometric capacity is governed bythe vertex complexity of the zonotope, sharing the same asymptotic growth rate Θ(H d). The Top-1 router assigns each input x∈R din to exactly one expert based on the logit scores z1(x), . . . , zN(...

work page 1975

[23] [23]

= (kH) din − din(din −1) 2 (kH) din−1 +O (kH) din−2

= (kH) din − din−1X m=0 m ! (kH) din−1 +. . . = (kH) din − din(din −1) 2 (kH) din−1 +O (kH) din−2 . Substituting this back into the capacity bound: Φ(kH, din) = (kH) din din! +O((kH) din−1). Next, we expand the combinatorial multiplier N k forN≫k: N k = N! k!(N−k)! = 1 k! k−1Y m=0 (N−m) = 1 k! N k − k(k−1) 2 N k−1 +O(N k−2) . Combining the asymptotic expa...

work page 1975