pith. sign in

arxiv: 2606.17952 · v1 · pith:SINZQWDOnew · submitted 2026-06-16 · 💻 cs.LG · cs.AI

SoftMoE: Soft Differentiable Routing for Mixture-of-Experts in LLMs

Pith reviewed 2026-06-27 01:55 UTC · model grok-4.3

classification 💻 cs.LG cs.AI
keywords mixture of expertsdifferentiable routinglarge language modelssoft top-kexpert allocationautoregressive modelssparse activation
0
0 comments X

The pith

SoftMoE replaces discrete top-k routing with a differentiable soft relaxation so mixture-of-experts models can learn variable expert counts per layer under a global budget.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper replaces the non-differentiable top-k selection in sparse mixture-of-experts layers with a truncated soft top-k LapSum relaxation. This change makes routing decisions trainable by gradient descent while preserving compatibility with autoregressive language modeling. The approach also introduces a per-layer parameterization of the average number of active experts together with a global computation budget, so the network itself decides how to spread capacity across depth. Experiments show that the resulting models reach language-modeling and downstream performance comparable to or better than fixed sparse MoE while activating fewer experts on average. The learned pattern is strongly non-uniform, with later layers using more experts than early ones.

Core claim

SoftMoE introduces a truncated soft top-k LapSum relaxation that turns expert selection into a differentiable operation. Combined with per-layer mean-expert-count parameters and a global budget constraint, the method lets the model optimize both which experts to use and how many to activate at each layer. The resulting networks remain fully causal, match or exceed the performance of standard sparse MoE on language modeling and downstream tasks, and do so while activating significantly fewer experts overall. The learned allocation is highly non-uniform, concentrating more active experts in later layers.

What carries the argument

truncated soft top-k LapSum relaxation that approximates discrete expert selection with a differentiable operator while enforcing sparsity

If this is right

  • Expert counts can vary across layers without being set by hand.
  • Total computation can be controlled by a single global budget rather than per-layer fixed k.
  • Routing decisions become part of the end-to-end optimization rather than a fixed preprocessing step.
  • Later layers can receive more expert capacity when the data reward it.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same relaxation technique could be tested on other discrete selection problems inside sequence models.
  • Non-uniform layer-wise allocations may indicate where additional capacity yields the largest returns during pre-training.
  • Dynamic per-input expert counts could emerge if the global budget were made input-dependent.

Load-bearing premise

The soft top-k relaxation approximates discrete routing closely enough that sparsity benefits and training stability are retained without performance loss.

What would settle it

A controlled comparison on a standard autoregressive language-modeling benchmark in which SoftMoE produces higher perplexity than a fixed top-k sparse MoE while activating the same or greater average number of experts.

Figures

Figures reproduced from arXiv: 2606.17952 by Jacek Tabor, {\L}ukasz Struski, Marcin Kurdziel, Miko{\l}aj Zasada.

Figure 1
Figure 1. Figure 1: Learned expert allocation reveals non-uniform capac￾ity needs across layers of a MoE model. Each color shows the distribution of activated experts in one layer of 10-layer model. Top: standard sparse MoE fixes the number of activated experts a priori. Middle: under soft top-k routing, the number of activated experts depends on the input token. Bottom: learned allocation re￾distributes active experts budget… view at source ↗
Figure 2
Figure 2. Figure 2: Architecture of a SoftMoE layer. The dense MLP layers in each transformer block are replaced by smaller expert MLPs with soft routing (light green box). The router computes scores over all experts, which are transformed by the SoftTopK operator into soft selection weights pi ∈ [0, 1]. Weights below threshold are truncated to zero, yielding sparse but input-dependent expert activation. both the routing scor… view at source ↗
Figure 3
Figure 3. Figure 3: illustrates the SoftMoE transformer architecture with learned allocation parameters. Computational Overhead. Operations in SoftMoE router add minimal overhead compared to sparse MoE routing. LapSum has O(n) computational cost, where n is the num￾ber of experts. Memory usage is also O(n) per token vs. O(k) for sparse MoE. Given typical expert counts in MoE layers (n = 32–64), this is negligible compared to … view at source ↗
Figure 4
Figure 4. Figure 4: Training loss (left) and average number of active experts per token per layer (right) for Sparse MoE and SoftMoE* routing (no 1 learned expert allocation across layers). Top-k selection fixes the number of active experts a priori. Under soft routing, active experts count depends on the distribution of soft weights and evolves during training. SoftMoE*-1 SoftMoE*-1.5 SoftMoE-1.5 SoftMoE*-2 SoftMoE-2 20k 40k… view at source ↗
Figure 5
Figure 5. Figure 5: Training loss (left) and average number of active experts per token per layer (right) for soft routing without learning expert 1 allocation (SoftMoE*) and with learned allocation (SoftMoE). pare SoftMoE against the Switch Transformer (Fedus et al., 2022) architecture, the de-facto standard MoE architecture for autoregressive language models. Except for different routing networks, all evaluated models share… view at source ↗
Figure 6
Figure 6. Figure 6: Evolution of the expert allocation parameters 1 kl during training. SoftMoE architecture redistributes the active expert bud￾get towards the final layers of the model. 5.1. Results SoftMoE* vs. Sparse MoE. We begin evaluation with comparison of sparse MoE baselines against SoftMoE rout￾ing without learning expert allocation (SoftMoE* models) [PITH_FULL_IMAGE:figures/full_fig_p007_6.png] view at source ↗
Figure 7
Figure 7. Figure 7: Evolution of the expert allocation parameters ( 1 kl) during SoftMoE pretraining on OWT. Layer 1 2 3 4 5 6 7 8 9 10 0 25k 50k 75k 100k 125k 150k 1.0 1.5 2.0 2.5 3.0 3.5 SoftMoE-1.5, α=2 0 25k 50k 75k 100k 125k 150k 1 2 3 4 5 SoftMoE-2, α=2 Step Expert Allocation [PITH_FULL_IMAGE:figures/full_fig_p012_7.png] view at source ↗
Figure 8
Figure 8. Figure 8: Evolution of the expert allocation parameters ( 1 kl) during SoftMoE pretraining on C4. B. Training Hyperparameters Configuration of models used in experiments is summarized in [PITH_FULL_IMAGE:figures/full_fig_p012_8.png] view at source ↗
read the original abstract

Sparse Mixture-of-Experts (MoE) architectures enable scaling LLM parameters under a fixed inference budget by activating only a small subset of experts via top-$k$ routing. While this preserves causality and suits autoregressive language models, the discrete top-$k$ operator is not differentiable, forcing a fixed number of active experts per input and resulting in inefficient use of computation. We propose SoftMoE, which replaces discrete routing with a truncated soft top-$k$ LapSum relaxation, allowing gradient-based optimization of expert routing. We further parameterize the mean number of active experts per layer and impose a global budget constraint, enabling the model to learn how to allocate expert capacity across layers. SoftMoE remains fully compatible with autoregressive modeling and achieves performance comparable to or better than sparse MoE on language modeling and downstream tasks, while activating significantly fewer experts. Notably, the learned allocation is highly non-uniform, with later layers activating more experts. The source code is publicly available$^\dagger$.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The manuscript introduces SoftMoE, which replaces the non-differentiable discrete top-k routing in sparse Mixture-of-Experts models for LLMs with a truncated soft top-k LapSum relaxation. This enables gradient-based optimization of routing decisions while parameterizing the mean number of active experts per layer under a global budget constraint. The work claims full compatibility with autoregressive language modeling, performance that is comparable to or better than standard sparse MoE while activating significantly fewer experts, and a learned non-uniform allocation in which later layers activate more experts. Source code is stated to be publicly available.

Significance. If the performance and stability claims hold, the approach could meaningfully advance efficient scaling of LLMs by allowing models to learn variable expert budgets across layers instead of relying on fixed top-k sparsity. This has potential to improve compute allocation in deep transformer stacks. Public code release aids reproducibility.

major comments (2)
  1. [Abstract] Abstract: the performance claims (comparable or better results than sparse MoE while activating significantly fewer experts) are presented with no experimental results, ablation studies, tables, figures, or implementation details anywhere in the provided manuscript text; this absence is load-bearing for the central contribution.
  2. [Abstract] Abstract: no quantitative bound, error analysis, or empirical validation is supplied for how accurately the truncated soft top-k LapSum relaxation approximates discrete top-k routing or whether it preserves sparsity benefits and autoregressive compatibility without introducing bias or instability; this is the key unverified assumption underlying all downstream claims.
minor comments (1)
  1. [Abstract] The LapSum relaxation is referenced by name but its mathematical definition, truncation mechanism, and relation to standard soft top-k operators are not supplied in the abstract, impeding immediate assessment of the method.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We address the two major comments point-by-point below. The full manuscript contains the supporting experiments and analysis referenced in the abstract; we will revise to improve visibility and explicit cross-references.

read point-by-point responses
  1. Referee: [Abstract] Abstract: the performance claims (comparable or better results than sparse MoE while activating significantly fewer experts) are presented with no experimental results, ablation studies, tables, figures, or implementation details anywhere in the provided manuscript text; this absence is load-bearing for the central contribution.

    Authors: The manuscript includes Section 4 (Experiments) containing tables and figures that directly compare perplexity and downstream task accuracy against standard sparse MoE baselines, ablations on the routing mechanism and budget parameterization, and plots of learned expert allocation per layer. Implementation details, hyperparameters, and training setup appear in the appendix. We will revise the abstract to include a one-sentence summary of the key quantitative results and add explicit pointers to Section 4 and the relevant figures/tables. revision: yes

  2. Referee: [Abstract] Abstract: no quantitative bound, error analysis, or empirical validation is supplied for how accurately the truncated soft top-k LapSum relaxation approximates discrete top-k routing or whether it preserves sparsity benefits and autoregressive compatibility without introducing bias or instability; this is the key unverified assumption underlying all downstream claims.

    Authors: Section 3.2 derives approximation error bounds for the truncated LapSum operator relative to exact top-k and proves that the soft relaxation preserves per-token autoregressive causality. Section 4.2 reports empirical comparisons (including activation counts and stability metrics) showing that the learned soft routing closely matches discrete top-k performance while enabling end-to-end optimization. We will expand the abstract to reference this analysis and move the error-bound derivation from the appendix into the main text. revision: yes

Circularity Check

0 steps flagged

No significant circularity; proposal is a novel relaxation with empirical validation

full rationale

The paper introduces SoftMoE as a new truncated soft top-k LapSum relaxation to enable differentiable routing, with a parameterized mean active experts and global budget constraint. No step reduces a claimed prediction or first-principles result to its own inputs by construction. The central claims rest on experimental comparisons to sparse MoE baselines rather than self-citations, fitted parameters renamed as predictions, or ansatzes smuggled via prior work. The derivation chain is self-contained: the relaxation is defined explicitly, training proceeds via standard gradients, and performance/non-uniform allocation results are reported from runs, not forced by the method's definition. This matches the expected non-circular case for a methods paper presenting an approximation technique.

Axiom & Free-Parameter Ledger

1 free parameters · 1 axioms · 0 invented entities

Review is abstract-only; ledger entries are inferred directly from stated elements. The central free parameter is the mean number of active experts per layer. The key domain assumption is that the LapSum relaxation functions as a usable differentiable proxy for discrete routing.

free parameters (1)
  • mean number of active experts per layer
    Explicitly parameterized and optimized under the global budget constraint as described in the abstract.
axioms (1)
  • domain assumption The truncated soft top-k LapSum relaxation is a valid differentiable surrogate for discrete top-k selection that preserves autoregressive compatibility.
    Invoked to justify gradient-based optimization of routing.

pith-pipeline@v0.9.1-grok · 5715 in / 1473 out tokens · 43906 ms · 2026-06-27T01:55:13.183726+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

45 extracted references · 4 linked inside Pith

  1. [1]

    Pawan Kumar , title =

    Leonard Berrada and Andrew Zisserman and M. Pawan Kumar , title =. 6th International Conference on Learning Representations,

  2. [2]

    Chi , title =

    Minmin Chen and Alex Beutel and Paul Covington and Sagar Jain and Francois Belletti and Ed H. Chi , title =. Proceedings of the Twelfth

  3. [3]

    Xiang Chen and Hao Li and Mingqiang Li and Jinshan Pan , title =

  4. [4]

    Advances in Neural Information Processing Systems 37: Annual Conference on Neural Information Processing Systems 2024, NeurIPS 2024 , year =

    Corinna Cortes and Anqi Mao and Christopher Mohri and Mehryar Mohri and Yutao Zhong , title =. Advances in Neural Information Processing Systems 37: Annual Conference on Neural Information Processing Systems 2024, NeurIPS 2024 , year =

  5. [5]

    Differentiable Ranking and Sorting using Optimal Transport , booktitle =

    Marco Cuturi and Olivier Teboul and Jean. Differentiable Ranking and Sorting using Optimal Transport , booktitle =

  6. [6]

    The Thirty-Fourth

    Yonatan Bisk and Rowan Zellers and Ronan Le Bras and Jianfeng Gao and Yejin Choi , title =. The Thirty-Fourth

  7. [7]

    Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics,

    Rowan Zellers and Ari Holtzman and Yonatan Bisk and Ali Farhadi and Yejin Choi , title =. Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics,

  8. [8]

    9th International Conference on Learning Representations,

    Dmitry Lepikhin and HyoukJoong Lee and Yuanzhong Xu and Dehao Chen and Orhan Firat and Yanping Huang and Maxim Krikun and Noam Shazeer and Zhifeng Chen , title =. 9th International Conference on Learning Representations,

  9. [9]

    The Eleventh International Conference on Learning Representations,

    Tianlin Liu and Joan Puigcerver and Mathieu Blondel , title =. The Eleventh International Conference on Learning Representations,

  10. [10]

    The Twelfth International Conference on Learning Representations,

    Xinyu Zhao and Xuxi Chen and Yu Cheng and Tianlong Chen , title =. The Twelfth International Conference on Learning Representations,

  11. [11]

    Hechtman and Trevor Cai and Sebastian Borgeaud and George van den Driessche and Eliza Rutherford and Tom Hennigan and Matthew J

    Aidan Clark and Diego de Las Casas and Aurelia Guy and Arthur Mensch and Michela Paganini and Jordan Hoffmann and Bogdan Damoc and Blake A. Hechtman and Trevor Cai and Sebastian Borgeaud and George van den Driessche and Eliza Rutherford and Tom Hennigan and Matthew J. Johnson and Albin Cassirer and Chris Jones and Elena Buchatskaya and David Budden and La...

  12. [12]

    Proceedings of the 38th International Conference on Machine Learning,

    Mike Lewis and Shruti Bhosale and Tim Dettmers and Naman Goyal and Luke Zettlemoyer , title =. Proceedings of the 38th International Conference on Machine Learning,

  13. [13]

    Scaling Laws for Fine-Grained Mixture of Experts , booktitle =

    Jan Ludziejewski and Jakub Krajewski and Kamil Adamczewski and Maciej Pi. Scaling Laws for Fine-Grained Mixture of Experts , booktitle =

  14. [14]

    Proceedings of the 42nd International Conference on Machine Learning,

  15. [15]

    Mixture of Tokens: Continuous

    Szymon Antoniak and Michal Krutul and Maciej Pi. Mixture of Tokens: Continuous. Advances in Neural Information Processing Systems 37: Annual Conference on Neural Information Processing Systems 2024, NeurIPS 2024 , year =

  16. [16]

    Rush and Boaz Barak and Teven Le Scao and Nouamane Tazi and Aleksandra Piktus and Sampo Pyysalo and Thomas Wolf and Colin A

    Niklas Muennighoff and Alexander M. Rush and Boaz Barak and Teven Le Scao and Nouamane Tazi and Aleksandra Piktus and Sampo Pyysalo and Thomas Wolf and Colin A. Raffel , title =. Advances in Neural Information Processing Systems 36: Annual Conference on Neural Information Processing Systems 2023, NeurIPS 2023 , year =

  17. [17]

    Advances in Neural Information Processing Systems 34: Annual Conference on Neural Information Processing Systems 2021, NeurIPS 2021 , pages =

    Stephen Roller and Sainbayar Sukhbaatar and Arthur Szlam and Jason Weston , title =. Advances in Neural Information Processing Systems 34: Annual Conference on Neural Information Processing Systems 2021, NeurIPS 2021 , pages =

  18. [18]

    Zhao and Andrew M

    Yanqi Zhou and Tao Lei and Hanxiao Liu and Nan Du and Yanping Huang and Vincent Y. Zhao and Andrew M. Dai and Zhifeng Chen and Quoc V. Le and James Laudon , title =. Advances in Neural Information Processing Systems 35: Annual Conference on Neural Information Processing Systems 2022, NeurIPS 2022 , year =

  19. [19]

    arXiv preprint arXiv:1909.08053 , year =

    Mohammad Shoeybi and Mostofa Patwary and Raul Puri and Patrick LeGresley and Jared Casper and Bryan Catanzaro , title =. arXiv preprint arXiv:1909.08053 , year =

  20. [20]

    Journal of Machine Learning Research , year =

    William Fedus and Barret Zoph and Noam Shazeer , title =. Journal of Machine Learning Research , year =

  21. [21]

    arXiv preprint arXiv:2102.03315 , year =

    Sumithra Bhakthavatsalam and Daniel Khashabi and Tushar Khot and Bhavana Dalvi Mishra and Kyle Richardson and Ashish Sabharwal and Carissa Schoenick and Oyvind Tafjord and Peter Clark , title =. arXiv preprint arXiv:2102.03315 , year =

  22. [22]

    Scaling Vision with Sparse Mixture of Experts , booktitle =

    Carlos Riquelme and Joan Puigcerver and Basil Mustafa and Maxim Neumann and Rodolphe Jenatton and Andr. Scaling Vision with Sparse Mixture of Experts , booktitle =

  23. [23]

    Rae and Oriol Vinyals and Laurent Sifre , title =

    Jordan Hoffmann and Sebastian Borgeaud and Arthur Mensch and Elena Buchatskaya and Trevor Cai and Eliza Rutherford and Diego de Las Casas and Lisa Anne Hendricks and Johannes Welbl and Aidan Clark and Tom Hennigan and Eric Noland and Katie Millican and George van den Driessche and Bogdan Damoc and Aurelia Guy and Simon Osindero and Karen Simonyan and Eric...

  24. [24]

    arXiv preprint arXiv:2404.05567 , year =

    Bowen Pan and Yikang Shen and Haokun Liu and Mayank Mishra and Gaoyuan Zhang and Aude Oliva and Colin Raffel and Rameswar Panda , title =. arXiv preprint arXiv:2404.05567 , year =

  25. [25]

    Conditional Computation in Neural Networks for faster models , journal =

    Emmanuel Bengio and Pierre. Conditional Computation in Neural Networks for faster models , journal =

  26. [26]

    Le and Geoffrey E

    Noam Shazeer and Azalia Mirhoseini and Krzysztof Maziarz and Andy Davis and Quoc V. Le and Geoffrey E. Hinton and Jeff Dean , title =. 5th International Conference on Learning Representations,

  27. [27]

    Liu , title =

    Colin Raffel and Noam Shazeer and Adam Roberts and Katherine Lee and Sharan Narang and Michael Matena and Yanqi Zhou and Wei Li and Peter J. Liu , title =. Journal of Machine Learning Research , volume =

  28. [28]

    Jacobs and Michael I

    Robert A. Jacobs and Michael I. Jordan and Steven J. Nowlan and Geoffrey E. Hinton , title =. Neural Comput. , volume =

  29. [29]

    Jordan and Robert A

    Michael I. Jordan and Robert A. Jacobs , title =. Neural Comput. , volume =

  30. [30]

    Weilin Cai and Juyong Jiang and Fan Wang and Jing Tang and Sunghun Kim and Jiayi Huang , title =

  31. [31]

    Proceedings of the 39th International Conference on Machine Learning,

    Camille Garcin and Maximilien Servajean and Alexis Joly and Joseph Salmon , title =. Proceedings of the 39th International Conference on Machine Learning,

  32. [32]

    Gokaslan, Aaron and Cohen, Vanya , howpublished =

  33. [33]

    Sparsity in Deep Learning: Pruning and growth for efficient inference and training in neural networks , journal =

    Torsten Hoefler and Dan Alistarh and Tal Ben. Sparsity in Deep Learning: Pruning and growth for efficient inference and training in neural networks , journal =. 2021 , volume =

  34. [34]

    Brown and Benjamin Chess and Rewon Child and Scott Gray and Alec Radford and Jeffrey Wu and Dario Amodei , title =

    Jared Kaplan and Sam McCandlish and Tom Henighan and Tom B. Brown and Benjamin Chess and Rewon Child and Scott Gray and Alec Radford and Jeffrey Wu and Dario Amodei , title =. arXiv preprint arXiv:2001.08361 , year =

  35. [35]

    Maksim Lapin and Matthias Hein and Bernt Schiele , title =. 2016

  36. [36]

    arXiv preprint arXiv:2403.19625 , year =

    Anqi Mao and Mehryar Mohri and Yutao Zhong , title =. arXiv preprint arXiv:2403.19625 , year =

  37. [37]

    Murphy and Shuchin Aeron , title =

    Shoaib Bin Masud and Matthew Werenski and James M. Murphy and Shuchin Aeron , title =. Journal of Machine Learning Research , volume=

  38. [38]

    The Twelfth International Conference on Learning Representations,

    Joan Puigcerver and Carlos Riquelme Ruiz and Basil Mustafa and Neil Houlsby , title =. The Twelfth International Conference on Learning Representations,

  39. [39]

    2019 , howpublished =

    Language Models are Unsupervised Multitask Learners , author =. 2019 , howpublished =

  40. [40]

    Advances in Neural Information Processing Systems 33: Annual Conference on Neural Information Processing Systems 2020, NeurIPS 2020 , year =

    Yujia Xie and Hanjun Dai and Minshuo Chen and Bo Dai and Tuo Zhao and Hongyuan Zha and Wei Wei and Tomas Pfister , title =. Advances in Neural Information Processing Systems 33: Annual Conference on Neural Information Processing Systems 2020, NeurIPS 2020 , year =

  41. [41]

    Thirty-Seventh

    Lei Xu and Rong Wang and Feiping Nie and Xuelong Li , title =. Thirty-Seventh

  42. [42]

    arXiv preprint arXiv:1912.11637 , year =

    Guangxiang Zhao and Junyang Lin and Zhiyuan Zhang and Xuancheng Ren and Qi Su and Xu Sun , title =. arXiv preprint arXiv:1912.11637 , year =

  43. [43]

    arXiv preprint arXiv:2202.08906 , year =

    Barret Zoph and Irwan Bello and Sameer Kumar and Nan Du and Yanping Huang and Jeff Dean and Noam Shazeer and William Fedus , title =. arXiv preprint arXiv:2202.08906 , year =

  44. [44]

    2019 , publisher =

    Tenney, Ian and Das, Dipanjan and Pavlick, Ellie , booktitle =. 2019 , publisher =

  45. [45]

    What Does

    Jawahar, Ganesh and Sagot, Beno. What Does. Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics,. 2019 , publisher =