arxiv: 2605.26647 · v1 · pith:KDOTSNB6new · submitted 2026-05-26 · 💻 cs.LG · cs.AI · stat.ML

More Expressive Feedforward Layers: Part I. Token-Adaptive Mixing of Activations

Pith reviewed 2026-06-29 19:34 UTC · model grok-4.3

classification 💻 cs.LG · cs.AI · stat.ML
keywords feedforward networks · activation functions · transformer expressivity · mixture models · language model pretraining · token-adaptive layers

The pith

Mixture of Activations strictly increases the expressive power of feedforward layers by making nonlinearity selection depend on the input token.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces Mixture of Activations (MoA), a feedforward design that mixes several activation functions per token through lightweight input-dependent gates while keeping the linear projections shared. It also defines Learnable Activations (LA), the input-independent counterpart, which forms fixed linear combinations of the same functions. The theory proves strict finite-width separations: every fixed-activation FFN can be realized by LA but not vice versa, and every LA network can be realized by MoA but not vice versa, with the extra power coming from token-specific nonlinear mixing. Pretraining runs on dense and MoE language models from 0.12B to 2B parameters show MoA reaching lower terminal loss and better scaling than tuned baselines, with minimal parameter and compute overhead. The motivation is that FFN layers account for a large fraction of the parameters and nonlinear expressivity in Transformer-based LLMs.

Core claim

Mixture of Activations (MoA) mixes a dictionary of activation functions using lightweight input-dependent gates while sharing the same linear projections. This yields strict finite-width expressive separations where fixed-activation FFNs are contained in learnable activations (LA), which are contained in MoA. The added expressivity comes from input-dependent nonlinear hybridization. Pretraining experiments confirm lower terminal loss and better scaling.
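The three function classes in this claim can be written side by side. The notation below is an assumption made for illustration, since the summary gives no equations: a single-hidden-layer, non-gated (ReLU-type) FFN acting on one token, with hypothetical symbols W_1, W_2, alpha, and g. The paper's own parametrization, including its SwiGLU-type variant and the exact form of the gate, may differ.

```latex
% Assumed notation (not taken from the paper):
%   x \in \mathbb{R}^{d} is one token, W_1 \in \mathbb{R}^{m \times d},
%   W_2 \in \mathbb{R}^{d \times m}, and \{\sigma_1, \dots, \sigma_K\}
%   is the dictionary of activation functions, applied elementwise.
\begin{align*}
\text{Fixed activation:}\quad
  & f(x) = W_2\, \sigma(W_1 x) \\
\text{LA:}\quad
  & f(x) = W_2 \sum_{k=1}^{K} \alpha_k\, \sigma_k(W_1 x),
  && \alpha \in \mathbb{R}^{K} \text{ learned, independent of } x \\
\text{MoA:}\quad
  & f(x) = W_2 \sum_{k=1}^{K} g_k(x)\, \sigma_k(W_1 x),
  && g : \mathbb{R}^{d} \to \mathbb{R}^{K} \text{ a lightweight gate}
\end{align*}
```

Under this notation the two containments are immediate: taking alpha to be a one-hot vector recovers a fixed-activation FFN, and taking g(x) constant in x recovers LA, provided the gate family can output the required coefficients. The strictness of each containment is the part that needs the paper's finite-width constructions.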

What carries the argument

Mixture of Activations (MoA) with input-dependent gates that select and mix from multiple activation functions per token
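A minimal NumPy sketch of such a layer follows, assuming one softmax gate vector per token and a three-function dictionary. The function `moa_ffn`, the weight names, the choice of ReLU, GELU, and SiLU, and the softmax gate are all hypothetical; the summary does not specify the paper's dictionary or gate parametrization.

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def gelu(z):
    # tanh approximation of GELU
    return 0.5 * z * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (z + 0.044715 * z**3)))

def silu(z):
    return z / (1.0 + np.exp(-z))

# Hypothetical activation dictionary; the paper's actual choice is not stated here.
ACTIVATIONS = (relu, gelu, silu)

def softmax(logits):
    shifted = logits - logits.max(axis=-1, keepdims=True)  # numerical stability
    e = np.exp(shifted)
    return e / e.sum(axis=-1, keepdims=True)

def moa_ffn(x, w_in, w_out, w_gate):
    """Token-adaptive activation mixing (illustrative, not the paper's code).

    x: (tokens, d_model), w_in: (d_model, d_hidden),
    w_out: (d_hidden, d_model), w_gate: (d_model, K).
    """
    h = x @ w_in                                   # shared up-projection
    gates = softmax(x @ w_gate)                    # (tokens, K), one mixture per token
    acts = np.stack([act(h) for act in ACTIVATIONS], axis=-1)  # (tokens, d_hidden, K)
    mixed = np.einsum("thk,tk->th", acts, gates)   # per-token weighted sum of activations
    return mixed @ w_out, gates                    # shared down-projection

rng = np.random.default_rng(0)
d_model, d_hidden, n_tokens = 8, 32, 5
x = rng.normal(size=(n_tokens, d_model))
w_in = rng.normal(size=(d_model, d_hidden)) / np.sqrt(d_model)
w_out = rng.normal(size=(d_hidden, d_model)) / np.sqrt(d_hidden)
w_gate = rng.normal(size=(d_model, len(ACTIVATIONS)))

y, gates = moa_ffn(x, w_in, w_out, w_gate)
print(y.shape, gates.shape)  # (5, 8) (5, 3)
```

The only parameters added on top of a standard FFN in this sketch are the d_model x K entries of `w_gate`, which is one way to read the "lightweight" qualifier.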

Load-bearing premise

The lightweight input-dependent gates realize genuine input-dependent nonlinear hybridization without optimization difficulties or capacity limits erasing the theoretical separation in practice.

What would settle it

A finite-width counterexample showing that some MoA network can be exactly reproduced by a fixed-activation FFN of comparable width, or a set of pretraining runs in which MoA fails to reach lower terminal loss than well-tuned fixed or LA baselines.

Original abstract

Feedforward network (FFN) layers account for a large fraction of parameters and nonlinear expressivity in Transformer-based large language models (LLMs). Despite the evolution from ReLU and GELU to gated variants such as SwiGLU, most FFN designs still use a single fixed activation function, applying the same nonlinear transformation to all tokens. In this work, we propose Mixture of Activations (MoA), a token-adaptive FFN design that mixes a dictionary of activation functions using lightweight input-dependent gates while sharing the same linear projections. As an input-independent counterpart, we also introduce learnable activations (LA), which form linear combinations of activation functions for both ReLU-type and SwiGLU-type FFNs. Theoretically, we establish strict finite-width expressive separations among fixed-activation FFNs, LA, and MoA: LA strictly contains fixed-activation FFNs, while MoA strictly contains LA, with the additional expressivity arising from input-dependent nonlinear hybridization. Empirically, we evaluate MoA through extensive pre-training experiments on dense and MoE language models ranging from 0.12B to 2B parameters under different token budgets, optimizers, and learning rate schedules. MoA consistently achieves lower terminal loss and exhibits more favorable scaling behavior than well-tuned baselines, with minimal parameter and computational overhead. These results suggest that token-adaptive activation mixing is a simple and effective mechanism for improving FFN expressivity in LLMs.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper proposes Mixture of Activations (MoA) for FFN layers, which mixes a dictionary of activation functions via lightweight input-dependent gates while sharing linear projections; it also introduces Learnable Activations (LA) as the input-independent linear-combination counterpart. It claims strict finite-width expressive separations (fixed-activation FFNs ⊂ LA ⊂ MoA) arising from input-dependent nonlinear hybridization, and reports that MoA yields lower terminal loss and better scaling than tuned baselines in pre-training runs on 0.12B–2B dense and MoE models across token budgets, optimizers, and schedules, with negligible overhead.

Significance. If the finite-width separations are realized by non-degenerate gates in trained models and the observed loss improvements are causally linked to the extra expressivity (rather than optimization artifacts or capacity differences), the approach offers a low-overhead route to increasing FFN expressivity. The empirical scope across scales and setups is a strength, but significance hinges on confirming that the claimed hybridization occurs in practice.

major comments (2)
  1. [theoretical analysis] Theoretical separation claims (abstract and theory section): the strict containment MoA ⊃ LA is asserted to arise from input-dependent nonlinear hybridization. However, this separation is realized only if the learned gates vary meaningfully across tokens and the resulting hybrids lie outside the LA function class. The manuscript provides no post-training analysis of gate statistics, variance, or effective rank, leaving open the possibility that gradient dynamics cause gates to converge to near-constant values and collapse MoA to LA behavior. This directly undermines attribution of empirical gains to the theoretical separation.
  2. [experiments] Empirical evaluation (experiments section): the claim of consistent gains 'across scales, optimizers, and token budgets' is load-bearing for the practical contribution. Without ablations that isolate gate-induced hybridization (e.g., freezing gates to constants, measuring per-token activation diversity, or comparing against an LA baseline with matched parameter count), it is impossible to rule out that observed improvements stem from other factors such as implicit regularization or optimization landscape changes rather than the asserted extra expressivity.
minor comments (2)
  1. [methods] Notation for the gate functions and the dictionary of activations should be introduced with explicit equations early in the methods section to avoid ambiguity when comparing LA and MoA.
  2. The title states 'Part I', but the manuscript contains no forward reference to planned follow-up work or limitations of the current scope.
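The gate analysis requested in major comment 1 could be sketched as below. It is an illustration under assumptions, not the authors' protocol: `gate_diagnostics` is a hypothetical helper, the gates are taken to be a (tokens, K) matrix of nonnegative weights with rows summing to 1, and effective rank is computed with one common definition (the exponential of the entropy of the normalized singular values of the centered gate matrix).

```python
import numpy as np

def gate_diagnostics(gates, eps=1e-12):
    """Summarize how much a (tokens, K) matrix of gate weights varies across tokens."""
    gates = np.asarray(gates, dtype=float)

    # Variance of each gate coordinate across tokens; near zero means a constant gate.
    per_gate_variance = gates.var(axis=0)

    # Mean Shannon entropy of the per-token mixtures (assumes nonnegative gates).
    mean_entropy = float(-(gates * np.log(gates + eps)).sum(axis=1).mean())

    # Effective rank of the centered gate matrix: exp(entropy of normalized singular values).
    centered = gates - gates.mean(axis=0, keepdims=True)
    singular_values = np.linalg.svd(centered, compute_uv=False)
    total = singular_values.sum()
    if total < eps:
        effective_rank = 0.0  # gates are numerically constant across tokens
    else:
        p = singular_values / total
        effective_rank = float(np.exp(-(p * np.log(p + eps)).sum()))

    return {
        "per_gate_variance": per_gate_variance,
        "mean_entropy": mean_entropy,
        "effective_rank": effective_rank,
    }

# Synthetic gates for illustration only; real use would pass gates recorded from a trained model.
rng = np.random.default_rng(0)
logits = rng.normal(size=(200, 3))
varied = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)
collapsed = np.tile([0.2, 0.3, 0.5], (200, 1))

print(gate_diagnostics(varied))
print(gate_diagnostics(collapsed))
```

Near-zero variance and an effective rank of zero on gates from a trained model would indicate the collapse to LA behavior that the referee describes. Non-trivial values would show that the gates vary, though not by themselves that the variation is what produces the loss improvement.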

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We address each major comment below and agree that targeted additions will strengthen the attribution of results to the claimed mechanism.

Point-by-point responses
  1. Referee: Theoretical separation claims (abstract and theory section): the strict containment MoA ⊃ LA is asserted to arise from input-dependent nonlinear hybridization. However, this separation is realized only if the learned gates vary meaningfully across tokens and the resulting hybrids lie outside the LA function class. The manuscript provides no post-training analysis of gate statistics, variance, or effective rank, leaving open the possibility that gradient dynamics cause gates to converge to near-constant values and collapse MoA to LA behavior. This directly undermines attribution of empirical gains to the theoretical separation.

    Authors: The theory section proves strict finite-width containment (fixed ⊂ LA ⊂ MoA) via explicit constructions showing input-dependent hybridization can realize functions outside the LA class. We agree that confirming non-degenerate gate behavior in trained models would strengthen the link to empirical gains. In revision we will add post-training gate statistics (variance, per-token diversity, effective rank) on the reported runs.
    Revision: yes

  2. Referee: Empirical evaluation (experiments section): the claim of consistent gains 'across scales, optimizers, and token budgets' is load-bearing for the practical contribution. Without ablations that isolate gate-induced hybridization (e.g., freezing gates to constants, measuring per-token activation diversity, or comparing against an LA baseline with matched parameter count), it is impossible to rule out that observed improvements stem from other factors such as implicit regularization or optimization landscape changes rather than the asserted extra expressivity.

    Authors: The experiments show MoA outperforming tuned fixed-activation baselines across the stated range. LA is introduced primarily as the input-independent theoretical counterpart; direct matched-parameter LA comparisons and gate-freezing ablations are absent. We will add both in revision (LA baselines and controlled gate-freezing runs) to better isolate the contribution of input-dependent mixing.
    Revision: yes
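At the level of a single forward pass, the gate-freezing ablation discussed above can be illustrated as follows. This is a sketch under assumptions, not the authors' planned experiment: the names `mix_ffn` and `w_gate`, the two-function dictionary, and the softmax gate are hypothetical. It checks only that replacing per-token gates with their mean turns the layer into an LA layer with those coefficients, and that the token-adaptive output differs from the frozen one.

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def silu(z):
    return z / (1.0 + np.exp(-z))

ACTIVATIONS = (relu, silu)  # hypothetical two-function dictionary

def softmax(logits):
    shifted = logits - logits.max(axis=-1, keepdims=True)
    e = np.exp(shifted)
    return e / e.sum(axis=-1, keepdims=True)

def mix_ffn(x, w_in, w_out, gates):
    """FFN with mixed activations; gates is (tokens, K) per token or (K,) shared."""
    h = x @ w_in
    acts = np.stack([act(h) for act in ACTIVATIONS], axis=-1)   # (tokens, hidden, K)
    gates = np.broadcast_to(gates, (x.shape[0], len(ACTIVATIONS)))
    return np.einsum("thk,tk->th", acts, gates) @ w_out

rng = np.random.default_rng(1)
d_model, d_hidden, n_tokens = 6, 16, 10
x = rng.normal(size=(n_tokens, d_model))
w_in = rng.normal(size=(d_model, d_hidden)) / np.sqrt(d_model)
w_out = rng.normal(size=(d_hidden, d_model)) / np.sqrt(d_hidden)
w_gate = rng.normal(size=(d_model, len(ACTIVATIONS)))

token_gates = softmax(x @ w_gate)        # input-dependent gates (MoA-style)
frozen_gates = token_gates.mean(axis=0)  # ablation: one constant mixture for all tokens

y_moa = mix_ffn(x, w_in, w_out, token_gates)
y_frozen = mix_ffn(x, w_in, w_out, frozen_gates)

# LA written out directly with coefficients alpha = frozen_gates.
h = x @ w_in
y_la = (frozen_gates[0] * relu(h) + frozen_gates[1] * silu(h)) @ w_out

print(np.allclose(y_frozen, y_la))            # frozen-gate MoA coincides with LA
gap = float(np.abs(y_moa - y_frozen).max())   # how far token-adaptive mixing moves the output
print(gap > 0.0)
```

A nonzero gap here shows only that the two layers compute different functions for these random weights. Attributing a loss improvement to input-dependent mixing would still require the matched-parameter training comparisons the authors commit to.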

Circularity Check

0 steps flagged

No significant circularity in theoretical separations or empirical results

full rationale

The paper's central claims consist of a theoretical proof of strict finite-width expressive separations (LA contains fixed FFNs; MoA contains LA via input-dependent hybridization) and independent empirical observations of lower terminal loss and better scaling in pre-training runs. No equation, definition, or self-citation in the abstract or described content reduces any claimed separation or performance gain to a fitted quantity defined by the same experiment. The theoretical hierarchy is presented as an external mathematical result, and the empirical gains are reported as observations under varied training conditions, making the derivation self-contained.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The central claim rests on the new MoA and LA constructions plus the finite-width separation theorems; no explicit free parameters, background axioms, or invented physical entities are stated in the abstract.

pith-pipeline@v0.9.1-grok · 5809 in / 1103 out tokens · 36481 ms · 2026-06-29T19:34:49.104473+00:00 · methodology
