More Expressive Feedforward Layers: Part I. Token-Adaptive Mixing of Activations

Jinbo Wang; Kai Shen; Mingze Wang; Shu Zhong; Yikuan Xia

arXiv: 2605.26647 · v1 · pith:KDOTSNB6new · submitted 2026-05-26 · 💻 cs.LG · cs.AI · stat.ML

Pith reviewed 2026-06-29 19:34 UTC · model grok-4.3

classification 💻 cs.LG · cs.AI · stat.ML

keywords: feedforward networks, activation functions, transformer expressivity, mixture models, language model pretraining, token-adaptive layers

The pith

Mixture of Activations strictly increases the expressive power of feedforward layers by making nonlinearity selection depend on the input token.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces Mixture of Activations (MoA), a feedforward design that mixes several activation functions per token through lightweight input-dependent gates while keeping the linear projections shared. It also defines Learnable Activations (LA) as the input-independent counterpart, which forms fixed linear combinations of the same functions. The theory proves strict finite-width separations: any fixed-activation FFN can be realized inside LA but not vice versa, and any LA can be realized inside MoA but not vice versa, with the extra power coming from token-specific nonlinear mixing. Pretraining runs on dense and MoE language models from 0.12B to 2B parameters show MoA reaching lower terminal loss and better scaling than tuned baselines with minimal parameter and compute overhead. The motivation is that FFN layers account for a large share of the parameters and nonlinear expressivity in Transformers.
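To make the mechanism concrete, here is a minimal sketch of one token passing through a ReLU-type MoA layer. It is an illustration under stated assumptions, not the authors' implementation: the gate form (softmax of a linear map giving one scalar weight per activation), the three-function dictionary, and all names (`moa_ffn`, `W_in`, `W_out`, `W_gate`) are hypothetical choices that this page does not specify.

```python
import math

# Hypothetical activation dictionary; the paper's actual choice is not given here.
def relu(z):
    return max(0.0, z)

def gelu(z):
    return 0.5 * z * (1.0 + math.erf(z / math.sqrt(2.0)))

def silu(z):
    return z / (1.0 + math.exp(-z))

ACTIVATIONS = [relu, gelu, silu]

def matvec(W, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def softmax(v):
    m = max(v)
    e = [math.exp(t - m) for t in v]
    s = sum(e)
    return [t / s for t in e]

def moa_ffn(x, W_in, W_out, W_gate):
    """One token through a ReLU-type FFN with token-dependent activation mixing."""
    gates = softmax(matvec(W_gate, x))  # assumed gate form: softmax of a linear map
    h = matvec(W_in, x)                 # shared up-projection
    mixed = [sum(g * act(hj) for g, act in zip(gates, ACTIVATIONS)) for hj in h]
    return matvec(W_out, mixed), gates  # shared down-projection

# Toy sizes: model width 2, hidden width 3, three activations.
W_in = [[1.0, -1.0], [0.5, 2.0], [-1.5, 0.5]]
W_out = [[1.0, 0.0, -1.0], [0.5, 0.5, 0.5]]
W_gate = [[2.0, 0.0], [0.0, 2.0], [-1.0, -1.0]]

y_a, gates_a = moa_ffn([1.0, 0.0], W_in, W_out, W_gate)
y_b, gates_b = moa_ffn([0.0, 1.0], W_in, W_out, W_gate)
print(gates_a, gates_b)  # the two tokens receive different mixing weights
```

The only parameters beyond a standard FFN in this sketch are the entries of `W_gate`; both projections are reused unchanged for every activation in the dictionary.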

Core claim

Mixture of Activations (MoA) mixes a dictionary of activation functions using lightweight input-dependent gates while sharing the same linear projections. This yields strict finite-width expressive separations where fixed-activation FFNs are contained in learnable activations (LA), which are contained in MoA. The added expressivity comes from input-dependent nonlinear hybridization. Pretraining experiments confirm lower terminal loss and better scaling.
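The easy direction of the containment chain can be checked numerically. The sketch below assumes a two-function dictionary, softmax gates with a bias term, and hypothetical helper names; none of these come from the paper. It shows that LA with a one-hot coefficient vector reproduces a fixed-ReLU FFN, and that MoA with zero gate weights reproduces LA. It does not demonstrate strictness, which is what the paper's theorems address.

```python
import math

def relu(z):
    return max(0.0, z)

def silu(z):
    return z / (1.0 + math.exp(-z))

ACTS = [relu, silu]  # hypothetical two-function dictionary

def matvec(W, x):
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def softmax(v):
    m = max(v)
    e = [math.exp(t - m) for t in v]
    s = sum(e)
    return [t / s for t in e]

def project(h, weights, W_out):
    """Mix the activations of h with the given weights, then down-project."""
    mixed = [sum(c * act(hj) for c, act in zip(weights, ACTS)) for hj in h]
    return matvec(W_out, mixed)

def fixed_ffn(x, W_in, W_out, act):
    return matvec(W_out, [act(hj) for hj in matvec(W_in, x)])

def la_ffn(x, W_in, W_out, coeffs):
    return project(matvec(W_in, x), coeffs, W_out)  # input-independent weights

def moa_ffn(x, W_in, W_out, W_gate, b_gate):
    logits = [l + b for l, b in zip(matvec(W_gate, x), b_gate)]
    return project(matvec(W_in, x), softmax(logits), W_out)  # token-dependent weights

W_in = [[1.0, -1.0], [0.5, 2.0]]
W_out = [[1.0, 0.5]]
x = [0.3, -0.8]

# LA with a one-hot coefficient vector is the fixed-ReLU FFN.
gap_fixed_la = abs(fixed_ffn(x, W_in, W_out, relu)[0]
                   - la_ffn(x, W_in, W_out, [1.0, 0.0])[0])

# MoA with zero gate weights and bias log(coeffs) has constant gates equal to coeffs.
coeffs = [0.3, 0.7]
zero_gate = [[0.0, 0.0], [0.0, 0.0]]
bias = [math.log(c) for c in coeffs]
gap_la_moa = abs(la_ffn(x, W_in, W_out, coeffs)[0]
                 - moa_ffn(x, W_in, W_out, zero_gate, bias)[0])
print(gap_fixed_la, gap_la_moa)  # both gaps are at floating-point noise level
```

One caveat of the assumed softmax gate: constant gates can only express positive coefficients that sum to one, so this particular parameterization covers convex LA combinations only. Whether the paper's gates are normalized this way is not stated on this page.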

What carries the argument

Mixture of Activations (MoA) with input-dependent gates that select and mix from multiple activation functions per token.

Load-bearing premise

The lightweight input-dependent gates realize genuine input-dependent nonlinear hybridization without optimization difficulties or capacity limits erasing the theoretical separation in practice.

What would settle it

A finite-width counterexample showing that some MoA network can be exactly reproduced by a fixed-activation FFN of comparable width, or a set of pretraining runs in which MoA fails to reach lower terminal loss than well-tuned fixed or LA baselines.

Original abstract

Feedforward network (FFN) layers account for a large fraction of parameters and nonlinear expressivity in Transformer-based large language models (LLMs). Despite the evolution from ReLU and GELU to gated variants such as SwiGLU, most FFN designs still use a single fixed activation function, applying the same nonlinear transformation to all tokens. In this work, we propose Mixture of Activations (MoA), a token-adaptive FFN design that mixes a dictionary of activation functions using lightweight input-dependent gates while sharing the same linear projections. As an input-independent counterpart, we also introduce learnable activations (LA), which form linear combinations of activation functions for both ReLU-type and SwiGLU-type FFNs. Theoretically, we establish strict finite-width expressive separations among fixed-activation FFNs, LA, and MoA: LA strictly contains fixed-activation FFNs, while MoA strictly contains LA, with the additional expressivity arising from input-dependent nonlinear hybridization. Empirically, we evaluate MoA through extensive pre-training experiments on dense and MoE language models ranging from 0.12B to 2B parameters under different token budgets, optimizers, and learning rate schedules. MoA consistently achieves lower terminal loss and exhibits more favorable scaling behavior than well-tuned baselines, with minimal parameter and computational overhead. These results suggest that token-adaptive activation mixing is a simple and effective mechanism for improving FFN expressivity in LLMs.
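For orientation, the three function classes named in the abstract can be written side by side for the ReLU-type case. The notation is an assumption made here for illustration ($W_1$, $W_2$ for the shared projections, $\sigma_k$ for the dictionary, $\alpha_k$ for LA coefficients, $g_k$ for MoA gates); the paper's own symbols, gate parameterization, and SwiGLU-type variant may differ.

```latex
\begin{align}
f_{\mathrm{fixed}}(x) &= W_2\,\sigma(W_1 x), \\
f_{\mathrm{LA}}(x)    &= W_2 \sum_{k=1}^{K} \alpha_k\,\sigma_k(W_1 x), \\
f_{\mathrm{MoA}}(x)   &= W_2 \sum_{k=1}^{K} g_k(x)\,\sigma_k(W_1 x).
\end{align}
```

In this notation, choosing $\alpha$ as a one-hot vector recovers a fixed-activation FFN whenever $\sigma$ is in the dictionary, and choosing constant gates $g_k(x) \equiv \alpha_k$ recovers LA, provided the gate parameterization admits those constants. The strict separations claimed in the abstract concern the reverse directions.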

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

MoA adds per-token activation mixing with clean finite-width separation theorems and reports consistent pretraining gains, but the practical value hinges on whether the gates actually stay input-dependent after training.

The colleague should know two things about this paper. First, it defines MoA as a lightweight way to mix a dictionary of activations per token while sharing the linear weights, plus an input-independent version called LA. Second, it claims strict finite-width separations (fixed activations inside LA inside MoA) and shows lower terminal loss than tuned baselines across 0.12B to 2B models.

What is new is the explicit construction of input-dependent nonlinear hybridization and the accompanying separation results. The experiments run across model scales, optimizers, and token budgets, which is more than many architecture notes manage. The overhead is described as small, and the empirical pattern is consistent enough to be worth noticing.
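A back-of-the-envelope count shows why the overhead could plausibly be small. The sketch assumes, hypothetically, that the gates are a single linear map from the model width to one logit per activation and that the FFN is a two-matrix ReLU-type block without biases; the sizes below are illustrative and are not taken from the paper.

```python
def gate_overhead(d_model, d_ff, num_activations):
    """Ratio of assumed gate parameters to FFN projection parameters."""
    ffn_params = 2 * d_model * d_ff          # up- and down-projection, biases ignored
    gate_params = num_activations * d_model  # assumed: one linear map d_model -> K
    return gate_params / ffn_params

# Illustrative sizes only: width 768, hidden width 3072, four activations.
ratio = gate_overhead(768, 3072, 4)
print(f"{ratio:.4%}")  # prints 0.0651%
```

Under these assumptions the ratio simplifies to `num_activations / (2 * d_ff)`, so it does not depend on the model width and shrinks as the hidden width grows.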

The soft spot is exactly the one flagged in the stress test. The separation theorems translate into practice only if the learned gates vary across tokens and produce hybrids that no LA can match. The paper does not appear to report gate statistics, variance across tokens, or ablations on gate degeneracy after training. If the gates collapse toward constants during training, the extra expressivity is not realized and the loss improvements could come from the added parameters or other side effects. That gap is material because the central claim rests on it.
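The kind of gate report that would close this gap can be sketched in a few lines. The function below is a hypothetical diagnostic, assuming each token yields a gate vector that sums to one; the names and the two toy batches are illustrative and do not come from the paper.

```python
import math

def gate_diagnostics(gates_per_token):
    """Summarize how much per-token gate vectors vary across a batch."""
    n = len(gates_per_token)
    k = len(gates_per_token[0])
    means = [sum(g[j] for g in gates_per_token) / n for j in range(k)]
    variances = [sum((g[j] - means[j]) ** 2 for g in gates_per_token) / n
                 for j in range(k)]
    # Entropy of each token's gate vector: low means near one-hot selection.
    entropies = [-sum(p * math.log(p) for p in g if p > 0) for g in gates_per_token]
    return {"mean": means, "variance": variances, "mean_entropy": sum(entropies) / n}

# Toy batches: identical gates for every token versus gates that differ by token.
collapsed = [[0.5, 0.5]] * 4
varied = [[0.9, 0.1], [0.1, 0.9], [0.5, 0.5], [0.7, 0.3]]

collapsed_stats = gate_diagnostics(collapsed)
varied_stats = gate_diagnostics(varied)
print(collapsed_stats["variance"], varied_stats["variance"])
```

Near-zero variance across tokens in every gate coordinate would indicate that the layer is behaving like LA on that batch, whatever the mean gate values are.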

The math and citation pattern look standard for the subfield. No obvious circularity in the claims.

This paper is for people who work on small, testable changes to FFN blocks in LLMs. A reader who cares about expressivity hierarchies will find the theorems useful even if the empirical attribution stays open. It deserves a serious referee because the idea is concrete, the experiments are broad enough to be informative, and the theoretical part is checkable.

Referee Report

2 major / 2 minor

Summary. The paper proposes Mixture of Activations (MoA) for FFN layers, which mixes a dictionary of activation functions via lightweight input-dependent gates while sharing linear projections; it also introduces Learnable Activations (LA) as the input-independent linear-combination counterpart. It claims strict finite-width expressive separations (fixed-activation FFNs ⊂ LA ⊂ MoA) arising from input-dependent nonlinear hybridization, and reports that MoA yields lower terminal loss and better scaling than tuned baselines in pre-training runs on 0.12B–2B dense and MoE models across token budgets, optimizers, and schedules, with negligible overhead.

Significance. If the finite-width separations are realized by non-degenerate gates in trained models and the observed loss improvements are causally linked to the extra expressivity (rather than optimization artifacts or capacity differences), the approach offers a low-overhead route to increasing FFN expressivity. The empirical scope across scales and setups is a strength, but significance hinges on confirming that the claimed hybridization occurs in practice.

major comments (2)

[theoretical analysis] Theoretical separation claims (abstract and theory section): the strict containment MoA ⊃ LA is asserted to arise from input-dependent nonlinear hybridization. However, this separation is realized only if the learned gates vary meaningfully across tokens and the resulting hybrids lie outside the LA function class. The manuscript provides no post-training analysis of gate statistics, variance, or effective rank, leaving open the possibility that gradient dynamics cause gates to converge to near-constant values and collapse MoA to LA behavior. This directly undermines attribution of empirical gains to the theoretical separation.
[experiments] Empirical evaluation (experiments section): the claim of consistent gains 'across scales, optimizers, and token budgets' is load-bearing for the practical contribution. Without ablations that isolate gate-induced hybridization (e.g., freezing gates to constants, measuring per-token activation diversity, or comparing against an LA baseline with matched parameter count), it is impossible to rule out that observed improvements stem from other factors such as implicit regularization or optimization landscape changes rather than the asserted extra expressivity.
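One inexpensive, post-hoc version of the gate-freezing check compares a layer's outputs under its per-token gates with its outputs when every token uses the batch-mean gate, which is an LA-style layer with the same projections. This is a sketch under assumptions (softmax gates, a two-function dictionary, hypothetical names and toy weights). It measures a function-space gap on one batch and is not a substitute for retraining with frozen gates.

```python
import math

def relu(z):
    return max(0.0, z)

def silu(z):
    return z / (1.0 + math.exp(-z))

ACTS = [relu, silu]  # hypothetical two-function dictionary

def matvec(W, x):
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def softmax(v):
    m = max(v)
    e = [math.exp(t - m) for t in v]
    s = sum(e)
    return [t / s for t in e]

def project(h, weights, W_out):
    mixed = [sum(c * act(hj) for c, act in zip(weights, ACTS)) for hj in h]
    return matvec(W_out, mixed)

def frozen_gate_gap(tokens, W_in, W_out, W_gate):
    """Largest output change when per-token gates are replaced by their batch mean."""
    gates = [softmax(matvec(W_gate, x)) for x in tokens]
    k = len(gates[0])
    mean_gate = [sum(g[j] for g in gates) / len(gates) for j in range(k)]
    gap = 0.0
    for x, g in zip(tokens, gates):
        h = matvec(W_in, x)
        y_moa = project(h, g, W_out)             # token-dependent mixing
        y_frozen = project(h, mean_gate, W_out)  # constant mixing, LA-like
        gap = max(gap, max(abs(a - b) for a, b in zip(y_moa, y_frozen)))
    return gap

tokens = [[1.0, 0.0], [0.0, 1.0], [-1.0, 1.0]]
W_in = [[1.0, -1.0], [0.5, 2.0]]
W_out = [[1.0, 0.5]]

gap_active = frozen_gate_gap(tokens, W_in, W_out, [[2.0, -2.0], [-2.0, 2.0]])
gap_collapsed = frozen_gate_gap(tokens, W_in, W_out, [[0.0, 0.0], [0.0, 0.0]])
print(gap_active, gap_collapsed)
```

A gap close to zero on held-out tokens would suggest the trained gates contribute little beyond a fixed mixture, which is the degeneracy the referee is asking the authors to rule out.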

minor comments (2)

[methods] Notation for the gate functions and the dictionary of activations should be introduced with explicit equations early in the methods section to avoid ambiguity when comparing LA and MoA.
The title says 'Part I', but the manuscript contains no forward reference to planned follow-up work or limitations of the current scope.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We address each major comment below and agree that targeted additions will strengthen the attribution of results to the claimed mechanism.

Referee: Theoretical separation claims (abstract and theory section): the strict containment MoA ⊃ LA is asserted to arise from input-dependent nonlinear hybridization. However, this separation is realized only if the learned gates vary meaningfully across tokens and the resulting hybrids lie outside the LA function class. The manuscript provides no post-training analysis of gate statistics, variance, or effective rank, leaving open the possibility that gradient dynamics cause gates to converge to near-constant values and collapse MoA to LA behavior. This directly undermines attribution of empirical gains to the theoretical separation.

Authors: The theory section proves strict finite-width containment (fixed ⊂ LA ⊂ MoA) via explicit constructions showing input-dependent hybridization can realize functions outside the LA class. We agree that confirming non-degenerate gate behavior in trained models would strengthen the link to empirical gains. In revision we will add post-training gate statistics (variance, per-token diversity, effective rank) on the reported runs.

Revision: yes

Referee: Empirical evaluation (experiments section): the claim of consistent gains 'across scales, optimizers, and token budgets' is load-bearing for the practical contribution. Without ablations that isolate gate-induced hybridization (e.g., freezing gates to constants, measuring per-token activation diversity, or comparing against an LA baseline with matched parameter count), it is impossible to rule out that observed improvements stem from other factors such as implicit regularization or optimization landscape changes rather than the asserted extra expressivity.

Authors: The experiments show MoA outperforming tuned fixed-activation baselines across the stated range. LA is introduced primarily as the input-independent theoretical counterpart; direct matched-parameter LA comparisons and gate-freezing ablations are absent. We will add both in revision (LA baselines and controlled gate-freezing runs) to better isolate the contribution of input-dependent mixing.

Revision: yes

Circularity Check

0 steps flagged

No significant circularity in theoretical separations or empirical results

The paper's central claims consist of a theoretical proof of strict finite-width expressive separations (LA contains fixed FFNs; MoA contains LA via input-dependent hybridization) and independent empirical observations of lower terminal loss and better scaling in pre-training runs. No equation, definition, or self-citation in the abstract or described content reduces any claimed separation or performance gain to a fitted quantity defined by the same experiment. The theoretical hierarchy is presented as an external mathematical result, and the empirical gains are reported as observations under varied training conditions, making the derivation self-contained.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The central claim rests on the new MoA and LA constructions plus the finite-width separation theorems; no explicit free parameters, background axioms, or invented physical entities are stated in the abstract.

[1] [1]

Learning Activation Functions to Improve Deep Neural Networks

Forest Agostinelli, Matthew Hoffman, Peter Sadowski, and Pierre Baldi. Learning activation functions to improve deep neural networks.arXiv preprint arXiv:1412.6830, 2014

work page internal anchor Pith review Pith/arXiv arXiv 2014

[2] [2]

A survey on modern trainable activation functions.Neural Networks, 138:14–32, 2021

Andrea Apicella, Francesco Donnarumma, Francesco Isgrò, and Roberto Prevete. A survey on modern trainable activation functions.Neural Networks, 138:14–32, 2021

2021

[3] [3]

Breaking the curse of dimensionality with convex neural networks.The Journal of MachineLearning Research, 18(1):629–681, 2017

Francis Bach. Breaking the curse of dimensionality with convex neural networks.The Journal of MachineLearning Research, 18(1):629–681, 2017

2017

[4] [4]

Neural net approximation

Andrew R Barron. Neural net approximation. InProc. 7th YaleWorkshopon Adaptive and Learning Systems, volume 1, pages 69–72, 1992

1992

[5] [5]

Andrew R. Barron. Universal approximation bounds for superpositions of a sigmoidal function.IEEE Transactions on Information theory, 39(3):930–945, 1993

1993

[6] [6]

Fast and Accurate Deep Network Learning by Exponential Linear Units (ELUs)

Djork-Arné Clevert, Thomas Unterthiner, and Sepp Hochreiter. Fast and accurate deep network learning by exponential linear units (elus).arXiv preprint arXiv:1511.07289, 2015

work page internal anchor Pith review Pith/arXiv arXiv 2015

[7] [7]

Knowledge neurons in pretrained transformers

Damai Dai, Li Dong, Yaru Hao, Zhifang Sui, Baobao Chang, and Furu Wei. Knowledge neurons in pretrained transformers. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume1: Long Papers), pages 8493–8502, 2022

2022

[8] [8]

Dauphin, Angela Fan, Michael Auli, and David Grangier

Yann N. Dauphin, Angela Fan, Michael Auli, and David Grangier. Language modeling with gated convolutional networks. InProceedings of the 34th International Conference on Machine Learning, pages 933–941, 2017

2017

[9] [9]

Bert: Pre-training of deep bidirectional transformers for language understanding

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. Bert: Pre-training of deep bidirectional transformers for language understanding. InProceedings of NAACL-HLT, pages 4171–4186, 2019

2019

[10] [10]

Sigmoid-weighted linear units for neural network function approxi- mation in reinforcement learning.Neural Networks, 107:3–11, 2018

Stefan Elfwing, Eiji Uchibe, and Kenji Doya. Sigmoid-weighted linear units for neural network function approxi- mation in reinforcement learning.Neural Networks, 107:3–11, 2018

2018

[11] [11]

Deep sparse rectifier neural networks

Xavier Glorot, Antoine Bordes, and Yoshua Bengio. Deep sparse rectifier neural networks. InProceedings of the FourteenthInternational Conference on Artificial Intelligence and Statistics, pages 315–323, 2011

2011

[12] [12]

Goodfellow, David Warde-Farley, Mehdi Mirza, Aaron Courville, and Yoshua Bengio

Ian J. Goodfellow, David Warde-Farley, Mehdi Mirza, Aaron Courville, and Yoshua Bengio. Maxout networks. In Proceedings of the 30th International Conference on Machine Learning, pages 1319–1327, 2013

2013

[13] [13]

Learning activation functions: A new paradigm for understanding neural networks.arXiv preprint arXiv:1906.09529, 2019

Mohit Goyal, Rajan Goyal, and Brejesh Lall. Learning activation functions: A new paradigm for understanding neural networks.arXiv preprint arXiv:1906.09529, 2019

work page arXiv 1906

[14] [14]

Delving deep into rectifiers: Surpassing human-level performance on imagenet classification

Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Delving deep into rectifiers: Surpassing human-level performance on imagenet classification. InProceedings of the IEEE international conference on computer vision, pages 1026–1034, 2015

2015

[15] [15]

Masked autoencoders are scalable vision learners

Kaiming He, Xinlei Chen, Saining Xie, Yanghao Li, Piotr Dollár, and Ross Girshick. Masked autoencoders are scalable vision learners. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 16000–16009, 2022

2022

[16] [16]

Gaussian Error Linear Units (GELUs)

Dan Hendrycks and Kevin Gimpel. Gaussian error linear units (gelus).arXiv preprint arXiv:1606.08415, 2016

work page internal anchor Pith review Pith/arXiv arXiv 2016

[17] [17]

Training Compute-Optimal Large Language Models

Jordan Hoffmann, Sebastian Borgeaud, Arthur Mensch, Elena Buchatskaya, Trevor Cai, Eliza Rutherford, Diego de Las Casas, Lisa Anne Hendricks, Johannes Welbl, Aidan Clark, et al. Training compute-optimal large language models. arXiv preprint arXiv:2203.15556, 2022

work page internal anchor Pith review Pith/arXiv arXiv 2022

[18] [18]

MiniCPM: Unveiling the Potential of Small Language Models with Scalable Training Strategies

Shengding Hu, Yuge Tu, Xu Han, Chaoqun He, Ganqu Cui, Xiang Long, Zhi Zheng, Yewei Fang, Yuxiang Huang, Weilin Zhao, et al. Minicpm: Unveiling the potential of small language models with scalable training strategies. arXiv preprint arXiv:2404.06395, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024

[19] Robert A Jacobs, Michael I Jordan, Steven J Nowlan, and Geoffrey E Hinton. Adaptive mixtures of local experts. Neural Computation, 3(1):79–87, 1991.

[20] Michael I Jordan and Robert A Jacobs. Hierarchical mixtures of experts and the EM algorithm. Neural Computation, 6(2):181–214, 1994.

[21] Andrej Karpathy. nanoGPT. https://github.com/karpathy/nanoGPT, 2022.

[22] Keller Jordan et al. Muon optimizer. https://github.com/KellerJordan/Muon?tab=readme-ov-file, 2024.

[23] Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.

[24] Günter Klambauer, Thomas Unterthiner, Andreas Mayr, and Sepp Hochreiter. Self-normalizing neural networks. In Advances in Neural Information Processing Systems, 2017.

[25] Jingyuan Liu, Jianlin Su, Xingcheng Yao, Zhejun Jiang, Guokun Lai, Yulun Du, Yidao Qin, Weixin Xu, Enzhe Lu, Junjie Yan, et al. Muon is scalable for LLM training. arXiv preprint arXiv:2502.16982, 2025.

[26] Ziming Liu, Yixuan Wang, Sachin Vaidya, Fabian Ruehle, James Halverson, Marin Soljačić, Thomas Y Hou, and Max Tegmark. KAN: Kolmogorov-Arnold networks. arXiv preprint arXiv:2404.19756, 2024.

[27] Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101, 2017.

[28] Franco Manessi and Alessandro Rozza. Learning combinations of activation functions. arXiv preprint arXiv:1801.09403, 2018.

[29] Todor Mihaylov, Peter Clark, Tushar Khot, and Ashish Sabharwal. Can a suit of armor conduct electricity? A new dataset for open book question answering. arXiv preprint arXiv:1809.02789, 2018.

[30] Diganta Misra. Mish: A self regularized non-monotonic activation function. arXiv preprint arXiv:1908.08681, 2019.

[31] Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. Language models are unsupervised multitask learners. OpenAI blog, 1(8):9, 2019.

[32] Prajit Ramachandran, Barret Zoph, and Quoc V. Le. Searching for activation functions. In International Conference on Learning Representations Workshop, 2018.

[33] Keisuke Sakaguchi, Ronan Le Bras, Chandra Bhagavatula, and Yejin Choi. WinoGrande: An adversarial Winograd schema challenge at scale. Communications of the ACM, 64(9):99–106, 2021.

[34] Noam Shazeer. GLU variants improve transformer. arXiv preprint arXiv:2002.05202, 2020.

[35] Noam Shazeer, Azalia Mirhoseini, Krzysztof Maziarz, Andy Davis, Quoc Le, Geoffrey Hinton, and Jeff Dean. Outrageously large neural networks: The sparsely-gated mixture-of-experts layer. In International Conference on Learning Representations, 2017.

[36] Jianlin Su, Murtadha Ahmed, Yu Lu, Shengfeng Pan, Wen Bo, and Yunfeng Liu. RoFormer: Enhanced transformer with rotary position embedding. Neurocomputing, 568:127063, 2024.

[37] Leon René Sütfeld, Flemming Brieger, Holger Finger, Sonja Füllhase, and Gordon Pipa. Adaptive blending units: Trainable activation functions for deep neural networks. arXiv preprint arXiv:1806.10064, 2018.

[38] Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, et al. LLaMA: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971, 2023.

[39] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. Advances in Neural Information Processing Systems, 30, 2017.

[40] Mingze Wang and Weinan E. Understanding the expressive power and mechanisms of transformer for sequence modeling. Advances in Neural Information Processing Systems, 2024.

[41] Vikas Yadav, Steven Bethard, and Mihai Surdeanu. Quick and (not so) dirty: Unsupervised selection of justification sentences for multi-hop question answering. arXiv preprint arXiv:1911.07176, 2019.

[42] An Yang, Baosong Yang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Zhou, Chengpeng Li, Chengyuan Li, Dayiheng Liu, Fei Huang, et al. Qwen2 technical report. arXiv preprint arXiv:2407.10671, 2024.

[43] Rowan Zellers, Ari Holtzman, Yonatan Bisk, Ali Farhadi, and Yejin Choi. HellaSwag: Can a machine really finish your sentence? arXiv preprint arXiv:1905.07830, 2019.

[44] Xiaohua Zhai, Alexander Kolesnikov, Neil Houlsby, and Lucas Beyer. Scaling vision transformers. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 12104–12113, 2022.

[45] Zhengyan Zhang, Yixin Song, Guanghui Yu, Xu Han, Yankai Lin, Chaojun Xiao, Chenyang Song, Zhiyuan Liu, Zeyu Mi, and Maosong Sun. ReLU2 wins: Discovering efficient activation functions for sparse LLMs. arXiv preprint arXiv:2402.03804, 2024.

[46] Shu Zhong, Mingyu Xu, Tenglong Ao, and Guang Shi. Understanding transformer from the perspective of associative memory. arXiv preprint arXiv:2505.19488, 2025.

[47] Zhanchao Zhou, Tianyi Wu, Zhiyun Jiang, Fares Obeid, and Zhenzhong Lan. Value residual learning. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 28341–28356, 2025.

[48] Zhijian Zhuo, Ya Wang, Yutao Zeng, Xiaoqing Li, Xun Zhou, and Jinwen Ma. Polynomial composition activations: Unleashing the dynamics of large language models. arXiv preprint arXiv:2411.03884, 2024.

Appendix

A Experimental Details

A.1 Experimental Details for Section 5.2...