Activation-Free Backbones for Image Recognition: Polynomial Alternatives within MetaFormer-Style Vision Models

Grigorios G. Chrysos; Jeffrey Wang; Jonathan Gregory

arxiv: 2605.20839 · v1 · pith:PXX5CKPVnew · submitted 2026-05-20 · 💻 cs.CV · cs.LG

Activation-Free Backbones for Image Recognition: Polynomial Alternatives within MetaFormer-Style Vision Models

Jeffrey Wang , Jonathan Gregory , Grigorios G. Chrysos This is my paper

Pith reviewed 2026-05-21 04:44 UTC · model grok-4.3

classification 💻 cs.CV cs.LG

keywords activation-free backbonespolynomial networksMetaFormervision transformersimage classificationsemantic segmentation

0 comments

The pith

Vision backbones function effectively without pointwise activations by using polynomial alternatives.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that standard nonlinear activations like ReLU and GELU, along with softmax, are not essential in MetaFormer-style vision models. By replacing them with polynomial functions created through Hadamard products in MLPs, convolutions, and attention modules, the authors create PolyNeXt models. These models perform as well as or better than traditional activation-based versions on ImageNet classification, ADE20K segmentation, and robustness tests, while also surpassing earlier polynomial network designs at lower cost. A sympathetic reader would care because this challenges the assumption that specific nonlinearities are necessary for effective vision processing and suggests simpler, potentially more efficient architectures.

Core claim

Modern vision backbones do not require pointwise activations or exponential softmax as sources of nonlinearity. Instead, activation-free polynomial alternatives using Hadamard products for MLPs, convolutions, and attention can be integrated into MetaFormer-style architectures to produce models that match or exceed the performance of activation-based counterparts across scales on key vision tasks.

What carries the argument

Activation-free polynomial modules where Hadamard products replace standard nonlinearities to yield polynomial functions of the input, integrated into MetaFormer framework.

If this is right

PolyNeXt models achieve comparable or superior results on ImageNet classification across model scales.
They maintain strong performance on ADE20K semantic segmentation.
Improved or equivalent out-of-distribution robustness compared to standard models.
Substantial outperformance of prior polynomial networks with reduced computational cost.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Polynomial representations may capture sufficient complexity for image recognition without needing learned activation functions.
This approach could simplify deployment on hardware that benefits from polynomial computations.
Extensions might test these modules in other backbone families like Transformers or CNNs beyond MetaFormer.

Load-bearing premise

That Hadamard products can substitute for standard nonlinearities in MLPs, convolutions, and attention to create effective polynomial functions within MetaFormer architectures.

What would settle it

A direct comparison showing that PolyNeXt models underperform activation-based MetaFormer models by a significant margin on ImageNet top-1 accuracy at similar parameter counts or FLOPs.

Figures

Figures reproduced from arXiv: 2605.20839 by Grigorios G. Chrysos, Jeffrey Wang, Jonathan Gregory.

**Figure 1.** Figure 1: ImageNet-1K accuracy comparison. Our polynomial backbones (CPolyNeXt, APolyNeXt) consistently outperform MetaFormer instantiations (ConvFormer, CAFormer) and prior polynomial networks (MONet, DTTN) across model scales. Curves show accuracy vs. (a) parameters and (b) FLOPs. combine information across a particular dimension, either channels (features at each spatial location) or spatial positions. These ope… view at source ↗

**Figure 2.** Figure 2: Polynomial module primitives. We introduce activation-free replacements for three core vision operators: (a) PolyMLP for channel mixing, which forms a second-degree polynomial by Hadamard-multiplying two linear projections (with LayerNorm) before an output projection; (b) PolyConv for convolutional spatial mixing, which replaces the activation in separable convolutions with a Hadamard fusion of a coarse (d… view at source ↗

**Figure 3.** Figure 3: PolyNeXt cell and stabilization. A cell takes two inputs from the previous two cells, xt−1 and xt−2. In the multi-input skip module, learnable per-channel scalars s0 and s1 reweight xt−1 and xt−2 before summation and LayerNorm. The normalized state is processed by X repeated PolyNeXt stacks (spatial mixer then PolyMLP). Each residual branch uses Sigmoid-Scale gating, y = x + σ(λ) f(x), to bound residual up… view at source ↗

read the original abstract

Modern vision backbones treat pointwise activations (e.g., ReLU, GELU) and exponential softmax as essential sources of nonlinearity, but we demonstrate they are not required within MetaFormer-style vision backbones. We design activation-free polynomial alternatives for three core primitives (MLPs, convolutions, and attention), where Hadamard products replace standard nonlinearities to yield polynomial functions of the input. These modules integrate seamlessly into existing architectures: instantiated within MetaFormer, a modular framework for vision backbones, our PolyNeXt models match or exceed activation-based counterparts across model scales on ImageNet classification, ADE20K semantic segmentation, and out-of-distribution robustness. We also substantially outperform prior polynomial networks at reduced computational cost, showing that polynomial variants of standard modules beat complex custom architectures.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

PolyNeXt replaces activations and softmax with Hadamard-product polynomials in MetaFormer blocks and reports matching performance on ImageNet, ADE20K, and OOD tasks while beating prior polynomial nets at lower cost.

read the letter

The central result is that pointwise activations and exponential softmax are not required inside MetaFormer-style vision backbones. The authors give explicit polynomial replacements for the MLP, convolution, and attention primitives, all built from Hadamard products, and show that the resulting PolyNeXt models reach or exceed the accuracy of standard activation-based counterparts across scales on ImageNet classification, ADE20K segmentation, and out-of-distribution tests. They also improve on earlier polynomial networks while cutting compute. That combination of concrete module designs and multi-task empirical comparisons is the actual new piece here.

Referee Report

2 major / 2 minor

Summary. The paper proposes activation-free polynomial alternatives to standard nonlinearities (ReLU/GELU and softmax) within MetaFormer-style vision backbones. Hadamard products are used to construct polynomial functions for MLPs, convolutions, and attention modules. These are instantiated as PolyNeXt models that are claimed to match or exceed activation-based counterparts across scales on ImageNet classification, ADE20K semantic segmentation, and out-of-distribution robustness, while also outperforming prior polynomial networks at lower computational cost.

Significance. If the empirical claims hold under rigorous verification, the result would be significant for vision architecture design: it would demonstrate that pointwise activations and exponential normalization are not required for competitive performance in modular MetaFormer blocks. The multi-task evaluation (classification, segmentation, OOD) and direct comparison to both activation-based and prior polynomial baselines would strengthen the case for polynomial primitives as a viable, lower-complexity alternative. Reproducible code or parameter-free derivations would further elevate the contribution.

major comments (2)

[§3.2] §3.2 (Polynomial Attention): The replacement of softmax with a direct Hadamard-product polynomial on Q/K/V projections is presented without any explicit normalization, scaling factor, or LayerNorm placement to bound the scores or enforce summation to one. Standard attention relies on these properties for stability; their absence risks unbounded growth with input magnitude or depth, directly undermining the central claim of 'seamless integration' and matching performance on ImageNet/ADE20K. A concrete stability analysis or ablation on normalization variants is required.
[§4.1, Table 1] §4.1 and Table 1 (ImageNet results): The claim that PolyNeXt matches or exceeds activation-based MetaFormer variants is load-bearing, yet the manuscript supplies no details on training epochs, optimizer settings, data augmentation strength, or number of independent runs. Without these, it is impossible to determine whether the reported gains are statistically reliable or sensitive to hyperparameter choices that might implicitly re-introduce nonlinearity.

minor comments (2)

[§3] Notation for the polynomial degree and Hadamard-product order is introduced inconsistently between the MLP, convolution, and attention subsections; a single unified definition would improve clarity.
[Figure 2] Figure 2 (architecture diagram) does not annotate the exact placement of LayerNorm relative to the polynomial attention block, making it difficult to reproduce the claimed activation-free property.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback on our manuscript. We address each major comment below with point-by-point responses, indicating planned revisions to improve clarity and rigor while preserving the core contributions of activation-free polynomial modules in MetaFormer-style architectures.

read point-by-point responses

Referee: [§3.2] §3.2 (Polynomial Attention): The replacement of softmax with a direct Hadamard-product polynomial on Q/K/V projections is presented without any explicit normalization, scaling factor, or LayerNorm placement to bound the scores or enforce summation to one. Standard attention relies on these properties for stability; their absence risks unbounded growth with input magnitude or depth, directly undermining the central claim of 'seamless integration' and matching performance on ImageNet/ADE20K. A concrete stability analysis or ablation on normalization variants is required.

Authors: We appreciate the referee's emphasis on stability considerations for the polynomial attention module. Our design uses Hadamard products to construct a polynomial approximation that empirically replicates the selective weighting behavior of softmax while remaining entirely activation-free. The reported results across ImageNet, ADE20K, and robustness benchmarks were obtained with stable end-to-end training, suggesting that the polynomial form inherently limits explosive growth in practice for the tested model scales and depths. Nevertheless, to directly address the request, we will add a dedicated stability analysis subsection in §3.2. This will include mathematical bounds on the polynomial output range, gradient flow observations, and an ablation comparing variants with and without optional post-projection LayerNorm. These additions will strengthen the evidence for seamless integration without altering the activation-free property. revision: yes
Referee: [§4.1, Table 1] §4.1 and Table 1 (ImageNet results): The claim that PolyNeXt matches or exceeds activation-based MetaFormer variants is load-bearing, yet the manuscript supplies no details on training epochs, optimizer settings, data augmentation strength, or number of independent runs. Without these, it is impossible to determine whether the reported gains are statistically reliable or sensitive to hyperparameter choices that might implicitly re-introduce nonlinearity.

Authors: We agree that comprehensive experimental details are essential for reproducibility and for confirming that performance differences arise from the polynomial substitutions rather than training variations. All PolyNeXt and baseline MetaFormer models were trained for 300 epochs using the AdamW optimizer with the exact hyperparameter schedule and data augmentations from the original MetaFormer implementation (including RandAugment, Mixup, and CutMix at standard strengths). Results in Table 1 reflect single-run reporting consistent with prior MetaFormer papers, but we will expand §4.1 to include the full training recipe, optimizer settings, and augmentation details. We will also report results from three independent runs with mean and standard deviation to demonstrate statistical reliability. These revisions will eliminate any ambiguity regarding implicit nonlinearities introduced via hyperparameters. revision: yes

Circularity Check

0 steps flagged

No circularity; empirical design and validation of polynomial modules

full rationale

The paper proposes replacing pointwise activations and softmax with Hadamard-product polynomial alternatives for MLPs, convolutions, and attention, then instantiates these within the existing MetaFormer framework and reports empirical results on ImageNet, ADE20K, and OOD benchmarks. No derivation chain reduces a claimed prediction or uniqueness result to a fitted parameter or self-citation by construction; the central claims rest on experimental matching or exceeding of activation-based baselines rather than tautological definitions or internally forced quantities. The approach is self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

Only the abstract is available, so the ledger is limited to the core design assumption stated in the text.

axioms (1)

domain assumption Hadamard products can replace standard nonlinearities to yield polynomial functions of the input that maintain model performance
This premise underpins the activation-free module designs for MLPs, convolutions, and attention.

pith-pipeline@v0.9.0 · 5666 in / 1301 out tokens · 42683 ms · 2026-05-21T04:44:56.059861+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

13 extracted references · 13 canonical work pages · 2 internal anchors

[1]

(Leveled) Fully Homomorphic Encryption without Bootstrapping

ISSN 1942-3454. doi: 10.1145/2633600. URL https: //doi.org/10.1145/2633600. Center for High Throughput Computing. Center for high throughput computing,

work page doi:10.1145/2633600 1942
[2]

Generating Long Sequences with Sparse Transformers

Child, R., Gray, S., Radford, A., and Sutskever, I. Gen- erating long sequences with sparse transformers.arXiv preprint arXiv:1904.10509,

work page internal anchor Pith review Pith/arXiv arXiv 1904
[3]

The American Statistician 36(3a):153–157 Charoenphakdee N, Cui Z, Zhang Y, et al (2021) Classification with rejection based on cost- sensitive classification

doi: 10.1109/CVPR.2009.5206848. Dong, X., Bao, J., Chen, D., Zhang, W., Yu, N., Yuan, L., Chen, D., and Guo, B. Cswin transformer: A general vision transformer backbone with cross-shaped windows. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 12124–12134,

work page doi:10.1109/cvpr.2009.5206848 2009
[4]

URL https://doi.org/10

doi: 10.1109/ TPAMI.2023.3282631. URL https://doi.org/10. 1109/TPAMI.2023.3282631. Li, S., Wang, Z., Liu, Z., Tan, C., Lin, H., Wu, D., Chen, Z., Zheng, J., and Li, S. Z. Moganet: Multi-order gated aggregation network. InInternational Conference on Learning Representations (ICLR),

work page arXiv 2023
[5]

GLU Variants Improve Transformer

Shazeer, N. GLU variants improve transformer.arXiv preprint arXiv:2002.05202,

work page internal anchor Pith review Pith/arXiv arXiv 2002
[6]

Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A

URLhttps://arxiv.org/abs/2204.01697. Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser,Ł., and Polosukhin, I. Attention is all you need. InAdvances in Neural Information Pro- cessing Systems (NeurIPS), volume 30, pp. 5998–6008,

work page arXiv
[7]

C., and Xing, E

Wang, H., Ge, S., Lipton, Z. C., and Xing, E. P. Learning robust global representations by penaliz- ing local predictive power. InAdvances in Neu- ral Information Processing Systems 32 (NeurIPS 2019),

work page 2019
[8]

neurips.cc/paper/2019/hash/ 3eefceb8087e964f89c2d59e8a249915-Abstract

URL https://proceedings. neurips.cc/paper/2019/hash/ 3eefceb8087e964f89c2d59e8a249915-Abstract. html. Wang, W., Dai, J., Chen, Z., Huang, Z., Li, Z., Zhu, X., Hu, X., Lu, T., Lu, L., Li, H., Wang, X., and Qiao, Y . Intern- image: Exploring large-scale vision foundation models with deformable convolutions. InProceedings of the IEEE/CVF Conference on Comput...

work page 2019
[9]

12 Activation-Free Backbones for Image Recognition Linear DW Conv DW Conv Linear - Output Dimension PolyAttn Linear - Input Dimension DW Conv DW Conv Learnable scalar Polynomial Degree Norm - Hidden Dimension - Hidden Dimension Linear Linear - Output Dimension Standard Multi-Head Self-Attention Linear - Input Dimension Softmax - Hidden Dimension Linear - ...

work page 2024
[10]

and NASNet (Zoph et al., 2018), provide additional gradient pathways by connecting layers to multiple predecessors. Polynomial networks present particular stability challenges because Hadamard products can amplify activation magnitudes: unlike ReLU, which does not magnify outputs, the multiplication of two large values in a Hadamard product produces an ev...

work page 2018
[11]

We introduce Sigmoid-Scale, which bounds residual contributions through a sigmoid-parameterized learnable scalar, and adopt multi-input skip connections following NASNet

or architectural constraints (Cheng et al., 2024). We introduce Sigmoid-Scale, which bounds residual contributions through a sigmoid-parameterized learnable scalar, and adopt multi-input skip connections following NASNet. This lightweight approach preserves the core polynomial operations while enabling training at around 200 layers. C. Training Details C....

work page 2024
[12]

in several ways, with some modifications following MONet (Cheng et al., 2024). The key insight is that polynomial networks have higher capacity than their activation-based counterparts and therefore benefit from stronger regularization and modified optimization: • Batch size:We use 1024 instead of

work page 2024
[13]

Computational Cost D.1

Params (M) 5.5 Other components Same as CPolyNeXt-T Optimization and augmentation Input resolution32 2 322 642 Optimizer AdamW LR schedule Cosine decay Batch size 96 Learning rate 0.001 Weight decay 0.05 SVHN augmentation – No horizontal flip – Other hyperparameters Same as Table S8 (CPolyNeXt) D. Computational Cost D.1. Inference Throughput and Memory Ta...

work page 1900

[1] [1]

(Leveled) Fully Homomorphic Encryption without Bootstrapping

ISSN 1942-3454. doi: 10.1145/2633600. URL https: //doi.org/10.1145/2633600. Center for High Throughput Computing. Center for high throughput computing,

work page doi:10.1145/2633600 1942

[2] [2]

Generating Long Sequences with Sparse Transformers

Child, R., Gray, S., Radford, A., and Sutskever, I. Gen- erating long sequences with sparse transformers.arXiv preprint arXiv:1904.10509,

work page internal anchor Pith review Pith/arXiv arXiv 1904

[3] [3]

The American Statistician 36(3a):153–157 Charoenphakdee N, Cui Z, Zhang Y, et al (2021) Classification with rejection based on cost- sensitive classification

doi: 10.1109/CVPR.2009.5206848. Dong, X., Bao, J., Chen, D., Zhang, W., Yu, N., Yuan, L., Chen, D., and Guo, B. Cswin transformer: A general vision transformer backbone with cross-shaped windows. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 12124–12134,

work page doi:10.1109/cvpr.2009.5206848 2009

[4] [4]

URL https://doi.org/10

doi: 10.1109/ TPAMI.2023.3282631. URL https://doi.org/10. 1109/TPAMI.2023.3282631. Li, S., Wang, Z., Liu, Z., Tan, C., Lin, H., Wu, D., Chen, Z., Zheng, J., and Li, S. Z. Moganet: Multi-order gated aggregation network. InInternational Conference on Learning Representations (ICLR),

work page arXiv 2023

[5] [5]

GLU Variants Improve Transformer

Shazeer, N. GLU variants improve transformer.arXiv preprint arXiv:2002.05202,

work page internal anchor Pith review Pith/arXiv arXiv 2002

[6] [6]

Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A

URLhttps://arxiv.org/abs/2204.01697. Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser,Ł., and Polosukhin, I. Attention is all you need. InAdvances in Neural Information Pro- cessing Systems (NeurIPS), volume 30, pp. 5998–6008,

work page arXiv

[7] [7]

C., and Xing, E

Wang, H., Ge, S., Lipton, Z. C., and Xing, E. P. Learning robust global representations by penaliz- ing local predictive power. InAdvances in Neu- ral Information Processing Systems 32 (NeurIPS 2019),

work page 2019

[8] [8]

neurips.cc/paper/2019/hash/ 3eefceb8087e964f89c2d59e8a249915-Abstract

URL https://proceedings. neurips.cc/paper/2019/hash/ 3eefceb8087e964f89c2d59e8a249915-Abstract. html. Wang, W., Dai, J., Chen, Z., Huang, Z., Li, Z., Zhu, X., Hu, X., Lu, T., Lu, L., Li, H., Wang, X., and Qiao, Y . Intern- image: Exploring large-scale vision foundation models with deformable convolutions. InProceedings of the IEEE/CVF Conference on Comput...

work page 2019

[9] [9]

12 Activation-Free Backbones for Image Recognition Linear DW Conv DW Conv Linear - Output Dimension PolyAttn Linear - Input Dimension DW Conv DW Conv Learnable scalar Polynomial Degree Norm - Hidden Dimension - Hidden Dimension Linear Linear - Output Dimension Standard Multi-Head Self-Attention Linear - Input Dimension Softmax - Hidden Dimension Linear - ...

work page 2024

[10] [10]

and NASNet (Zoph et al., 2018), provide additional gradient pathways by connecting layers to multiple predecessors. Polynomial networks present particular stability challenges because Hadamard products can amplify activation magnitudes: unlike ReLU, which does not magnify outputs, the multiplication of two large values in a Hadamard product produces an ev...

work page 2018

[11] [11]

We introduce Sigmoid-Scale, which bounds residual contributions through a sigmoid-parameterized learnable scalar, and adopt multi-input skip connections following NASNet

or architectural constraints (Cheng et al., 2024). We introduce Sigmoid-Scale, which bounds residual contributions through a sigmoid-parameterized learnable scalar, and adopt multi-input skip connections following NASNet. This lightweight approach preserves the core polynomial operations while enabling training at around 200 layers. C. Training Details C....

work page 2024

[12] [12]

in several ways, with some modifications following MONet (Cheng et al., 2024). The key insight is that polynomial networks have higher capacity than their activation-based counterparts and therefore benefit from stronger regularization and modified optimization: • Batch size:We use 1024 instead of

work page 2024

[13] [13]

Computational Cost D.1

Params (M) 5.5 Other components Same as CPolyNeXt-T Optimization and augmentation Input resolution32 2 322 642 Optimizer AdamW LR schedule Cosine decay Batch size 96 Learning rate 0.001 Weight decay 0.05 SVHN augmentation – No horizontal flip – Other hyperparameters Same as Table S8 (CPolyNeXt) D. Computational Cost D.1. Inference Throughput and Memory Ta...

work page 1900