ZC-Swish: Stabilizing Deep BN-Free Networks for Edge and Micro-Batch Applications
Pith reviewed 2026-05-10 03:01 UTC · model grok-4.3
The pith
A zero-centered variant of Swish keeps deep networks stable without batch normalization by anchoring activation means near zero.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The paper claims that the non-zero-centered character of common activations drives compounding mean shifts and training failure in deep BN-free networks, and that ZC-Swish restores stable dynamics by anchoring those means near zero, yielding higher accuracy than standard Swish or ReLU at depth 16 and usable performance at depth 32.
What carries the argument
ZC-Swish, a parameterized activation function that dynamically anchors activation means near zero while preserving Swish's non-monotonic shape.
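The abstract does not spell out the parameterization (the referee flags this below), so the following is only a minimal sketch of the mean-anchoring idea: standard Swish shifted by a fixed offset chosen so that its output mean is near zero under a roughly unit-normal pre-activation. The class name, the offset scheme, and the unit-normal assumption are illustrative and not taken from the paper; the published "dynamic" anchoring may instead adapt the offset during training.

```python
import torch
import torch.nn as nn


class ZeroCenteredSwishSketch(nn.Module):
    """Illustrative sketch only: Swish shifted by a fixed offset so that its
    expected output is near zero when pre-activations are roughly N(0, 1).
    The actual ZC-Swish parameterization is not given in the abstract."""

    def __init__(self, num_samples: int = 1_000_000, seed: int = 0):
        super().__init__()
        g = torch.Generator().manual_seed(seed)
        z = torch.randn(num_samples, generator=g)
        # Monte Carlo estimate of E[z * sigmoid(z)] for z ~ N(0, 1).
        self.register_buffer("offset", (z * torch.sigmoid(z)).mean())

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Standard Swish (x * sigmoid(x)) minus the precomputed mean offset.
        return x * torch.sigmoid(x) - self.offset
```

Dropping such a module in wherever a convnet currently uses nn.SiLU or nn.ReLU matches the "drop-in replacement" framing; whether it matches the authors' actual function is exactly what the referee asks them to pin down.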
If this is right
- BN-free convolutional networks can be trained to depth 32 without collapse to random performance.
- ZC-Swish delivers the highest test accuracy at depth 16 among tested activations in the reported experiments.
- Layer-wise activation means and variances remain consistent across depths when ZC-Swish replaces standard Swish (a monitoring sketch follows this list).
- The function offers a parameter-efficient drop-in replacement that enables deeper architectures in memory-constrained or privacy-preserving settings.
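One quick way to probe the layer-wise claim in a replication is to hook the activation modules and log their output statistics; a minimal sketch follows, assuming a PyTorch model. The module types hooked and the reduction over the whole tensor are assumptions, since the paper's architecture is not described here.

```python
import torch
import torch.nn as nn


def collect_activation_stats(model: nn.Module, batch: torch.Tensor) -> dict:
    """Record the post-activation mean and variance of every hooked layer for
    one forward pass; drifting means at later layers would show the compounding
    shift the paper attributes to non-zero-centered activations."""
    stats, handles = {}, []

    def make_hook(name):
        def hook(_module, _inputs, output):
            stats[name] = (output.mean().item(), output.var().item())
        return hook

    for name, module in model.named_modules():
        # Hook whichever activation module the network uses (ZC-Swish included).
        if isinstance(module, (nn.SiLU, nn.ReLU)):
            handles.append(module.register_forward_hook(make_hook(name)))

    with torch.no_grad():
        model(batch)
    for h in handles:
        h.remove()
    return stats
```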
Where Pith is reading between the lines
- A similar zero-centering adjustment could be tested on other non-monotonic activations to see whether the same stability gain appears.
- The approach might reduce the need for auxiliary normalization layers in federated or medical-imaging pipelines that already avoid batch statistics.
- If the mean-anchoring mechanism generalizes, it could be combined with weight-initialization changes to push usable depth further in fully normalization-free models.
Load-bearing premise
The non-zero-centered property of standard activations is the dominant cause of instability in BN-free networks, and anchoring means near zero is enough to restore stable training without harming gradients or features.
What would settle it
Train identical depth-16 BN-free convolutional networks with standard Swish and with ZC-Swish on the same data and seeds; if standard Swish reaches comparable or higher accuracy and stable dynamics in repeated runs, the central claim does not hold.
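A minimal sketch of that falsification test, assuming PyTorch; `make_model`, `train_fn`, and `eval_fn` are placeholders for the paper's unspecified architecture, data, and training protocol, and only the activation differs between arms.

```python
import torch


def paired_run(make_model, activation_cls, seed: int, train_fn, eval_fn) -> float:
    """One arm of the test: same seed, data, and hyperparameters; only the
    activation class changes between the Swish and ZC-Swish arms."""
    torch.manual_seed(seed)
    model = make_model(activation=activation_cls())
    train_fn(model)        # identical optimizer, schedule, and data in both arms
    return eval_fn(model)  # test accuracy

# For each seed, compare e.g. paired_run(make_model, torch.nn.SiLU, seed, ...)
# against paired_run(make_model, ZeroCenteredSwishSketch, seed, ...); repeated
# parity or wins for standard Swish would undercut the central claim.
```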
Original abstract
Batch Normalization (BN) is a cornerstone of deep learning, yet it fundamentally breaks down in micro-batch regimes (e.g., 3D medical imaging) and non-IID Federated Learning. Removing BN from deep architectures, however, often leads to catastrophic training failures such as vanishing gradients and dying channels. We identify that standard activation functions, like Swish and ReLU, exacerbate this instability in BN-free networks due to their non-zero-centered nature, which causes compounding activation mean-shifts as network depth increases. In this technical communication, we propose Zero-Centered Swish (ZC-Swish), a drop-in activation function parameterized to dynamically anchor activation means near zero. Through targeted stress-testing on BN-free convolutional networks at depths 8, 16, and 32, we demonstrate that while standard Swish collapses to near-random performance at depth 16 and beyond, ZC-Swish maintains stable layer-wise activation dynamics and achieves the highest test accuracy at depth 16 (51.5%) with seed 42. ZC-Swish thus provides a robust, parameter-efficient solution for stabilizing deep networks in memory-constrained and privacy-preserving applications where traditional normalization is unviable.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript proposes Zero-Centered Swish (ZC-Swish) as a drop-in activation function for batch-normalization-free deep networks. It identifies the non-zero-centered nature of standard activations (Swish, ReLU) as the cause of compounding mean shifts and training collapse in deep BN-free convolutional networks, particularly at depths 16 and beyond. Through stress tests on networks of depths 8, 16, and 32, the authors claim ZC-Swish maintains stable layer-wise activation dynamics by dynamically anchoring means near zero, while standard Swish collapses to near-random performance; ZC-Swish achieves the highest reported test accuracy of 51.5% at depth 16 with seed 42.
Significance. If the empirical claims hold after proper controls and statistical validation, the work would address a practical barrier to training deep models without batch normalization in micro-batch regimes (e.g., 3D medical imaging) and non-IID federated learning on edge devices. A simple, parameter-efficient activation change could enable deeper BN-free architectures in memory-constrained settings where normalization layers are infeasible.
major comments (3)
- [Abstract and stress-test description] The experiments compare ZC-Swish only to standard Swish and do not include controls against other zero-centered activations (e.g., tanh or shifted ReLU) or against standard Swish followed by explicit per-layer mean subtraction. This is load-bearing for the central claim that non-zero-centered activations are the dominant instability driver and that the specific dynamic anchoring in ZC-Swish is what restores stable dynamics (a sketch of such controls follows this list).
- [Abstract] The 51.5% test accuracy at depth 16 is reported for a single seed (seed 42) with no variance across runs, standard deviations, or multiple independent trials. Without these, the result cannot reliably support claims of stable performance or superiority over standard Swish at depth 16.
- [Abstract] No experimental protocol, dataset, architecture details beyond depth, training hyperparameters, or full baseline comparisons are described. This prevents assessment of whether the reported accuracy is robust or reproducible.
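For concreteness, a minimal sketch of two of the controls the first comment asks for, assuming PyTorch; the shift constant and the axes used for mean subtraction are assumptions, since the referee's suggestion does not fix them.

```python
import torch
import torch.nn as nn


class ShiftedReLU(nn.Module):
    """Control: ReLU shifted down by a constant so its output mean moves
    toward zero; the shift value here is illustrative, not from the paper."""

    def __init__(self, shift: float = 0.5):
        super().__init__()
        self.shift = shift

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return torch.relu(x) - self.shift


class MeanSubtractedSwish(nn.Module):
    """Control: standard Swish followed by explicit per-sample mean
    subtraction, isolating plain zero-centering from ZC-Swish's own
    anchoring mechanism."""

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        y = x * torch.sigmoid(x)
        # Subtract the mean over all non-batch dimensions; other reductions
        # (per-channel, running estimates) are equally plausible controls.
        return y - y.mean(dim=tuple(range(1, y.dim())), keepdim=True)
```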
minor comments (1)
- The parameterization of ZC-Swish should be presented with an explicit equation or pseudocode in the main text to allow immediate implementation and verification.
Simulated Author's Rebuttal
We thank the referee for the thoughtful and constructive comments on our manuscript. We address each of the major comments point by point below, providing clarifications and outlining the revisions we plan to implement to strengthen the presentation and empirical support for our claims.
Point-by-point responses
- Referee: [Abstract and stress-test description] The experiments compare ZC-Swish only to standard Swish and do not include controls against other zero-centered activations (e.g., tanh or shifted ReLU) or against standard Swish followed by explicit per-layer mean subtraction. This is load-bearing for the central claim that non-zero-centered activations are the dominant instability driver and that the specific dynamic anchoring in ZC-Swish is what restores stable dynamics.
Authors: We agree that additional controls are necessary to fully substantiate our central claim regarding the role of non-zero-centered activations. Our current experiments show that ZC-Swish avoids the collapse seen with Swish, but to isolate whether the dynamic anchoring is superior to other zero-centering methods, we will include comparisons with tanh, shifted ReLU, and Swish with explicit per-layer mean subtraction in the revised stress tests at depths 8, 16, and 32. revision: yes
- Referee: [Abstract] The 51.5% test accuracy at depth 16 is reported for a single seed (seed 42) with no variance across runs, standard deviations, or multiple independent trials. Without these, the result cannot reliably support claims of stable performance or superiority over standard Swish at depth 16.
Authors: We recognize that single-seed results limit the generalizability of our findings. In the revised manuscript, we will perform the stress tests across multiple independent random seeds and report the mean test accuracies with standard deviations for ZC-Swish and the baselines at each network depth. revision: yes
- Referee: [Abstract] No experimental protocol, dataset, architecture details beyond depth, training hyperparameters, or full baseline comparisons are described. This prevents assessment of whether the reported accuracy is robust or reproducible.
Authors: The abstract is intended as a high-level summary. Comprehensive details on the experimental protocol, including the specific dataset, convolutional network architectures, training hyperparameters, and additional baseline comparisons, are provided in the Experiments and Methods sections of the full manuscript. To address this, we will incorporate a concise reference to the dataset and core settings within the abstract while maintaining its brevity. revision: partial
Circularity Check
No circularity: empirical proposal and stress tests are self-contained
Full rationale
The paper advances ZC-Swish as a parameterized activation that anchors means near zero and supports the claim solely through direct empirical stress tests on BN-free convnets at depths 8/16/32, reporting activation dynamics and test accuracy. No derivation, first-principles result, or prediction is offered that could reduce to the authors' own fitted quantities or definitions. No self-citations appear as load-bearing premises, and the work contains no mathematical chain, uniqueness theorem, or ansatz smuggled via prior work. The central result is therefore externally falsifiable via replication of the reported accuracy and dynamics comparisons.
Axiom & Free-Parameter Ledger
Reference graph
Works this paper leans on
- [1] Ioffe, S., Szegedy, C.: Batch normalization: Accelerating deep network training by reducing internal covariate shift. In: Proceedings of the International Conference on Machine Learning (ICML) (2015)
- [2] Ramachandran, P., Zoph, B., Le, Q.V.: Searching for activation functions. arXiv preprint arXiv:1710.05941 (2017)
- [3] Misra, D.: Mish: A self regularized non-monotonic activation function. In: Proceedings of the British Machine Vision Conference (BMVC) (2019). https://doi.org/10.5244/C.34.191
- [4] Wu, Y., He, K.: Group normalization. In: Proceedings of the 15th European Conference on Computer Vision (ECCV), part XIII, pp. 3–19. Munich (2018). https://doi.org/10.1007/978-3-030-01261-8_1
- [5] McMahan, H.B., Moore, E., Ramage, D., Hampson, S., Arcas, B.A.Y.: Communication-efficient learning of deep networks from decentralized data. In: Proceedings of the 20th International Conference on Artificial Intelligence and Statistics (AISTATS), JMLR: W&CP, vol. 54. Florida, USA (2017)
- [6] He, K., Zhang, X., Ren, S., Sun, J.: Delving deep into rectifiers: Surpassing human-level performance on ImageNet classification. In: Proceedings of the 2015 IEEE International Conference on Computer Vision (ICCV), pp. 1026–1034 (2015). https://doi.org/10.1109/ICCV.2015.12
- [7] Nair, V., Hinton, G.E.: Rectified linear units improve restricted Boltzmann machines. In: Proceedings of the 27th International Conference on Machine Learning (ICML), pp. 807–814. Haifa, Israel (2010)
- [8] Ba, J.L., Kiros, J.R., Hinton, G.E.: Layer normalization. arXiv preprint arXiv:1607.06450 (2016)
- [9] Zhang, H., Dauphin, Y.N., Ma, T.: Fixup initialization: Residual learning without normalization. In: Proceedings of the International Conference on Learning Representations (ICLR) (2019)
- [10] De, S., Smith, S.L.: Batch normalization biases residual blocks towards the identity function in deep networks. In: Proceedings of the 34th Conference on Neural Information Processing Systems (NeurIPS). Vancouver, Canada (2020)
- [11] Brock, A., De, S., Smith, S.L., Simonyan, K.: High-performance large-scale image recognition without normalization. In: Proceedings of the 38th International Conference on Machine Learning (ICML), PMLR 139 (2021)