ZC-Swish: Stabilizing Deep BN-Free Networks for Edge and Micro-Batch Applications
Pith reviewed 2026-05-10 03:01 UTC · model grok-4.3
The pith
A zero-centered variant of Swish keeps deep networks stable without batch normalization by anchoring activation means near zero.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The paper claims that the non-zero-centered character of common activations drives compounding mean shifts and training failure in deep BN-free networks, and that ZC-Swish restores stable dynamics by anchoring those means near zero, yielding higher accuracy than standard Swish or ReLU at depth 16 and usable performance at depth 32.
What carries the argument
ZC-Swish, a parameterized activation function that dynamically anchors activation means near zero while preserving Swish's non-monotonic shape.
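The abstract does not spell out the parameterization (the referee flags this below), so the following is only a minimal sketch of the mean-anchoring idea: standard Swish shifted by a fixed offset chosen so that its output mean is near zero under a roughly unit-normal pre-activation. The class name, the offset scheme, and the unit-normal assumption are illustrative and not taken from the paper; the published "dynamic" anchoring may instead adapt the offset during training.

```python
import torch
import torch.nn as nn


class ZeroCenteredSwishSketch(nn.Module):
    """Illustrative sketch only: Swish shifted by a fixed offset so that its
    expected output is near zero when pre-activations are roughly N(0, 1).
    The actual ZC-Swish parameterization is not given in the abstract."""

    def __init__(self, num_samples: int = 1_000_000, seed: int = 0):
        super().__init__()
        g = torch.Generator().manual_seed(seed)
        z = torch.randn(num_samples, generator=g)
        # Monte Carlo estimate of E[z * sigmoid(z)] for z ~ N(0, 1).
        self.register_buffer("offset", (z * torch.sigmoid(z)).mean())

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Standard Swish (x * sigmoid(x)) minus the precomputed mean offset.
        return x * torch.sigmoid(x) - self.offset
```

Dropping such a module in wherever a convnet currently uses nn.SiLU or nn.ReLU matches the "drop-in replacement" framing; whether it matches the authors' actual function is exactly what the referee asks them to pin down.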
If this is right
- BN-free convolutional networks can be trained to depth 32 without collapse to random performance.
- ZC-Swish delivers the highest test accuracy at depth 16 among tested activations in the reported experiments.
- Layer-wise activation means and variances remain consistent across depths when ZC-Swish replaces standard Swish (a monitoring sketch follows this list).
- The function offers a parameter-efficient drop-in replacement that enables deeper architectures in memory-constrained or privacy-preserving settings.
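One quick way to probe the layer-wise claim in a replication is to hook the activation modules and log their output statistics; a minimal sketch follows, assuming a PyTorch model. The module types hooked and the reduction over the whole tensor are assumptions, since the paper's architecture is not described here.

```python
import torch
import torch.nn as nn


def collect_activation_stats(model: nn.Module, batch: torch.Tensor) -> dict:
    """Record the post-activation mean and variance of every hooked layer for
    one forward pass; drifting means at later layers would show the compounding
    shift the paper attributes to non-zero-centered activations."""
    stats, handles = {}, []

    def make_hook(name):
        def hook(_module, _inputs, output):
            stats[name] = (output.mean().item(), output.var().item())
        return hook

    for name, module in model.named_modules():
        # Hook whichever activation module the network uses (ZC-Swish included).
        if isinstance(module, (nn.SiLU, nn.ReLU)):
            handles.append(module.register_forward_hook(make_hook(name)))

    with torch.no_grad():
        model(batch)
    for h in handles:
        h.remove()
    return stats
```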
Where Pith is reading between the lines
- A similar zero-centering adjustment could be tested on other non-monotonic activations to see whether the same stability gain appears.
- The approach might reduce the need for auxiliary normalization layers in federated or medical-imaging pipelines that already avoid batch statistics.
- If the mean-anchoring mechanism generalizes, it could be combined with weight-initialization changes to push usable depth further in fully normalization-free models.
Load-bearing premise
The non-zero-centered property of standard activations is the dominant cause of instability in BN-free networks, and anchoring means near zero is enough to restore stable training without harming gradients or features.
What would settle it
Train identical depth-16 BN-free convolutional networks with standard Swish and with ZC-Swish on the same data and seeds; if standard Swish reaches comparable or higher accuracy and stable dynamics in repeated runs, the central claim does not hold.
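A minimal sketch of that falsification test, assuming PyTorch; `make_model`, `train_fn`, and `eval_fn` are placeholders for the paper's unspecified architecture, data, and training protocol, and only the activation differs between arms.

```python
import torch


def paired_run(make_model, activation_cls, seed: int, train_fn, eval_fn) -> float:
    """One arm of the test: same seed, data, and hyperparameters; only the
    activation class changes between the Swish and ZC-Swish arms."""
    torch.manual_seed(seed)
    model = make_model(activation=activation_cls())
    train_fn(model)        # identical optimizer, schedule, and data in both arms
    return eval_fn(model)  # test accuracy

# For each seed, compare e.g. paired_run(make_model, torch.nn.SiLU, seed, ...)
# against paired_run(make_model, ZeroCenteredSwishSketch, seed, ...); repeated
# parity or wins for standard Swish would undercut the central claim.
```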
Original abstract
Batch Normalization (BN) is a cornerstone of deep learning, yet it fundamentally breaks down in micro-batch regimes (e.g., 3D medical imaging) and non-IID Federated Learning. Removing BN from deep architectures, however, often leads to catastrophic training failures such as vanishing gradients and dying channels. We identify that standard activation functions, like Swish and ReLU, exacerbate this instability in BN-free networks due to their non-zero-centered nature, which causes compounding activation mean-shifts as network depth increases. In this technical communication, we propose Zero-Centered Swish (ZC-Swish), a drop-in activation function parameterized to dynamically anchor activation means near zero. Through targeted stress-testing on BN-free convolutional networks at depths 8, 16, and 32, we demonstrate that while standard Swish collapses to near-random performance at depth 16 and beyond, ZC-Swish maintains stable layer-wise activation dynamics and achieves the highest test accuracy at depth 16 (51.5%) with seed 42. ZC-Swish thus provides a robust, parameter-efficient solution for stabilizing deep networks in memory-constrained and privacy-preserving applications where traditional normalization is unviable.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript proposes Zero-Centered Swish (ZC-Swish) as a drop-in activation function for batch-normalization-free deep networks. It identifies the non-zero-centered nature of standard activations (Swish, ReLU) as the cause of compounding mean shifts and training collapse in deep BN-free convolutional networks, particularly at depths 16 and beyond. Through stress tests on networks of depths 8, 16, and 32, the authors claim ZC-Swish maintains stable layer-wise activation dynamics by dynamically anchoring means near zero, while standard Swish collapses to near-random performance; ZC-Swish achieves the highest reported test accuracy of 51.5% at depth 16 with seed 42.
Significance. If the empirical claims hold after proper controls and statistical validation, the work would address a practical barrier to training deep models without batch normalization in micro-batch regimes (e.g., 3D medical imaging) and non-IID federated learning on edge devices. A simple, parameter-efficient activation change could enable deeper BN-free architectures in memory-constrained settings where normalization layers are infeasible.
major comments (3)
- [Abstract and stress-test description] The experiments compare ZC-Swish only to standard Swish and do not include controls against other zero-centered activations (e.g., tanh or shifted ReLU) or against standard Swish followed by explicit per-layer mean subtraction. This is load-bearing for the central claim that non-zero-centered activations are the dominant instability driver and that the specific dynamic anchoring in ZC-Swish is what restores stable dynamics (a sketch of such controls follows this list).
- [Abstract] The 51.5% test accuracy at depth 16 is reported for a single seed (seed 42) with no variance across runs, standard deviations, or multiple independent trials. Without these, the result cannot reliably support claims of stable performance or superiority over standard Swish at depth 16.
- [Abstract] No experimental protocol, dataset, architecture details beyond depth, training hyperparameters, or full baseline comparisons are described. This prevents assessment of whether the reported accuracy is robust or reproducible.
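For concreteness, a minimal sketch of two of the controls the first comment asks for, assuming PyTorch; the shift constant and the axes used for mean subtraction are assumptions, since the referee's suggestion does not fix them.

```python
import torch
import torch.nn as nn


class ShiftedReLU(nn.Module):
    """Control: ReLU shifted down by a constant so its output mean moves
    toward zero; the shift value here is illustrative, not from the paper."""

    def __init__(self, shift: float = 0.5):
        super().__init__()
        self.shift = shift

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return torch.relu(x) - self.shift


class MeanSubtractedSwish(nn.Module):
    """Control: standard Swish followed by explicit per-sample mean
    subtraction, isolating plain zero-centering from ZC-Swish's own
    anchoring mechanism."""

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        y = x * torch.sigmoid(x)
        # Subtract the mean over all non-batch dimensions; other reductions
        # (per-channel, running estimates) are equally plausible controls.
        return y - y.mean(dim=tuple(range(1, y.dim())), keepdim=True)
```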
minor comments (1)
- The parameterization of ZC-Swish should be presented with an explicit equation or pseudocode in the main text to allow immediate implementation and verification.
Simulated Author's Rebuttal
We thank the referee for the thoughtful and constructive comments on our manuscript. We address each of the major comments point by point below, providing clarifications and outlining the revisions we plan to implement to strengthen the presentation and empirical support for our claims.
Point-by-point responses
- Referee: [Abstract and stress-test description] The experiments compare ZC-Swish only to standard Swish and do not include controls against other zero-centered activations (e.g., tanh or shifted ReLU) or against standard Swish followed by explicit per-layer mean subtraction. This is load-bearing for the central claim that non-zero-centered activations are the dominant instability driver and that the specific dynamic anchoring in ZC-Swish is what restores stable dynamics.
Authors: We agree that additional controls are necessary to fully substantiate our central claim regarding the role of non-zero-centered activations. Our current experiments show that ZC-Swish avoids the collapse seen with Swish, but to isolate whether the dynamic anchoring is superior to other zero-centering methods, we will include comparisons with tanh, shifted ReLU, and Swish with explicit per-layer mean subtraction in the revised stress tests at depths 8, 16, and 32. revision: yes
- Referee: [Abstract] The 51.5% test accuracy at depth 16 is reported for a single seed (seed 42) with no variance across runs, standard deviations, or multiple independent trials. Without these, the result cannot reliably support claims of stable performance or superiority over standard Swish at depth 16.
Authors: We recognize that single-seed results limit the generalizability of our findings. In the revised manuscript, we will perform the stress tests across multiple independent random seeds and report the mean test accuracies with standard deviations for ZC-Swish and the baselines at each network depth. revision: yes
- Referee: [Abstract] No experimental protocol, dataset, architecture details beyond depth, training hyperparameters, or full baseline comparisons are described. This prevents assessment of whether the reported accuracy is robust or reproducible.
Authors: The abstract is intended as a high-level summary. Comprehensive details on the experimental protocol, including the specific dataset, convolutional network architectures, training hyperparameters, and additional baseline comparisons, are provided in the Experiments and Methods sections of the full manuscript. To address this, we will incorporate a concise reference to the dataset and core settings within the abstract while maintaining its brevity. revision: partial
Circularity Check
No circularity: empirical proposal and stress tests are self-contained
Full rationale
The paper advances ZC-Swish as a parameterized activation that anchors means near zero and supports the claim solely through direct empirical stress tests on BN-free convnets at depths 8/16/32, reporting activation dynamics and test accuracy. No derivation, first-principles result, or prediction is offered that could reduce to the authors' own fitted quantities or definitions. No self-citations appear as load-bearing premises, and the work contains no mathematical chain, uniqueness theorem, or ansatz smuggled via prior work. The central result is therefore externally falsifiable via replication of the reported accuracy and dynamics comparisons.
Axiom & Free-Parameter Ledger
Reference graph
Works this paper leans on
- [1] Ioffe, S., Szegedy, C.: Batch normalization: Accelerating deep network training by reducing internal covariate shift. In: Proceedings of the International Conference on Machine Learning (ICML) (2015)
- [2] Ramachandran, P., Zoph, B., Le, Q.V.: Searching for activation functions. arXiv preprint arXiv:1710.05941 (2017)
- [3] Misra, D.: Mish: A self regularized non-monotonic activation function. In: Proceedings of the British Machine Vision Conference (BMVC) (2019). https://doi.org/10.5244/C.34.191
- [4] Wu, Y., He, K.: Group normalization. In: Proceedings of the 15th European Conference on Computer Vision (ECCV), part XIII, pp. 3–19. Munich (2018). https://doi.org/10.1007/978-3-030-01261-8_1
- [5] McMahan, H.B., Moore, E., Ramage, D., Hampson, S., Arcas, B.A.Y.: Communication-efficient learning of deep networks from decentralized data. In: Proceedings of the 20th International Conference on Artificial Intelligence and Statistics (AISTATS), JMLR: W&CP, vol. 54. Florida, USA (2017)
- [6] He, K., Zhang, X., Ren, S., Sun, J.: Delving deep into rectifiers: Surpassing human-level performance on ImageNet classification. In: Proceedings of the 2015 IEEE International Conference on Computer Vision (ICCV), pp. 1026–1034 (2015). https://doi.org/10.1109/ICCV.2015.12
- [7] Nair, V., Hinton, G.E.: Rectified linear units improve restricted Boltzmann machines. In: Proceedings of the 27th International Conference on Machine Learning (ICML), pp. 807–814. Haifa, Israel (2010)
- [8] Ba, J.L., Kiros, J.R., Hinton, G.E.: Layer normalization. arXiv preprint arXiv:1607.06450 (2016)
- [9] Zhang, H., Dauphin, Y.N., Ma, T.: Fixup initialization: Residual learning without normalization. In: Proceedings of the International Conference on Learning Representations (ICLR) (2019)
- [10] De, S., Smith, S.L.: Batch normalization biases residual blocks towards the identity function in deep networks. In: Proceedings of the 34th Conference on Neural Information Processing Systems (NeurIPS). Vancouver, Canada (2020)
- [11] Brock, A., De, S., Smith, S.L., Simonyan, K.: High-performance large-scale image recognition without normalization. In: Proceedings of the 38th International Conference on Machine Learning (ICML), PMLR 139 (2021)