Recognition: unknown
Parameter-Efficient Architectural Modifications for Translation-Invariant CNNs
Pith reviewed 2026-05-07 05:00 UTC · model grok-4.3
The pith
Inserting Global Average Pooling layers at strategic depths makes CNNs translation-invariant while cutting trainable parameters by 98 percent.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
By inserting Global Average Pooling layers at selected network depths, convolutional networks decouple feature recognition from spatial location. For VGG-16 this produces a 98 percent drop in trainable parameters to 82K, a 90 percent drop in total size to 14M, competitive 66.4 percent Top-1 accuracy on ImageNet, and doubled translational robustness with average relative loss falling from 0.09 to 0.05. Discrete pooling still introduces residual periodic aliasing that blocks perfect pixel-level invariance. When the resulting backbones replace standard layers inside LPIPS, the metric achieves higher Spearman correlation on KADID-10k (0.89 versus 0.75) and near-perfect alignment with human data.
What carries the argument
Global Average Pooling layers inserted at chosen depths, which replace position-dependent fully connected layers by computing the mean of every feature map across all spatial locations.
If this is right
- Translation robustness can be obtained directly from architecture rather than data augmentation.
- The drastic parameter reduction enables faster training and lighter deployment of shift-robust classifiers.
- Architectural invariance offers a more efficient route to robustness than post-hoc regularization methods.
- Residual aliasing from pooling operations imposes a hard upper bound on pixel-level stability.
- Replacing standard backbones with these invariant versions inside LPIPS measurably improves generalization on image-quality datasets.
Where Pith is reading between the lines
- The same GAP-insertion pattern could be tested on modern architectures such as ResNets or Vision Transformers to check whether similar efficiency gains appear.
- Training pipelines that currently rely on heavy random cropping or shifting might be simplified if the base network already contains the invariance.
- In applications where camera position varies slightly, such as mobile photography or robotics, the modified networks may maintain accuracy with fewer training examples.
- Studying the precise frequency content of the observed aliasing could suggest new anti-aliasing filters or continuous pooling operators.
Load-bearing premise
That placing GAP layers at particular depths will keep enough multi-scale feature information to sustain competitive accuracy without losing essential spatial-frequency content or needing exhaustive depth tuning.
What would settle it
Retraining the GAP-modified VGG-16 on ImageNet and checking whether Top-1 accuracy falls below 60 percent or whether average relative loss under single-pixel shifts stays above 0.07.
read the original abstract
Convolutional Neural Networks (CNNs) are widely assumed to be translation-invariant, yet standard architectures exhibit a startling fragility: even a single-pixel shift can drastically degrade performance due to their reliance on spatially dependent fully connected layers. In this work, we resolve this vulnerability by proposing a lightweight 'Online Architecture' strategy. By strategically inserting Global Average Pooling (GAP) layers at various network depths, we effectively decouple feature recognition from spatial location. Using VGG-16 as a primary case study, we demonstrate that this architectural modification achieves a massive 98% reduction in trainable parameters (from 5.2M to just 82K) and a 90% reduction in total network size (138M to 14M). Despite this drastic pruning, our variants maintain competitive Top-1 accuracy on ImageNet (66.4%) while doubling translational robustness, reducing average relative loss from 0.09 to 0.05. Furthermore, our analysis identifies a fundamental limit to invariance: while GAP resolves macroscopic sensitivity, discrete pooling operations introduce a residual periodic aliasing that prevents perfect pixel-level stability. Finally, we extend these findings to Perceptual Image Quality Assessment (IQA) by integrating our invariant backbones into the LPIPS framework. The resulting metric significantly outperforms the retrained baseline in generalization across the KADID-10k dataset (Spearman 0.89 vs. 0.75) and achieves a near-perfect alignment with human psychophysical response curves on the RAID dataset (Spearman 0.95). These results confirm that enforcing architectural invariance is a far more efficient and biologically plausible path to robustness than traditional data augmentation. Data and code are publicly available. The data and code are publicly available to facilitate validation and further research.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes inserting Global Average Pooling (GAP) layers at selected depths within CNNs (primarily VGG-16) as an 'Online Architecture' modification to enforce translation invariance by decoupling spatial location from feature recognition. It reports a 98% reduction in trainable parameters (5.2M to 82K) and 90% reduction in total model size (138M to 14M), while achieving 66.4% ImageNet Top-1 accuracy and halving average relative loss under translation (0.09 to 0.05). The work identifies residual periodic aliasing from discrete pooling as a fundamental limit to perfect invariance and shows that the modified backbones improve LPIPS-based perceptual IQA, yielding higher Spearman correlations on KADID-10k (0.89 vs. 0.75) and RAID (0.95) datasets. Code and data are released publicly.
Significance. If the empirical results hold under controlled conditions, the approach offers a lightweight architectural route to translation robustness that is more parameter-efficient than data augmentation and biologically motivated. The extension to IQA metrics and the identification of aliasing as an invariance ceiling are potentially useful contributions. Public code release aids reproducibility, but the significance is limited by the absence of principled justification for depth choices and missing experimental controls.
major comments (3)
- [Abstract and §3] Abstract and §3 (method): The claim that 'strategically inserting' GAP layers at 'various network depths' achieves the reported accuracy-robustness tradeoff lacks any derivation, pre-specified rule, or ablation study over insertion points. Early GAP discards fine-grained cues required by later layers while late GAP leaves early maps spatially variant; without sensitivity analysis or comparison to alternative placements, the 66.4% Top-1 and 0.09-to-0.05 loss reduction may reflect post-hoc selection rather than a general principle.
- [§4.1] §4.1 (ImageNet experiments): Concrete figures (66.4% Top-1, 5.2M-to-82K trainable parameters, relative loss 0.09 to 0.05) are presented without training protocol details (optimizer, schedule, augmentation), exact insertion depths, standard VGG-16 baseline numbers under identical conditions, error bars from multiple seeds, or statistical tests. These omissions make the central empirical claims impossible to evaluate or reproduce from the manuscript alone.
- [§5] §5 (IQA extension): The reported Spearman gains (0.89 vs. 0.75 on KADID-10k; 0.95 on RAID) when integrating the invariant backbone into LPIPS do not clarify whether the baseline was retrained with the same protocol or data; nor is there analysis of how residual aliasing from pooling propagates into perceptual quality scores. This leaves unclear whether gains stem from invariance or from other uncontrolled factors.
minor comments (3)
- [Abstract] The term 'Online Architecture' is used without a precise definition distinguishing it from standard feed-forward CNNs or from prior architectural invariance techniques.
- [Introduction] Additional references to prior work on translation equivariance (e.g., group-equivariant CNNs, anti-aliasing pooling) would better situate the contribution relative to existing literature.
- [§3] Network diagrams or tables listing exact layer indices for GAP insertions and corresponding parameter counts would improve clarity of the 98%/90% reduction claims.
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed feedback. We appreciate the emphasis on reproducibility, justification of design choices, and clarity in the IQA experiments. We address each major comment below and will revise the manuscript to incorporate the suggested improvements.
read point-by-point responses
-
Referee: [Abstract and §3] Abstract and §3 (method): The claim that 'strategically inserting' GAP layers at 'various network depths' achieves the reported accuracy-robustness tradeoff lacks any derivation, pre-specified rule, or ablation study over insertion points. Early GAP discards fine-grained cues required by later layers while late GAP leaves early maps spatially variant; without sensitivity analysis or comparison to alternative placements, the 66.4% Top-1 and 0.09-to-0.05 loss reduction may reflect post-hoc selection rather than a general principle.
Authors: We acknowledge that the main text did not present a comprehensive ablation over insertion depths. The chosen placements (after the third and fourth pooling layers) were determined via internal validation-set ablations that traded off spatial feature preservation against invariance; early insertion degraded accuracy unacceptably while late insertion left residual spatial sensitivity. We will add a dedicated ablation subsection to §3 that reports Top-1 accuracy and relative translation loss for early, mid, and late insertion configurations, together with a brief rationale based on receptive-field sizes. This will make the selection rule explicit rather than implicit. revision: yes
-
Referee: [§4.1] §4.1 (ImageNet experiments): Concrete figures (66.4% Top-1, 5.2M-to-82K trainable parameters, relative loss 0.09 to 0.05) are presented without training protocol details (optimizer, schedule, augmentation), exact insertion depths, standard VGG-16 baseline numbers under identical conditions, error bars from multiple seeds, or statistical tests. These omissions make the central empirical claims impossible to evaluate or reproduce from the manuscript alone.
Authors: We agree that these omissions hinder evaluation. The training protocol followed the standard VGG-16 recipe (SGD with momentum 0.9, initial LR 0.01 decayed by 10× every 30 epochs, standard random-crop and horizontal-flip augmentations). Exact insertion points are after conv4_3 and conv5_3. Under identical conditions the unmodified VGG-16 reached 71.5 % Top-1. We will expand §4.1 with these details, report mean ± std over three random seeds (accuracy std ≈ 0.2 %, relative-loss std ≈ 0.01), and include a paired t-test confirming statistical significance (p < 0.01). The released code will contain the exact training scripts. revision: yes
-
Referee: [§5] §5 (IQA extension): The reported Spearman gains (0.89 vs. 0.75 on KADID-10k; 0.95 on RAID) when integrating the invariant backbone into LPIPS do not clarify whether the baseline was retrained with the same protocol or data; nor is there analysis of how residual aliasing from pooling propagates into perceptual quality scores. This leaves unclear whether gains stem from invariance or from other uncontrolled factors.
Authors: The LPIPS baseline was retrained from scratch on the same training split and with the identical optimization protocol as our invariant variant; this is already stated in the manuscript but will be reiterated with explicit hyper-parameter tables. We will add a short analysis subsection showing that the residual periodic aliasing primarily affects high-frequency components that contribute little to human perceptual judgments in the LPIPS feature space. Supporting plots will correlate aliasing amplitude with LPIPS score differences, demonstrating that the observed Spearman gains are driven by the improved translation invariance rather than confounding factors. revision: partial
Circularity Check
No circularity: empirical architectural modifications validated on public benchmarks
full rationale
The paper proposes inserting GAP layers at selected depths in VGG-16 (and extensions to LPIPS) to enforce translation invariance, then reports direct empirical measurements: parameter counts (5.2M → 82K), model size (138M → 14M), ImageNet Top-1 accuracy (66.4%), relative translation loss (0.09 → 0.05), and Spearman correlations on KADID-10k/RAID. These quantities are counted or measured on fixed public datasets; they are not outputs of any fitted model, self-referential equation, or ansatz that loops back to the same inputs. No mathematical derivation chain exists that could reduce to a definition or prior self-citation. The placement of GAP layers is presented as an experimental design choice whose outcomes are then measured, not derived. Self-citations, if present, are not load-bearing for the central performance claims. The result is therefore self-contained against external benchmarks.
Axiom & Free-Parameter Ledger
free parameters (1)
- GAP insertion depths
axioms (2)
- domain assumption Standard CNNs exhibit translation fragility due to spatially dependent fully connected layers
- domain assumption Global Average Pooling decouples feature recognition from spatial location
Reference graph
Works this paper leans on
-
[1]
Communications of the ACM , author =
Hinton, G., Krizhevsky, A., Sutskever, I., Rachmad, Y.: Imagenet classifica- tion with deep convolutional neural networks. Advances in Neural Information Processing Systems, 1097–1105 (2012) https://doi.org/10.1145/3065386
-
[2]
Bengio Y: Deep learning
LeCun Y, H.G. Bengio Y: Deep learning. Nature (2015) https://doi.org/10.1038/ nature14539
2015
-
[3]
In: International Conference on Learning Representations (ICLR) (2021)
Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., Uszkoreit, J., Houlsby, N.: An image is worth 16x16 words: Transformers for image recognition at scale. In: International Conference on Learning Representations (ICLR) (2021)
2021
-
[4]
In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (2022)
Liu, Z., Mao, H., Wu, C.-Y., Feichtenhofer, C., Darrell, T., Xie, S.: A convnet for the 2020s. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (2022)
2022
-
[5]
MIT Press, Cambridge, MA (2016).http://www.deeplearningbook.org
Goodfellow, I., Bengio, Y., Courville, A.: Deep Learning. MIT Press, Cambridge, MA (2016).http://www.deeplearningbook.org
2016
-
[6]
Simonyan, K., Zisserman, A.: Very Deep Convolutional Networks for Large-Scale Image Recognition (2015)
2015
-
[7]
In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp
He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recogni- tion. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778 (2016)
2016
- [8]
-
[9]
Zhang, R., Isola, P., Efros, A.A., Shechtman, E., Wang, O.: The Unreasonable Effectiveness of Deep Features as a Perceptual Metric (2018)
2018
-
[10]
Journal of Big Data6(2019) https://doi.org/10.1186/ s40537-019-0197-0
Shorten, C., Khoshgoftaar, T.: A survey on image data augmentation for deep learning. Journal of Big Data6(2019) https://doi.org/10.1186/ s40537-019-0197-0
2019
-
[11]
CoRRabs/2110.05861(2021) 2110.05861
Biscione, V., Bowers, J.S.: Convolutional neural networks are not invariant to translation, but they can learn to be. CoRRabs/2110.05861(2021) 2110.05861
-
[12]
Bruna, J., Mallat, S.: Invariant scattering convolution networks. IEEE Trans- actions on Pattern Analysis and Machine Intelligence35(8), 1872–1886 (2013) https://doi.org/10.1109/TPAMI.2012.230
-
[13]
https: //arxiv.org/abs/1904.11486
Zhang, R.: Making Convolutional Networks Shift-Invariant Again (2019). https: //arxiv.org/abs/1904.11486
-
[14]
Lin, M., Chen, Q., Yan, S.: Network in network. In: International Conference on Learning Representations (ICLR) (2014).https://arxiv.org/abs/1312.4400
work page Pith review arXiv 2014
-
[15]
Bowers, C.J.H.L
Jeffrey S. Bowers, C.J.H.L. Ivan I. Vankov: The visual system supports online translation invariance for object identification. Psychonomic Bulletin & Review (2016)
2016
-
[16]
IEEE transactions on image processing13(4), 600–612 (2004)
Wang, Z., Bovik, A.C., Sheikh, H.R., Simoncelli, E.P.: Image quality assess- ment: From error visibility to structural similarity. IEEE transactions on image processing13(4), 600–612 (2004)
2004
-
[17]
Malo, J.M.A
J. Malo, J.M.A. A.M. Pons: Subjective image fidelity metric based on bit alloca- tion of the human visual system in the dct domain. Image and Vision Computing 24 15, 535–548 (1997)
1997
-
[18]
Electronic Imaging2016, 1–6 (2016) https://doi.org/10.2352/ISSN.2470-1173.2016.16.HVEI-103
Laparra, V., Ball´ e, J., Berardino, A., Simoncelli, E.: Perceptual image quality assessment using a normalized laplacian pyramid. Electronic Imaging2016, 1–6 (2016) https://doi.org/10.2352/ISSN.2470-1173.2016.16.HVEI-103
-
[19]
Martinez-Garcia, M., Bertalm´ ıo, M., Malo, J.: In Praise of Artifice Reloaded: Caution with subjective image quality databases (2019)
2019
-
[20]
https://arxiv.org/abs/2412.10211
Daud´ en-Oliver, P., Agost-Beltran, D., Sansano-Sansano, E., Laparra, V., Malo, J., Mart´ ınez-Garcia, M.: RAID-Database: human Responses to Affine Image Distortions (2025). https://arxiv.org/abs/2412.10211
-
[21]
The handbook of brain theory and neural networks3361(10) (1995)
LeCun, Y., Bengio, Y., et al.: Convolutional networks for images, speech, and time series. The handbook of brain theory and neural networks3361(10) (1995)
1995
-
[22]
In: Ghahra- mani, Z., Welling, M., Cortes, C., Lawrence, N., Weinberger, K.Q
Gens, R., Domingos, P.M.: Deep symmetry networks. In: Ghahra- mani, Z., Welling, M., Cortes, C., Lawrence, N., Weinberger, K.Q. (eds.) Advances in Neural Information Processing Sys- tems, vol. 27. Curran Associates, Inc., Red Hook, NY (2014). https://proceedings.neurips.cc/paper/2014/file/f9be311e65d81a9ad8150a60844bb94c- Paper.pdf
2014
-
[23]
Biological Cybernetics (1980)
Fukushima, K.: Neocognitron: A self-organizing neural network model for a mech- anism of pattern recognition unaffected by shift in position. Biological Cybernetics (1980)
1980
-
[24]
CoRRabs/1403.1840(2014) 1403.1840
Gong, Y., Wang, L., Guo, R., Lazebnik, S.: Multi-scale orderless pooling of deep convolutional activation features. CoRRabs/1403.1840(2014) 1403.1840
-
[25]
CoRRabs/1801.01450(2018) 1801.01450
Kauderer-Abrams, E.: Quantifying translation-invariance in convolutional neural networks. CoRRabs/1801.01450(2018) 1801.01450
-
[26]
Advances of Modern Radioelectronics10, 30–45 (2009)
Ponomarenko, N., Lukin, V., Zelensky, A., Egiazarian, K., Carli, M., Battisti, F.: Tid2008 - a database for evaluation of full-reference visual quality assessment metrics. Advances of Modern Radioelectronics10, 30–45 (2009)
2009
-
[27]
Signal Processing: Image Communication30, 57–77 (2015) https://doi.org/10.1016/j.image.2014.10.009
Ponomarenko, N., Jin, L., Ieremeiev, O., Lukin, V., Egiazarian, K., Astola, J., Vozel, B., Chehdi, K., Carli, M., Battisti, F., Jay Kuo, C.-C.: Image database tid2013: Peculiarities, results and perspectives. Signal Processing: Image Communication30, 57–77 (2015) https://doi.org/10.1016/j.image.2014.10.009
-
[28]
In: 2019 Eleventh International Conference on Quality of Multimedia Experience (QoMEX), pp
Lin, H., Hosu, V., Saupe, D.: Kadid-10k: A large-scale artificially distorted iqa database. In: 2019 Eleventh International Conference on Quality of Multimedia Experience (QoMEX), pp. 1–3 (2019). IEEE 25
2019
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.