arxiv: 2604.27870 · v1 · submitted 2026-04-30 · 💻 cs.CV

Recognition: unknown

Parameter-Efficient Architectural Modifications for Translation-Invariant CNNs

Nuria Alabau-Bosque , Jorge Vila-Tomas , Paula Dauden-Oliver , Valero Laparra , Jesus Malo

Authors on Pith no claims yet

Pith reviewed 2026-05-07 05:00 UTC · model grok-4.3

classification 💻 cs.CV

keywords convolutional neural networkstranslation invarianceglobal average poolingparameter efficiencyimage classificationperceptual image qualityVGG-16

0 comments

The pith

Inserting Global Average Pooling layers at strategic depths makes CNNs translation-invariant while cutting trainable parameters by 98 percent.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Standard CNNs like VGG-16 suffer large accuracy drops from even single-pixel image shifts because their final layers remain tied to exact spatial positions. The paper demonstrates that inserting Global Average Pooling layers at multiple depths removes this position dependence by averaging each feature map across space. On VGG-16 this change reduces trainable parameters from 5.2 million to 82 thousand and total model size from 138 million to 14 million. The smaller network still reaches 66.4 percent Top-1 accuracy on ImageNet and halves the performance loss under shifts. The same modification also improves perceptual image-quality metrics when used inside the LPIPS framework.

Core claim

By inserting Global Average Pooling layers at selected network depths, convolutional networks decouple feature recognition from spatial location. For VGG-16 this produces a 98 percent drop in trainable parameters to 82K, a 90 percent drop in total size to 14M, competitive 66.4 percent Top-1 accuracy on ImageNet, and doubled translational robustness with average relative loss falling from 0.09 to 0.05. Discrete pooling still introduces residual periodic aliasing that blocks perfect pixel-level invariance. When the resulting backbones replace standard layers inside LPIPS, the metric achieves higher Spearman correlation on KADID-10k (0.89 versus 0.75) and near-perfect alignment with human data.

What carries the argument

Global Average Pooling layers inserted at chosen depths, which replace position-dependent fully connected layers by computing the mean of every feature map across all spatial locations.

If this is right

Translation robustness can be obtained directly from architecture rather than data augmentation.
The drastic parameter reduction enables faster training and lighter deployment of shift-robust classifiers.
Architectural invariance offers a more efficient route to robustness than post-hoc regularization methods.
Residual aliasing from pooling operations imposes a hard upper bound on pixel-level stability.
Replacing standard backbones with these invariant versions inside LPIPS measurably improves generalization on image-quality datasets.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same GAP-insertion pattern could be tested on modern architectures such as ResNets or Vision Transformers to check whether similar efficiency gains appear.
Training pipelines that currently rely on heavy random cropping or shifting might be simplified if the base network already contains the invariance.
In applications where camera position varies slightly, such as mobile photography or robotics, the modified networks may maintain accuracy with fewer training examples.
Studying the precise frequency content of the observed aliasing could suggest new anti-aliasing filters or continuous pooling operators.

Load-bearing premise

That placing GAP layers at particular depths will keep enough multi-scale feature information to sustain competitive accuracy without losing essential spatial-frequency content or needing exhaustive depth tuning.

What would settle it

Retraining the GAP-modified VGG-16 on ImageNet and checking whether Top-1 accuracy falls below 60 percent or whether average relative loss under single-pixel shifts stays above 0.07.

read the original abstract

Convolutional Neural Networks (CNNs) are widely assumed to be translation-invariant, yet standard architectures exhibit a startling fragility: even a single-pixel shift can drastically degrade performance due to their reliance on spatially dependent fully connected layers. In this work, we resolve this vulnerability by proposing a lightweight 'Online Architecture' strategy. By strategically inserting Global Average Pooling (GAP) layers at various network depths, we effectively decouple feature recognition from spatial location. Using VGG-16 as a primary case study, we demonstrate that this architectural modification achieves a massive 98% reduction in trainable parameters (from 5.2M to just 82K) and a 90% reduction in total network size (138M to 14M). Despite this drastic pruning, our variants maintain competitive Top-1 accuracy on ImageNet (66.4%) while doubling translational robustness, reducing average relative loss from 0.09 to 0.05. Furthermore, our analysis identifies a fundamental limit to invariance: while GAP resolves macroscopic sensitivity, discrete pooling operations introduce a residual periodic aliasing that prevents perfect pixel-level stability. Finally, we extend these findings to Perceptual Image Quality Assessment (IQA) by integrating our invariant backbones into the LPIPS framework. The resulting metric significantly outperforms the retrained baseline in generalization across the KADID-10k dataset (Spearman 0.89 vs. 0.75) and achieves a near-perfect alignment with human psychophysical response curves on the RAID dataset (Spearman 0.95). These results confirm that enforcing architectural invariance is a far more efficient and biologically plausible path to robustness than traditional data augmentation. Data and code are publicly available. The data and code are publicly available to facilitate validation and further research.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 3 minor

Summary. The paper proposes inserting Global Average Pooling (GAP) layers at selected depths within CNNs (primarily VGG-16) as an 'Online Architecture' modification to enforce translation invariance by decoupling spatial location from feature recognition. It reports a 98% reduction in trainable parameters (5.2M to 82K) and 90% reduction in total model size (138M to 14M), while achieving 66.4% ImageNet Top-1 accuracy and halving average relative loss under translation (0.09 to 0.05). The work identifies residual periodic aliasing from discrete pooling as a fundamental limit to perfect invariance and shows that the modified backbones improve LPIPS-based perceptual IQA, yielding higher Spearman correlations on KADID-10k (0.89 vs. 0.75) and RAID (0.95) datasets. Code and data are released publicly.

Significance. If the empirical results hold under controlled conditions, the approach offers a lightweight architectural route to translation robustness that is more parameter-efficient than data augmentation and biologically motivated. The extension to IQA metrics and the identification of aliasing as an invariance ceiling are potentially useful contributions. Public code release aids reproducibility, but the significance is limited by the absence of principled justification for depth choices and missing experimental controls.

major comments (3)

[Abstract and §3] Abstract and §3 (method): The claim that 'strategically inserting' GAP layers at 'various network depths' achieves the reported accuracy-robustness tradeoff lacks any derivation, pre-specified rule, or ablation study over insertion points. Early GAP discards fine-grained cues required by later layers while late GAP leaves early maps spatially variant; without sensitivity analysis or comparison to alternative placements, the 66.4% Top-1 and 0.09-to-0.05 loss reduction may reflect post-hoc selection rather than a general principle.
[§4.1] §4.1 (ImageNet experiments): Concrete figures (66.4% Top-1, 5.2M-to-82K trainable parameters, relative loss 0.09 to 0.05) are presented without training protocol details (optimizer, schedule, augmentation), exact insertion depths, standard VGG-16 baseline numbers under identical conditions, error bars from multiple seeds, or statistical tests. These omissions make the central empirical claims impossible to evaluate or reproduce from the manuscript alone.
[§5] §5 (IQA extension): The reported Spearman gains (0.89 vs. 0.75 on KADID-10k; 0.95 on RAID) when integrating the invariant backbone into LPIPS do not clarify whether the baseline was retrained with the same protocol or data; nor is there analysis of how residual aliasing from pooling propagates into perceptual quality scores. This leaves unclear whether gains stem from invariance or from other uncontrolled factors.

minor comments (3)

[Abstract] The term 'Online Architecture' is used without a precise definition distinguishing it from standard feed-forward CNNs or from prior architectural invariance techniques.
[Introduction] Additional references to prior work on translation equivariance (e.g., group-equivariant CNNs, anti-aliasing pooling) would better situate the contribution relative to existing literature.
[§3] Network diagrams or tables listing exact layer indices for GAP insertions and corresponding parameter counts would improve clarity of the 98%/90% reduction claims.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We appreciate the emphasis on reproducibility, justification of design choices, and clarity in the IQA experiments. We address each major comment below and will revise the manuscript to incorporate the suggested improvements.

read point-by-point responses

Referee: [Abstract and §3] Abstract and §3 (method): The claim that 'strategically inserting' GAP layers at 'various network depths' achieves the reported accuracy-robustness tradeoff lacks any derivation, pre-specified rule, or ablation study over insertion points. Early GAP discards fine-grained cues required by later layers while late GAP leaves early maps spatially variant; without sensitivity analysis or comparison to alternative placements, the 66.4% Top-1 and 0.09-to-0.05 loss reduction may reflect post-hoc selection rather than a general principle.

Authors: We acknowledge that the main text did not present a comprehensive ablation over insertion depths. The chosen placements (after the third and fourth pooling layers) were determined via internal validation-set ablations that traded off spatial feature preservation against invariance; early insertion degraded accuracy unacceptably while late insertion left residual spatial sensitivity. We will add a dedicated ablation subsection to §3 that reports Top-1 accuracy and relative translation loss for early, mid, and late insertion configurations, together with a brief rationale based on receptive-field sizes. This will make the selection rule explicit rather than implicit. revision: yes
Referee: [§4.1] §4.1 (ImageNet experiments): Concrete figures (66.4% Top-1, 5.2M-to-82K trainable parameters, relative loss 0.09 to 0.05) are presented without training protocol details (optimizer, schedule, augmentation), exact insertion depths, standard VGG-16 baseline numbers under identical conditions, error bars from multiple seeds, or statistical tests. These omissions make the central empirical claims impossible to evaluate or reproduce from the manuscript alone.

Authors: We agree that these omissions hinder evaluation. The training protocol followed the standard VGG-16 recipe (SGD with momentum 0.9, initial LR 0.01 decayed by 10× every 30 epochs, standard random-crop and horizontal-flip augmentations). Exact insertion points are after conv4_3 and conv5_3. Under identical conditions the unmodified VGG-16 reached 71.5 % Top-1. We will expand §4.1 with these details, report mean ± std over three random seeds (accuracy std ≈ 0.2 %, relative-loss std ≈ 0.01), and include a paired t-test confirming statistical significance (p < 0.01). The released code will contain the exact training scripts. revision: yes
Referee: [§5] §5 (IQA extension): The reported Spearman gains (0.89 vs. 0.75 on KADID-10k; 0.95 on RAID) when integrating the invariant backbone into LPIPS do not clarify whether the baseline was retrained with the same protocol or data; nor is there analysis of how residual aliasing from pooling propagates into perceptual quality scores. This leaves unclear whether gains stem from invariance or from other uncontrolled factors.

Authors: The LPIPS baseline was retrained from scratch on the same training split and with the identical optimization protocol as our invariant variant; this is already stated in the manuscript but will be reiterated with explicit hyper-parameter tables. We will add a short analysis subsection showing that the residual periodic aliasing primarily affects high-frequency components that contribute little to human perceptual judgments in the LPIPS feature space. Supporting plots will correlate aliasing amplitude with LPIPS score differences, demonstrating that the observed Spearman gains are driven by the improved translation invariance rather than confounding factors. revision: partial

Circularity Check

0 steps flagged

No circularity: empirical architectural modifications validated on public benchmarks

full rationale

The paper proposes inserting GAP layers at selected depths in VGG-16 (and extensions to LPIPS) to enforce translation invariance, then reports direct empirical measurements: parameter counts (5.2M → 82K), model size (138M → 14M), ImageNet Top-1 accuracy (66.4%), relative translation loss (0.09 → 0.05), and Spearman correlations on KADID-10k/RAID. These quantities are counted or measured on fixed public datasets; they are not outputs of any fitted model, self-referential equation, or ansatz that loops back to the same inputs. No mathematical derivation chain exists that could reduce to a definition or prior self-citation. The placement of GAP layers is presented as an experimental design choice whose outcomes are then measured, not derived. Self-citations, if present, are not load-bearing for the central performance claims. The result is therefore self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

1 free parameters · 2 axioms · 0 invented entities

The central claim rests on the domain assumption that GAP layers can be inserted without destroying useful hierarchical features and on the empirical observation that the resulting models improve robustness; no new physical entities are postulated and no free parameters are explicitly fitted beyond the architectural choice of insertion depths.

free parameters (1)

GAP insertion depths
Chosen strategically at various network depths; the abstract does not specify an optimization procedure or fitted values.

axioms (2)

domain assumption Standard CNNs exhibit translation fragility due to spatially dependent fully connected layers
Invoked in the opening sentence as the motivation for the architectural change.
domain assumption Global Average Pooling decouples feature recognition from spatial location
Core premise of the 'Online Architecture' strategy stated directly in the abstract.

pith-pipeline@v0.9.0 · 5641 in / 1678 out tokens · 75517 ms · 2026-05-07T05:00:43.926472+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

28 extracted references · 11 canonical work pages

[1]

Communications of the ACM , author =

Hinton, G., Krizhevsky, A., Sutskever, I., Rachmad, Y.: Imagenet classifica- tion with deep convolutional neural networks. Advances in Neural Information Processing Systems, 1097–1105 (2012) https://doi.org/10.1145/3065386

work page doi:10.1145/3065386 2012
[2]

Bengio Y: Deep learning

LeCun Y, H.G. Bengio Y: Deep learning. Nature (2015) https://doi.org/10.1038/ nature14539

2015
[3]

In: International Conference on Learning Representations (ICLR) (2021)

Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., Uszkoreit, J., Houlsby, N.: An image is worth 16x16 words: Transformers for image recognition at scale. In: International Conference on Learning Representations (ICLR) (2021)

2021
[4]

In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (2022)

Liu, Z., Mao, H., Wu, C.-Y., Feichtenhofer, C., Darrell, T., Xie, S.: A convnet for the 2020s. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (2022)

2022
[5]

MIT Press, Cambridge, MA (2016).http://www.deeplearningbook.org

Goodfellow, I., Bengio, Y., Courville, A.: Deep Learning. MIT Press, Cambridge, MA (2016).http://www.deeplearningbook.org

2016
[6]

Simonyan, K., Zisserman, A.: Very Deep Convolutional Networks for Large-Scale Image Recognition (2015)

2015
[7]

In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp

He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recogni- tion. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778 (2016)

2016
[8]

Azulay, A., Weiss, Y.: Why do deep convolutional networks generalize so poorly to small image transformations? CoRRabs/1805.12177(2018) 1805.12177

work page arXiv 2018
[9]

Zhang, R., Isola, P., Efros, A.A., Shechtman, E., Wang, O.: The Unreasonable Effectiveness of Deep Features as a Perceptual Metric (2018)

2018
[10]

Journal of Big Data6(2019) https://doi.org/10.1186/ s40537-019-0197-0

Shorten, C., Khoshgoftaar, T.: A survey on image data augmentation for deep learning. Journal of Big Data6(2019) https://doi.org/10.1186/ s40537-019-0197-0

2019
[11]

CoRRabs/2110.05861(2021) 2110.05861

Biscione, V., Bowers, J.S.: Convolutional neural networks are not invariant to translation, but they can learn to be. CoRRabs/2110.05861(2021) 2110.05861

work page arXiv 2021
[12]

, journal=

Bruna, J., Mallat, S.: Invariant scattering convolution networks. IEEE Trans- actions on Pattern Analysis and Machine Intelligence35(8), 1872–1886 (2013) https://doi.org/10.1109/TPAMI.2012.230

work page doi:10.1109/tpami.2012.230 2013
[13]

https: //arxiv.org/abs/1904.11486

Zhang, R.: Making Convolutional Networks Shift-Invariant Again (2019). https: //arxiv.org/abs/1904.11486

work page arXiv 2019
[14]

Network In Network

Lin, M., Chen, Q., Yan, S.: Network in network. In: International Conference on Learning Representations (ICLR) (2014).https://arxiv.org/abs/1312.4400

work page Pith review arXiv 2014
[15]

Bowers, C.J.H.L

Jeffrey S. Bowers, C.J.H.L. Ivan I. Vankov: The visual system supports online translation invariance for object identification. Psychonomic Bulletin & Review (2016)

2016
[16]

IEEE transactions on image processing13(4), 600–612 (2004)

Wang, Z., Bovik, A.C., Sheikh, H.R., Simoncelli, E.P.: Image quality assess- ment: From error visibility to structural similarity. IEEE transactions on image processing13(4), 600–612 (2004)

2004
[17]

Malo, J.M.A

J. Malo, J.M.A. A.M. Pons: Subjective image fidelity metric based on bit alloca- tion of the human visual system in the dct domain. Image and Vision Computing 24 15, 535–548 (1997)

1997
[18]

Electronic Imaging2016, 1–6 (2016) https://doi.org/10.2352/ISSN.2470-1173.2016.16.HVEI-103

Laparra, V., Ball´ e, J., Berardino, A., Simoncelli, E.: Perceptual image quality assessment using a normalized laplacian pyramid. Electronic Imaging2016, 1–6 (2016) https://doi.org/10.2352/ISSN.2470-1173.2016.16.HVEI-103

work page doi:10.2352/issn.2470-1173.2016.16.hvei-103 2016
[19]

Martinez-Garcia, M., Bertalm´ ıo, M., Malo, J.: In Praise of Artifice Reloaded: Caution with subjective image quality databases (2019)

2019
[20]

https://arxiv.org/abs/2412.10211

Daud´ en-Oliver, P., Agost-Beltran, D., Sansano-Sansano, E., Laparra, V., Malo, J., Mart´ ınez-Garcia, M.: RAID-Database: human Responses to Affine Image Distortions (2025). https://arxiv.org/abs/2412.10211

work page arXiv 2025
[21]

The handbook of brain theory and neural networks3361(10) (1995)

LeCun, Y., Bengio, Y., et al.: Convolutional networks for images, speech, and time series. The handbook of brain theory and neural networks3361(10) (1995)

1995
[22]

In: Ghahra- mani, Z., Welling, M., Cortes, C., Lawrence, N., Weinberger, K.Q

Gens, R., Domingos, P.M.: Deep symmetry networks. In: Ghahra- mani, Z., Welling, M., Cortes, C., Lawrence, N., Weinberger, K.Q. (eds.) Advances in Neural Information Processing Sys- tems, vol. 27. Curran Associates, Inc., Red Hook, NY (2014). https://proceedings.neurips.cc/paper/2014/file/f9be311e65d81a9ad8150a60844bb94c- Paper.pdf

2014
[23]

Biological Cybernetics (1980)

Fukushima, K.: Neocognitron: A self-organizing neural network model for a mech- anism of pattern recognition unaffected by shift in position. Biological Cybernetics (1980)

1980
[24]

CoRRabs/1403.1840(2014) 1403.1840

Gong, Y., Wang, L., Guo, R., Lazebnik, S.: Multi-scale orderless pooling of deep convolutional activation features. CoRRabs/1403.1840(2014) 1403.1840

work page arXiv 2014
[25]

CoRRabs/1801.01450(2018) 1801.01450

Kauderer-Abrams, E.: Quantifying translation-invariance in convolutional neural networks. CoRRabs/1801.01450(2018) 1801.01450

work page arXiv 2018
[26]

Advances of Modern Radioelectronics10, 30–45 (2009)

Ponomarenko, N., Lukin, V., Zelensky, A., Egiazarian, K., Carli, M., Battisti, F.: Tid2008 - a database for evaluation of full-reference visual quality assessment metrics. Advances of Modern Radioelectronics10, 30–45 (2009)

2009
[27]

Signal Processing: Image Communication30, 57–77 (2015) https://doi.org/10.1016/j.image.2014.10.009

Ponomarenko, N., Jin, L., Ieremeiev, O., Lukin, V., Egiazarian, K., Astola, J., Vozel, B., Chehdi, K., Carli, M., Battisti, F., Jay Kuo, C.-C.: Image database tid2013: Peculiarities, results and perspectives. Signal Processing: Image Communication30, 57–77 (2015) https://doi.org/10.1016/j.image.2014.10.009

work page doi:10.1016/j.image.2014.10.009 2015
[28]

In: 2019 Eleventh International Conference on Quality of Multimedia Experience (QoMEX), pp

Lin, H., Hosu, V., Saupe, D.: Kadid-10k: A large-scale artificially distorted iqa database. In: 2019 Eleventh International Conference on Quality of Multimedia Experience (QoMEX), pp. 1–3 (2019). IEEE 25

2019