Wide Residual Networks
Pith reviewed 2026-05-13 01:20 UTC · model grok-4.3
The pith
Wide residual networks with reduced depth and increased width outperform much deeper thin residual networks in accuracy and training speed.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Residual networks gain more from increased width than from increased depth; the resulting wide residual networks achieve new state-of-the-art accuracy on CIFAR, SVHN, and COCO and deliver significant gains on ImageNet, all with far fewer layers than the thin deep baselines they replace.
What carries the argument
The wide residual block: overall network depth is decreased while the number of feature channels in each block's convolutions is multiplied by a widening factor k, with the residual shortcut connections retained.
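A minimal PyTorch sketch of such a block (our reconstruction under the pre-activation BN-ReLU-conv ordering; class and parameter names are illustrative, not the authors' released Torch code):

```python
import torch.nn as nn

class WideBasicBlock(nn.Module):
    """Pre-activation residual block whose 3x3 convolutions carry
    k times more channels than the thin baseline; the shortcut is
    an identity (or a 1x1 projection when shapes change)."""

    def __init__(self, in_planes, planes, stride=1, dropout_rate=0.0):
        super().__init__()
        self.bn1 = nn.BatchNorm2d(in_planes)
        self.conv1 = nn.Conv2d(in_planes, planes, kernel_size=3,
                               stride=stride, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(planes)
        self.dropout = nn.Dropout(p=dropout_rate)
        self.conv2 = nn.Conv2d(planes, planes, kernel_size=3,
                               stride=1, padding=1, bias=False)
        self.relu = nn.ReLU(inplace=True)
        self.shortcut = nn.Identity()
        if stride != 1 or in_planes != planes:
            self.shortcut = nn.Conv2d(in_planes, planes, kernel_size=1,
                                      stride=stride, bias=False)

    def forward(self, x):
        out = self.conv1(self.relu(self.bn1(x)))
        out = self.conv2(self.dropout(self.relu(self.bn2(out))))
        return out + self.shortcut(x)

# Widening: a thin 16-channel stage becomes 16 * k channels.
k = 10  # width multiplier; k = 1 recovers the thin baseline
block = WideBasicBlock(in_planes=16, planes=16 * k)
```

With k = 1 this reduces to the thin baseline block, so width is the only knob being turned.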
If this is right
- Training time and memory use drop because shallower networks avoid the slowdown from excessive layers.
- Accuracy improves on CIFAR, SVHN, COCO, and ImageNet without needing thousand-layer depths.
- Feature reuse becomes more effective, allowing simpler networks to reach higher performance.
- The architecture change applies across multiple datasets without requiring entirely new block designs.
Where Pith is reading between the lines
- Architectures in other domains might also gain more from width scaling than from depth scaling when feature reuse is the bottleneck.
- Model design could shift toward finding optimal width-to-depth ratios instead of always maximizing depth.
- Similar width-focused adjustments might improve efficiency in non-residual networks facing training slowdowns.
Load-bearing premise
The performance gains arise primarily from the width increase and depth reduction rather than from training schedule, data augmentation, or hyperparameter differences that might favor the new models.
What would settle it
Re-train the original thousand-layer thin ResNet under the exact same training schedule, data augmentation, and hyperparameters as the 16-layer wide network, with parameter budgets matched, and measure whether the accuracy gap disappears or reverses.
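Operationally, "exact same protocol" can be pinned down as a shared configuration in which only the architecture varies (a hypothetical sketch; the specific values follow common CIFAR ResNet recipes and are not quoted from this review):

```python
# Hypothetical matched-protocol sketch: everything except the
# architecture is held fixed across the two runs being compared.
shared_protocol = dict(
    optimizer="SGD", momentum=0.9, weight_decay=5e-4,
    batch_size=128, epochs=200,
    lr_schedule=[(0, 0.1), (60, 0.02), (120, 0.004), (160, 0.0008)],
    augmentation=["random_crop_32_pad_4", "horizontal_flip"],
)
runs = [
    dict(arch="resnet", depth=1001, widen_factor=1, **shared_protocol),
    dict(arch="wrn", depth=16, widen_factor=8, **shared_protocol),
]
```

If the accuracy gap survives this kind of matched comparison, the width-over-depth claim stands; if it closes, the gains were protocol artifacts.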
read the original abstract
Deep residual networks were shown to be able to scale up to thousands of layers and still have improving performance. However, each fraction of a percent of improved accuracy costs nearly doubling the number of layers, and so training very deep residual networks has a problem of diminishing feature reuse, which makes these networks very slow to train. To tackle these problems, in this paper we conduct a detailed experimental study on the architecture of ResNet blocks, based on which we propose a novel architecture where we decrease depth and increase width of residual networks. We call the resulting network structures wide residual networks (WRNs) and show that these are far superior over their commonly used thin and very deep counterparts. For example, we demonstrate that even a simple 16-layer-deep wide residual network outperforms in accuracy and efficiency all previous deep residual networks, including thousand-layer-deep networks, achieving new state-of-the-art results on CIFAR, SVHN, COCO, and significant improvements on ImageNet. Our code and models are available at https://github.com/szagoruyko/wide-residual-networks
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces Wide Residual Networks (WRNs) by decreasing the depth of residual blocks while increasing their width via a width multiplier k. It reports that a simple 16-layer WRN outperforms all prior deep residual networks (including 1000-layer variants) in accuracy and training speed on CIFAR-10/100 and SVHN, with additional gains on ImageNet classification and COCO detection. The authors provide controlled ablations under fixed parameter budgets and release code and models.
Significance. The work is significant because it supplies reproducible empirical evidence that width can be more effective than extreme depth for residual networks, yielding faster convergence and better accuracy under matched training protocols. The public release of code and models, together with the use of re-implemented baselines, strengthens the reliability of the performance claims and their utility for the community.
minor comments (4)
- §3.1: The description of the basic block could include an explicit equation or diagram showing how the width multiplier k scales the number of filters in the 3×3 convolutions; a hedged sketch of such an equation follows this list.
- Table 1: Adding a column for total parameters and training time per epoch would make the efficiency claims easier to verify at a glance.
- §4.2: The SVHN results mention a specific dropout placement; a brief note on whether the same schedule was used for all baseline re-implementations would improve clarity.
- Figure 3: The learning curves are informative, but the axis labels could specify the exact metric (e.g., top-1 error) and include a legend for the different k values.
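As a hedged sketch of the equation requested above (our notation; on CIFAR the thin baseline's per-stage conv widths are 16, 32, 64 before widening):

```latex
% Widening keeps the residual form of each block and multiplies the
% thin baseline's per-stage filter counts by the factor k:
\[
  x_{l+1} = x_l + \mathcal{F}(x_l;\, W_l),
  \qquad
  \text{filters per stage} = (16k,\; 32k,\; 64k), \quad k \ge 1,
\]
% so k = 1 recovers the original thin ResNet and k > 1 widens every
% 3x3 convolution in the block.
```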
Simulated Author's Rebuttal
We thank the referee for their positive review, accurate summary of our contributions, and recommendation to accept the manuscript. We appreciate the recognition of the significance of our empirical results on width versus depth in residual networks, as well as the value placed on our code and model releases.
Circularity Check
No significant circularity
full rationale
The paper is an empirical architecture study. It conducts controlled ablations on ResNet blocks, proposes wider-shallower variants, and validates via accuracy/efficiency comparisons on fixed public benchmarks (CIFAR, SVHN, ImageNet, COCO). No equations, fitted parameters renamed as predictions, or self-referential derivations appear. Baselines are re-implemented under the authors' protocol rather than taken verbatim. Central claims rest on experimental outcomes independent of prior self-citations or definitional loops.
Axiom & Free-Parameter Ledger
free parameters (2)
- width multiplier k
- dropout rate
axioms (2)
- domain assumption: Residual skip connections mitigate vanishing gradients and enable the training of very deep networks.
- standard math: Stochastic gradient descent with momentum and standard learning-rate decay trains the networks to convergence.
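To make the ledger concrete: under the standard WRN-d-k layout (depth d = 6n + 4 with three stages of widths 16k, 32k, 64k; this layout is assumed here, not stated above), parameter count grows roughly quadratically in k but only linearly in depth. A back-of-the-envelope sketch:

```python
def wrn_conv_params(depth, k, widths=(16, 32, 64)):
    """Rough 3x3-conv parameter count for a WRN-depth-k on CIFAR.

    Back-of-the-envelope only: ignores BN, the initial convolution,
    shortcut projections, and the classifier. depth = 6n + 4, so each
    of the three stages has n basic blocks with two 3x3 convs each.
    """
    assert (depth - 4) % 6 == 0, "WRN depth must be 6n + 4"
    n = (depth - 4) // 6
    params = 0
    in_ch = 16  # channels entering the first stage
    for w in widths:
        out_ch = w * k
        for _ in range(n):
            params += 9 * in_ch * out_ch   # first 3x3 conv of the block
            params += 9 * out_ch * out_ch  # second 3x3 conv
            in_ch = out_ch
    return params

# Quadratic in k, linear in depth:
print(wrn_conv_params(16, 8))   # wide and shallow: ~10.8M conv weights
print(wrn_conv_params(40, 1))   # thin and deep:    ~0.56M conv weights
```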
Forward citations
Cited by 30 Pith papers
- Denoising Diffusion Probabilistic Models · Denoising diffusion probabilistic models generate high-quality images by learning to reverse a fixed forward diffusion process, achieving FID 3.17 on CIFAR10.
- The Geometric Structure of Models Learning Sparse Data · In sparse regimes, models exploit normal alignment of Jacobians to minimize loss and maximize robustness; GrokAlign induces this alignment to accelerate training and RFAMs improve adversarial robustness.
- Low Rank Adaptation for Adversarial Perturbation · Adversarial perturbations possess an inherently low-rank structure that enables more efficient and effective black-box adversarial attacks via subspace projection.
- Concept Inconsistency in Dermoscopic Concept Bottleneck Models: A Rough-Set Analysis of the Derm7pt Dataset · Rough-set analysis finds 16.4% of 305 concept profiles in Derm7pt inconsistent (306 images), capping hard CBM accuracy at 92.1%; symmetric filtering produces a 705-image consistent benchmark where EfficientNet-B5 reac...
- Momentum Further Constrains Sharpness at the Edge of Stochastic Stability · Momentum SGD exhibits two distinct EoSS regimes for batch sharpness, stabilizing at 2(1-β)/η for small batches and 2(1+β)/η for large batches, aligning with linear stability thresholds.
- Learning Robustness at Test-Time from a Non-Robust Teacher · A test-time adaptation framework anchors adversarial training to a non-robust teacher's predictions, yielding more stable optimization and better robustness-accuracy trade-offs than standard self-consistency methods.
- Novel Anomaly Detection Scenarios and Evaluation Metrics to Address the Ambiguity in the Definition of Normal Samples · Introduces scenarios and metrics for ambiguous normal samples in anomaly detection plus the RePaste method achieving SOTA on the new metric on MVTec AD while retaining high AUROC and PRO.
- LLM.int8(): 8-bit Matrix Multiplication for Transformers at Scale · LLM.int8() performs 8-bit inference for transformers up to 175B parameters with no accuracy loss by combining vector-wise quantization for most features with 16-bit mixed-precision handling of systematic outlier dimensions.
- Video Diffusion Models · A diffusion model for video generation extends image architectures with joint image-video training and improved conditional sampling, delivering first large-scale text-to-video results and state-of-the-art performance...
- Venus-DeFakerOne: Unified Fake Image Detection & Localization · DeFakerOne integrates InternVL2 and SAM2 into a single model that achieves state-of-the-art results on 39 detection and 9 localization benchmarks for unified fake image detection and localization.
- FedVSSAM: Mitigating Flatness Incompatibility in Sharpness-Aware Federated Learning · FedVSSAM mitigates flatness incompatibility in SAM-based federated learning by consistently using a variance-suppressed adjusted direction for local perturbation, descent, and global updates, with non-convex convergen...
- Direct-to-Event Spiking Neural Network Transfer · This work provides the first systematic study of transferring direct-coded spiking neural networks to event-based representations while aiming to preserve accuracy and reduce energy use.
- Deep Wave Network for Modeling Multi-Scale Physical Dynamics · DW-Net improves the accuracy versus computational cost Pareto front over standard U-Nets for 2D and 3D multi-scale flow benchmarks by stacking multiple waves while keeping training settings identical.
- Detecting Adversarial Data via Provable Adversarial Noise Amplification · A provable adversarial noise amplification theorem under sufficient conditions enables a custom-trained detector that identifies adversarial examples at inference time using enhanced layer-wise noise signals.
- Learning to Reason: Targeted Knowledge Discovery and Fuzzy Logic Update for Robust Image Recognition · A differentiable fuzzy logic module called DKU discovers implicit concepts from image classification supervision and applies logical adjustments to improve class probabilities on PASCAL-VOC, COCO, and MedMNIST.
- FastAT Benchmark: A Comprehensive Framework for Fair Evaluation of Fast Adversarial Training Methods · The FastAT Benchmark standardizes evaluation of over twenty fast adversarial training methods under unified conditions, showing that well-designed single-step approaches can match or exceed PGD-AT robustness at lower ...
- Generative Cross-Entropy: A Strictly Proper Loss for Data-Efficient Classification · GenCE is a strictly proper loss obtained by normalizing each sample's softmax against the batch predictions, outperforming cross-entropy in low-data and imbalanced regimes with better calibration and OOD detection.
- StableTTA: Improving Vision Model Performance by Training-free Test-Time Adaptation Methods · StableTTA improves ImageNet-1K accuracy across 71 vision models by stabilizing logit aggregation under coherent-batch inference and enabling efficient single-forward-pass adaptation.
- Revisiting Feature Prediction for Learning Visual Representations from Video · V-JEPA models trained only on feature prediction from 2 million public videos achieve 81.9% on Kinetics-400, 72.2% on Something-Something-v2, and 77.9% on ImageNet-1K using frozen ViT-H/16 backbones.
- Self-Play Fine-Tuning Converts Weak Language Models to Strong Language Models · SPIN lets weak LLMs become strong by self-generating training data from previous model versions and training to prefer human-annotated responses over its own outputs, outperforming DPO even with extra GPT-4 data on be...
- Rethinking Atrous Convolution for Semantic Image Segmentation · DeepLabv3 improves semantic segmentation by capturing multi-scale context with cascaded or parallel atrous convolutions and adding global context to ASPP, achieving better results on PASCAL VOC 2012 without DenseCRF p...
- SGDR: Stochastic Gradient Descent with Warm Restarts · SGDR uses periodic warm restarts of the learning rate in SGD to reach new state-of-the-art error rates of 3.14% on CIFAR-10 and 16.21% on CIFAR-100.
- Taming the Long Tail: Rebalancing Adversarial Training via Adaptive Perturbation · RobustLT adaptively adjusts perturbations in adversarial training to simultaneously improve robustness and class balance on long-tailed datasets.
- A Composite Activation Function for Learning Stable Binary Representations · HTAF is a sigmoid-tanh composite that approximates the Heaviside function to allow stable gradient training of binary activation networks, yielding ICBMs with stable discretization and competitive performance on image tasks.
- Memory Efficient Full-gradient Attacks (MEFA) Framework for Adversarial Defense Evaluations · MEFA enables exact full-gradient white-box attacks on iterative stochastic purification defenses like diffusion and Langevin EBMs by trading recomputation for lower memory, revealing vulnerabilities missed by approxim...
- Generative Cross-Entropy: A Strictly Proper Loss for Data-Efficient Classification · Generative Cross-Entropy loss improves both accuracy and calibration over standard cross-entropy by augmenting it with a generative p(x|y) term, especially on long-tailed data, and pairs with adaptive temperature scal...
- Foundations of Reliable Inference: Reliability-Efficiency Co-Design · A unified framework is developed for co-designing reliability and efficiency to enable efficient reliable inference with trustworthy uncertainty quantification in AI models.
- JEPAMatch: Geometric Representation Shaping for Semi-Supervised Learning · JEPAMatch augments FlexMatch with LeJEPA-derived latent regularization to produce better-structured representations, yielding higher accuracy and faster convergence on CIFAR-100, STL-10, and Tiny-ImageNet.
- Image Classification via Random Dilated Convolution with Multi-Branch Feature Extraction and Context Excitation · RDCNet reports state-of-the-art accuracy on CIFAR-10, CIFAR-100, SVHN, Imagenette, and Imagewoof by combining random dilated convolutions with multi-branch and attention modules.
- A Transfer Learning Evaluation of Deep Neural Networks for Image Classification · Empirical comparison of transfer learning performance across eleven pre-trained models on five image datasets using accuracy, time, and size metrics.
Reference graph
Works this paper leans on
- [1] Yoshua Bengio and Xavier Glorot. Understanding the difficulty of training deep feedforward neural networks. In Proceedings of AISTATS 2010, volume 9, pages 249–256, May 2010.
- [2] Yoshua Bengio and Yann LeCun. Scaling learning algorithms towards AI. In Léon Bottou, Olivier Chapelle, D. DeCoste, and J. Weston, editors, Large Scale Kernel Machines. MIT Press, 2007.
- [3] Monica Bianchini and Franco Scarselli. On the complexity of shallow and deep neural network classifiers. In 22nd European Symposium on Artificial Neural Networks, ESANN 2014, Bruges, Belgium, April 23-25, 2014.
- [4] T. Chen, I. Goodfellow, and J. Shlens. Net2Net: Accelerating learning via knowledge transfer. In International Conference on Learning Representations, 2016.
- [5] Djork-Arné Clevert, Thomas Unterthiner, and Sepp Hochreiter. Fast and accurate deep network learning by exponential linear units (ELUs). CoRR, abs/1511.07289, 2015.
- [6] R. Collobert, K. Kavukcuoglu, and C. Farabet. Torch7: A Matlab-like environment for machine learning. In BigLearn, NIPS Workshop, 2011.
- [7] Spyros Gidaris and Nikos Komodakis. LocNet: Improving localization accuracy for object detection. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016.
- [8] Ian J. Goodfellow, David Warde-Farley, Mehdi Mirza, Aaron Courville, and Yoshua Bengio. Maxout networks. In Sanjoy Dasgupta and David McAllester, editors, Proceedings of the 30th International Conference on Machine Learning (ICML'13), pages 1319–1327, 2013.
- [9] Benjamin Graham. Fractional max-pooling. arXiv:1412.6071, 2014.
- [10] Sam Gross and Michael Wilber. Training and investigating residual nets, 2016. URL https://github.com/facebook/fb.resnet.torch
- [11] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. CoRR, abs/1512.03385, 2015.
- [12] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Delving deep into rectifiers: Surpassing human-level performance on ImageNet classification. CoRR, abs/1502.01852, 2015.
- [13] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Identity mappings in deep residual networks. CoRR, abs/1603.05027, 2016.
- [14] Gao Huang, Yu Sun, Zhuang Liu, Daniel Sedra, and Kilian Q. Weinberger. Deep networks with stochastic depth. CoRR, abs/1603.09382, 2016.
- [15] Sergey Ioffe and Christian Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In David Blei and Francis Bach, editors, Proceedings of the 32nd International Conference on Machine Learning (ICML-15), pages 448–456. JMLR Workshop and Conference Proceedings, 2015.
- [16] A. Krizhevsky, I. Sutskever, and G. Hinton. ImageNet classification with deep convolutional neural networks. In NIPS, 2012.
- [17] Alex Krizhevsky, Vinod Nair, and Geoffrey Hinton. CIFAR-10 (Canadian Institute for Advanced Research), 2012. URL http://www.cs.toronto.edu/~kriz/cifar.html
- [18] Hugo Larochelle, Dumitru Erhan, Aaron Courville, James Bergstra, and Yoshua Bengio. An empirical evaluation of deep architectures on problems with many factors of variation. In Zoubin Ghahramani, editor, Proceedings of the 24th International Conference on Machine Learning (ICML'07), pages 473–480. ACM, 2007.
- [19] C.-Y. Lee, S. Xie, P. Gallagher, Z. Zhang, and Z. Tu. Deeply-supervised nets. 2014.
- [20] Min Lin, Qiang Chen, and Shuicheng Yan. Network in network. CoRR, abs/1312.4400, 2013.
- [21] Francisco Massa. Optnet: reducing memory usage in Torch neural networks, 2016. URL https://github.com/fmassa/optimize-net
- [22] Guido F. Montúfar, Razvan Pascanu, KyungHyun Cho, and Yoshua Bengio. On the number of linear regions of deep neural networks. In Advances in Neural Information Processing Systems 27: Annual Conference on Neural Information Processing Systems 2014, December 8-13 2014, Montreal, Quebec, Canada, pages 2924–2932, 2014.
- [23] Tapani Raiko, Harri Valpola, and Yann LeCun. Deep learning made easier by linear transformations in perceptrons. In Neil D. Lawrence and Mark A. Girolami, editors, Proceedings of the Fifteenth International Conference on Artificial Intelligence and Statistics (AISTATS-12), volume 22, pages 924–932, 2012.
- [24] Adriana Romero, Nicolas Ballas, Samira Ebrahimi Kahou, Antoine Chassang, Carlo Gatta, and Yoshua Bengio. FitNets: Hints for thin deep nets. Technical Report arXiv:1412.6550, arXiv, 2014.
- [25] J. Schmidhuber. Learning complex, extended sequences using the principle of history compression. Neural Computation, 4(2):234–242, 1992.
- [26] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. In ICLR, 2015.
- [27] N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov. Dropout: A simple way to prevent neural networks from overfitting. JMLR, 2014.
- [28] Rupesh Kumar Srivastava, Klaus Greff, and Jürgen Schmidhuber. Highway networks. CoRR, abs/1505.00387, 2015.
- [29] Ilya Sutskever, James Martens, George E. Dahl, and Geoffrey E. Hinton. On the importance of initialization and momentum in deep learning. In Sanjoy Dasgupta and David McAllester, editors, Proceedings of the 30th International Conference on Machine Learning (ICML-13), volume 28, pages 1139–1147. JMLR Workshop and Conference Proceedings, May 2013.
- [30] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich. Going deeper with convolutions. In CVPR, 2015.
- [31] Christian Szegedy, Sergey Ioffe, and Vincent Vanhoucke. Inception-v4, Inception-ResNet and the impact of residual connections on learning. CoRR, abs/1602.07261, 2016.
- [32] S. Zagoruyko, A. Lerer, T.-Y. Lin, P. O. Pinheiro, S. Gross, S. Chintala, and P. Dollár. A multipath network for object detection. In BMVC, 2016.