Wide Residual Networks
Pith reviewed 2026-05-13 01:20 UTC · model grok-4.3
The pith
Wide residual networks with reduced depth and increased width outperform much deeper thin residual networks in accuracy and training speed.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Residual networks gain more from increased width than from increased depth; the resulting wide residual networks achieve new state-of-the-art accuracy on CIFAR, SVHN, and COCO and deliver significant gains on ImageNet, all with far fewer layers than the thin deep baselines they replace.
What carries the argument
The wide residual block: overall network depth is decreased while the number of feature channels in each block's convolutions is multiplied by a widening factor k, with the residual shortcut connections retained.
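A minimal PyTorch sketch of such a block (our reconstruction under the pre-activation BN-ReLU-conv ordering; class and parameter names are illustrative, not the authors' released Torch code):

```python
import torch.nn as nn

class WideBasicBlock(nn.Module):
    """Pre-activation residual block whose 3x3 convolutions carry
    k times more channels than the thin baseline; the shortcut is
    an identity (or a 1x1 projection when shapes change)."""

    def __init__(self, in_planes, planes, stride=1, dropout_rate=0.0):
        super().__init__()
        self.bn1 = nn.BatchNorm2d(in_planes)
        self.conv1 = nn.Conv2d(in_planes, planes, kernel_size=3,
                               stride=stride, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(planes)
        self.dropout = nn.Dropout(p=dropout_rate)
        self.conv2 = nn.Conv2d(planes, planes, kernel_size=3,
                               stride=1, padding=1, bias=False)
        self.relu = nn.ReLU(inplace=True)
        self.shortcut = nn.Identity()
        if stride != 1 or in_planes != planes:
            self.shortcut = nn.Conv2d(in_planes, planes, kernel_size=1,
                                      stride=stride, bias=False)

    def forward(self, x):
        out = self.conv1(self.relu(self.bn1(x)))
        out = self.conv2(self.dropout(self.relu(self.bn2(out))))
        return out + self.shortcut(x)

# Widening: a thin 16-channel stage becomes 16 * k channels.
k = 10  # width multiplier; k = 1 recovers the thin baseline
block = WideBasicBlock(in_planes=16, planes=16 * k)
```

With k = 1 this reduces to the thin baseline block, so width is the only knob being turned.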
If this is right
- Training time and memory use drop because shallower networks avoid the slowdown from excessive layers.
- Accuracy improves on CIFAR, SVHN, COCO, and ImageNet without needing thousand-layer depths.
- Feature reuse becomes more effective, allowing simpler networks to reach higher performance.
- The architecture change applies across multiple datasets without requiring entirely new block designs.
Where Pith is reading between the lines
- Architectures in other domains might also gain more from width scaling than from depth scaling when feature reuse is the bottleneck.
- Model design could shift toward finding optimal width-to-depth ratios instead of always maximizing depth.
- Similar width-focused adjustments might improve efficiency in non-residual networks facing training slowdowns.
Load-bearing premise
The performance gains arise primarily from the width increase and depth reduction rather than from training schedule, data augmentation, or hyperparameter differences that might favor the new models.
What would settle it
Re-train the original thousand-layer thin ResNet under the exact same training schedule, data augmentation, and hyperparameters as the 16-layer wide network, with parameter budgets matched, and measure whether the accuracy gap disappears or reverses.
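Operationally, "exact same protocol" can be pinned down as a shared configuration in which only the architecture varies (a hypothetical sketch; the specific values follow common CIFAR ResNet recipes and are not quoted from this review):

```python
# Hypothetical matched-protocol sketch: everything except the
# architecture is held fixed across the two runs being compared.
shared_protocol = dict(
    optimizer="SGD", momentum=0.9, weight_decay=5e-4,
    batch_size=128, epochs=200,
    lr_schedule=[(0, 0.1), (60, 0.02), (120, 0.004), (160, 0.0008)],
    augmentation=["random_crop_32_pad_4", "horizontal_flip"],
)
runs = [
    dict(arch="resnet", depth=1001, widen_factor=1, **shared_protocol),
    dict(arch="wrn", depth=16, widen_factor=8, **shared_protocol),
]
```

If the accuracy gap survives this kind of matched comparison, the width-over-depth claim stands; if it closes, the gains were protocol artifacts.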
read the original abstract
Deep residual networks were shown to be able to scale up to thousands of layers and still have improving performance. However, each fraction of a percent of improved accuracy costs nearly doubling the number of layers, and so training very deep residual networks has a problem of diminishing feature reuse, which makes these networks very slow to train. To tackle these problems, in this paper we conduct a detailed experimental study on the architecture of ResNet blocks, based on which we propose a novel architecture where we decrease depth and increase width of residual networks. We call the resulting network structures wide residual networks (WRNs) and show that these are far superior over their commonly used thin and very deep counterparts. For example, we demonstrate that even a simple 16-layer-deep wide residual network outperforms in accuracy and efficiency all previous deep residual networks, including thousand-layer-deep networks, achieving new state-of-the-art results on CIFAR, SVHN, COCO, and significant improvements on ImageNet. Our code and models are available at https://github.com/szagoruyko/wide-residual-networks
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces Wide Residual Networks (WRNs) by decreasing the depth of residual blocks while increasing their width via a width multiplier k. It reports that a simple 16-layer WRN outperforms all prior deep residual networks (including 1000-layer variants) in accuracy and training speed on CIFAR-10/100 and SVHN, with additional gains on ImageNet classification and COCO detection. The authors provide controlled ablations under fixed parameter budgets and release code and models.
Significance. The work is significant because it supplies reproducible empirical evidence that width can be more effective than extreme depth for residual networks, yielding faster convergence and better accuracy under matched training protocols. The public release of code and models, together with the use of re-implemented baselines, strengthens the reliability of the performance claims and their utility for the community.
minor comments (4)
- §3.1: The description of the basic block could include an explicit equation or diagram showing how the width multiplier k scales the number of filters in the 3×3 convolutions; a hedged sketch of such an equation follows this list.
- Table 1: Adding a column for total parameters and training time per epoch would make the efficiency claims easier to verify at a glance.
- §4.2: The SVHN results mention a specific dropout placement; a brief note on whether the same schedule was used for all baseline re-implementations would improve clarity.
- Figure 3: The learning curves are informative, but the axis labels could specify the exact metric (e.g., top-1 error) and include a legend for the different k values.
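As a hedged sketch of the equation requested above (our notation; on CIFAR the thin baseline's per-stage conv widths are 16, 32, 64 before widening):

```latex
% Widening keeps the residual form of each block and multiplies the
% thin baseline's per-stage filter counts by the factor k:
\[
  x_{l+1} = x_l + \mathcal{F}(x_l;\, W_l),
  \qquad
  \text{filters per stage} = (16k,\; 32k,\; 64k), \quad k \ge 1,
\]
% so k = 1 recovers the original thin ResNet and k > 1 widens every
% 3x3 convolution in the block.
```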
Simulated Author's Rebuttal
We thank the referee for their positive review, accurate summary of our contributions, and recommendation to accept the manuscript. We appreciate the recognition of the significance of our empirical results on width versus depth in residual networks, as well as the value placed on our code and model releases.
Circularity Check
No significant circularity
full rationale
The paper is an empirical architecture study. It conducts controlled ablations on ResNet blocks, proposes wider-shallower variants, and validates via accuracy/efficiency comparisons on fixed public benchmarks (CIFAR, SVHN, ImageNet, COCO). No equations, fitted parameters renamed as predictions, or self-referential derivations appear. Baselines are re-implemented under the authors' protocol rather than taken verbatim. Central claims rest on experimental outcomes independent of prior self-citations or definitional loops.
Axiom & Free-Parameter Ledger
free parameters (2)
- width multiplier k
- dropout rate
axioms (2)
- domain assumption: Residual skip connections mitigate vanishing gradients and enable the training of very deep networks.
- standard math: Stochastic gradient descent with momentum and standard learning-rate decay trains the networks to convergence.
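To make the ledger concrete: under the standard WRN-d-k layout (depth d = 6n + 4 with three stages of widths 16k, 32k, 64k; this layout is assumed here, not stated above), parameter count grows roughly quadratically in k but only linearly in depth. A back-of-the-envelope sketch:

```python
def wrn_conv_params(depth, k, widths=(16, 32, 64)):
    """Rough 3x3-conv parameter count for a WRN-depth-k on CIFAR.

    Back-of-the-envelope only: ignores BN, the initial convolution,
    shortcut projections, and the classifier. depth = 6n + 4, so each
    of the three stages has n basic blocks with two 3x3 convs each.
    """
    assert (depth - 4) % 6 == 0, "WRN depth must be 6n + 4"
    n = (depth - 4) // 6
    params = 0
    in_ch = 16  # channels entering the first stage
    for w in widths:
        out_ch = w * k
        for _ in range(n):
            params += 9 * in_ch * out_ch   # first 3x3 conv of the block
            params += 9 * out_ch * out_ch  # second 3x3 conv
            in_ch = out_ch
    return params

# Quadratic in k, linear in depth:
print(wrn_conv_params(16, 8))   # wide and shallow: ~10.8M conv weights
print(wrn_conv_params(40, 1))   # thin and deep:    ~0.56M conv weights
```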
Forward citations
Cited by 30 Pith papers
- Denoising Diffusion Probabilistic Models · Denoising diffusion probabilistic models generate high-quality images by learning to reverse a fixed forward diffusion process, achieving FID 3.17 on CIFAR10.
- The Geometric Structure of Models Learning Sparse Data · In sparse regimes, models exploit normal alignment of Jacobians to minimize loss and maximize robustness; GrokAlign induces this alignment to accelerate training and RFAMs improve adversarial robustness.
- Low Rank Adaptation for Adversarial Perturbation · Adversarial perturbations possess an inherently low-rank structure that enables more efficient and effective black-box adversarial attacks via subspace projection.
- Concept Inconsistency in Dermoscopic Concept Bottleneck Models: A Rough-Set Analysis of the Derm7pt Dataset · Rough-set analysis finds 16.4% of 305 concept profiles in Derm7pt inconsistent (306 images), capping hard CBM accuracy at 92.1%; symmetric filtering produces a 705-image consistent benchmark where EfficientNet-B5 reac...
- Momentum Further Constrains Sharpness at the Edge of Stochastic Stability · Momentum SGD exhibits two distinct EoSS regimes for batch sharpness, stabilizing at 2(1-β)/η for small batches and 2(1+β)/η for large batches, aligning with linear stability thresholds.
- Learning Robustness at Test-Time from a Non-Robust Teacher · A test-time adaptation framework anchors adversarial training to a non-robust teacher's predictions, yielding more stable optimization and better robustness-accuracy trade-offs than standard self-consistency methods.
- Novel Anomaly Detection Scenarios and Evaluation Metrics to Address the Ambiguity in the Definition of Normal Samples · Introduces scenarios and metrics for ambiguous normal samples in anomaly detection plus the RePaste method achieving SOTA on the new metric on MVTec AD while retaining high AUROC and PRO.
- LLM.int8(): 8-bit Matrix Multiplication for Transformers at Scale · LLM.int8() performs 8-bit inference for transformers up to 175B parameters with no accuracy loss by combining vector-wise quantization for most features with 16-bit mixed-precision handling of systematic outlier dimensions.
- Video Diffusion Models · A diffusion model for video generation extends image architectures with joint image-video training and improved conditional sampling, delivering first large-scale text-to-video results and state-of-the-art performance...
- Venus-DeFakerOne: Unified Fake Image Detection & Localization · DeFakerOne integrates InternVL2 and SAM2 into a single model that achieves state-of-the-art results on 39 detection and 9 localization benchmarks for unified fake image detection and localization.
- FedVSSAM: Mitigating Flatness Incompatibility in Sharpness-Aware Federated Learning · FedVSSAM mitigates flatness incompatibility in SAM-based federated learning by consistently using a variance-suppressed adjusted direction for local perturbation, descent, and global updates, with non-convex convergen...
- Direct-to-Event Spiking Neural Network Transfer · This work provides the first systematic study of transferring direct-coded spiking neural networks to event-based representations while aiming to preserve accuracy and reduce energy use.
- Deep Wave Network for Modeling Multi-Scale Physical Dynamics · DW-Net improves the accuracy versus computational cost Pareto front over standard U-Nets for 2D and 3D multi-scale flow benchmarks by stacking multiple waves while keeping training settings identical.
- Detecting Adversarial Data via Provable Adversarial Noise Amplification · A provable adversarial noise amplification theorem under sufficient conditions enables a custom-trained detector that identifies adversarial examples at inference time using enhanced layer-wise noise signals.
- Learning to Reason: Targeted Knowledge Discovery and Fuzzy Logic Update for Robust Image Recognition · A differentiable fuzzy logic module called DKU discovers implicit concepts from image classification supervision and applies logical adjustments to improve class probabilities on PASCAL-VOC, COCO, and MedMNIST.
- FastAT Benchmark: A Comprehensive Framework for Fair Evaluation of Fast Adversarial Training Methods · The FastAT Benchmark standardizes evaluation of over twenty fast adversarial training methods under unified conditions, showing that well-designed single-step approaches can match or exceed PGD-AT robustness at lower ...
- Generative Cross-Entropy: A Strictly Proper Loss for Data-Efficient Classification · GenCE is a strictly proper loss obtained by normalizing each sample's softmax against the batch predictions, outperforming cross-entropy in low-data and imbalanced regimes with better calibration and OOD detection.
- StableTTA: Improving Vision Model Performance by Training-free Test-Time Adaptation Methods · StableTTA improves ImageNet-1K accuracy across 71 vision models by stabilizing logit aggregation under coherent-batch inference and enabling efficient single-forward-pass adaptation.
- Revisiting Feature Prediction for Learning Visual Representations from Video · V-JEPA models trained only on feature prediction from 2 million public videos achieve 81.9% on Kinetics-400, 72.2% on Something-Something-v2, and 77.9% on ImageNet-1K using frozen ViT-H/16 backbones.
- Self-Play Fine-Tuning Converts Weak Language Models to Strong Language Models · SPIN lets weak LLMs become strong by self-generating training data from previous model versions and training to prefer human-annotated responses over its own outputs, outperforming DPO even with extra GPT-4 data on be...
- Rethinking Atrous Convolution for Semantic Image Segmentation · DeepLabv3 improves semantic segmentation by capturing multi-scale context with cascaded or parallel atrous convolutions and adding global context to ASPP, achieving better results on PASCAL VOC 2012 without DenseCRF p...
- SGDR: Stochastic Gradient Descent with Warm Restarts · SGDR uses periodic warm restarts of the learning rate in SGD to reach new state-of-the-art error rates of 3.14% on CIFAR-10 and 16.21% on CIFAR-100.
- Taming the Long Tail: Rebalancing Adversarial Training via Adaptive Perturbation · RobustLT adaptively adjusts perturbations in adversarial training to simultaneously improve robustness and class balance on long-tailed datasets.
- A Composite Activation Function for Learning Stable Binary Representations · HTAF is a sigmoid-tanh composite that approximates the Heaviside function to allow stable gradient training of binary activation networks, yielding ICBMs with stable discretization and competitive performance on image tasks.
- Memory Efficient Full-gradient Attacks (MEFA) Framework for Adversarial Defense Evaluations · MEFA enables exact full-gradient white-box attacks on iterative stochastic purification defenses like diffusion and Langevin EBMs by trading recomputation for lower memory, revealing vulnerabilities missed by approxim...
- Generative Cross-Entropy: A Strictly Proper Loss for Data-Efficient Classification · Generative Cross-Entropy loss improves both accuracy and calibration over standard cross-entropy by augmenting it with a generative p(x|y) term, especially on long-tailed data, and pairs with adaptive temperature scal...
- Foundations of Reliable Inference: Reliability-Efficiency Co-Design · A unified framework is developed for co-designing reliability and efficiency to enable efficient reliable inference with trustworthy uncertainty quantification in AI models.
- JEPAMatch: Geometric Representation Shaping for Semi-Supervised Learning · JEPAMatch augments FlexMatch with LeJEPA-derived latent regularization to produce better-structured representations, yielding higher accuracy and faster convergence on CIFAR-100, STL-10, and Tiny-ImageNet.
- Image Classification via Random Dilated Convolution with Multi-Branch Feature Extraction and Context Excitation · RDCNet reports state-of-the-art accuracy on CIFAR-10, CIFAR-100, SVHN, Imagenette, and Imagewoof by combining random dilated convolutions with multi-branch and attention modules.
- A Transfer Learning Evaluation of Deep Neural Networks for Image Classification · Empirical comparison of transfer learning performance across eleven pre-trained models on five image datasets using accuracy, time, and size metrics.
Reference graph
Works this paper leans on
- [1] Yoshua Bengio and Xavier Glorot. Understanding the difficulty of training deep feedforward neural networks. In Proceedings of AISTATS 2010, volume 9, pages 249–256, May 2010.
- [2] Yoshua Bengio and Yann LeCun. Scaling learning algorithms towards AI. In Léon Bottou, Olivier Chapelle, D. DeCoste, and J. Weston, editors, Large Scale Kernel Machines. MIT Press, 2007.
- [3] Monica Bianchini and Franco Scarselli. On the complexity of shallow and deep neural network classifiers. In 22nd European Symposium on Artificial Neural Networks, ESANN 2014, Bruges, Belgium, April 23-25, 2014.
- [4] T. Chen, I. Goodfellow, and J. Shlens. Net2Net: Accelerating learning via knowledge transfer. In International Conference on Learning Representations, 2016.
- [5] Djork-Arné Clevert, Thomas Unterthiner, and Sepp Hochreiter. Fast and accurate deep network learning by exponential linear units (ELUs). CoRR, abs/1511.07289, 2015.
- [6] R. Collobert, K. Kavukcuoglu, and C. Farabet. Torch7: A Matlab-like environment for machine learning. In BigLearn, NIPS Workshop, 2011.
- [7] Spyros Gidaris and Nikos Komodakis. LocNet: Improving localization accuracy for object detection. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016.
- [8] Ian J. Goodfellow, David Warde-Farley, Mehdi Mirza, Aaron Courville, and Yoshua Bengio. Maxout networks. In Sanjoy Dasgupta and David McAllester, editors, Proceedings of the 30th International Conference on Machine Learning (ICML'13), pages 1319–1327, 2013.
- [9] Benjamin Graham. Fractional max-pooling. arXiv:1412.6071, 2014.
- [10] Sam Gross and Michael Wilber. Training and investigating residual nets, 2016. URL https://github.com/facebook/fb.resnet.torch
- [11] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. CoRR, abs/1512.03385, 2015.
- [12] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Delving deep into rectifiers: Surpassing human-level performance on ImageNet classification. CoRR, abs/1502.01852, 2015.
- [13] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Identity mappings in deep residual networks. CoRR, abs/1603.05027, 2016.
- [14] Gao Huang, Yu Sun, Zhuang Liu, Daniel Sedra, and Kilian Q. Weinberger. Deep networks with stochastic depth. CoRR, abs/1603.09382, 2016.
- [15] Sergey Ioffe and Christian Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In David Blei and Francis Bach, editors, Proceedings of the 32nd International Conference on Machine Learning (ICML-15), pages 448–456. JMLR Workshop and Conference Proceedings, 2015.
- [16] A. Krizhevsky, I. Sutskever, and G. Hinton. ImageNet classification with deep convolutional neural networks. In NIPS, 2012.
- [17] Alex Krizhevsky, Vinod Nair, and Geoffrey Hinton. CIFAR-10 (Canadian Institute for Advanced Research), 2012. URL http://www.cs.toronto.edu/~kriz/cifar.html
- [18] Hugo Larochelle, Dumitru Erhan, Aaron Courville, James Bergstra, and Yoshua Bengio. An empirical evaluation of deep architectures on problems with many factors of variation. In Zoubin Ghahramani, editor, Proceedings of the 24th International Conference on Machine Learning (ICML'07), pages 473–480. ACM, 2007.
- [19] C.-Y. Lee, S. Xie, P. Gallagher, Z. Zhang, and Z. Tu. Deeply-supervised nets. 2014.
- [20] Min Lin, Qiang Chen, and Shuicheng Yan. Network in network. CoRR, abs/1312.4400, 2013.
- [21] Francisco Massa. Optnet: reducing memory usage in Torch neural networks, 2016. URL https://github.com/fmassa/optimize-net
- [22] Guido F. Montúfar, Razvan Pascanu, KyungHyun Cho, and Yoshua Bengio. On the number of linear regions of deep neural networks. In Advances in Neural Information Processing Systems 27: Annual Conference on Neural Information Processing Systems 2014, December 8-13 2014, Montreal, Quebec, Canada, pages 2924–2932, 2014.
- [23] Tapani Raiko, Harri Valpola, and Yann LeCun. Deep learning made easier by linear transformations in perceptrons. In Neil D. Lawrence and Mark A. Girolami, editors, Proceedings of the Fifteenth International Conference on Artificial Intelligence and Statistics (AISTATS-12), volume 22, pages 924–932, 2012.
- [24] Adriana Romero, Nicolas Ballas, Samira Ebrahimi Kahou, Antoine Chassang, Carlo Gatta, and Yoshua Bengio. FitNets: Hints for thin deep nets. Technical Report arXiv:1412.6550, arXiv, 2014.
- [25] J. Schmidhuber. Learning complex, extended sequences using the principle of history compression. Neural Computation, 4(2):234–242, 1992.
- [26] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. In ICLR, 2015.
- [27] N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov. Dropout: A simple way to prevent neural networks from overfitting. JMLR, 2014.
- [28] Rupesh Kumar Srivastava, Klaus Greff, and Jürgen Schmidhuber. Highway networks. CoRR, abs/1505.00387, 2015.
- [29] Ilya Sutskever, James Martens, George E. Dahl, and Geoffrey E. Hinton. On the importance of initialization and momentum in deep learning. In Sanjoy Dasgupta and David McAllester, editors, Proceedings of the 30th International Conference on Machine Learning (ICML-13), volume 28, pages 1139–1147. JMLR Workshop and Conference Proceedings, May 2013.
- [30] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich. Going deeper with convolutions. In CVPR, 2015.
- [31] Christian Szegedy, Sergey Ioffe, and Vincent Vanhoucke. Inception-v4, Inception-ResNet and the impact of residual connections on learning. CoRR, abs/1602.07261, 2016.
- [32] S. Zagoruyko, A. Lerer, T.-Y. Lin, P. O. Pinheiro, S. Gross, S. Chintala, and P. Dollár. A multipath network for object detection. In BMVC, 2016.