
arxiv: 1907.10473 · v1 · pith:VQDQKYLAnew · submitted 2019-07-22 · 💻 cs.CV · cs.LG

Switchable Normalization for Learning-to-Normalize Deep Representation

Pith reviewed 2026-05-24 18:02 UTC · model grok-4.3

classification 💻 cs.CV cs.LG
keywords Switchable Normalization · batch normalization · layer normalization · channel normalization · deep neural networks · computer vision · normalization techniques · ImageNet

The pith

Switchable Normalization lets each layer learn to weight among channel, layer, and minibatch statistics.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces Switchable Normalization to address the problem of selecting normalization methods in deep networks. It computes importance weights that let each normalization layer choose or combine three scopes for calculating means and variances: channel, layer, or minibatch. This selection happens end-to-end during training and requires no extra hyperparameters. A reader would care because fixed normalizers like batch norm degrade with small batches and alternatives like group norm demand manual tuning of group count. The method is shown to improve results on ImageNet classification plus detection, segmentation, face recognition, and action recognition benchmarks while remaining stable across batch sizes.

Core claim

Switchable Normalization (SN) learns a set of importance weights for each normalization layer so that the layer can automatically select or blend among three distinct scopes for computing normalization statistics: channel-wise, layer-wise, and minibatch-wise. The weights are optimized jointly with the rest of the network parameters, allowing the choice of normalizer to adapt to network architecture, task, and data distribution without manual intervention or sensitive hyperparameters.

What carries the argument

The end-to-end learned importance weights that switch or combine channel normalization, layer normalization, and minibatch normalization for each individual layer.
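
The mechanism described above can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' released code: the function names `softmax` and `switchable_norm`, the NCHW layout, the use of a softmax to make the three weights sum to 1, and the biased variance are all assumptions made for this example. Separate weights are used for means and for variances, following the figure captions that show distinct importance weights for µ and σ.

```python
import numpy as np

def softmax(v):
    """Numerically stable softmax over a 1-D array."""
    e = np.exp(v - np.max(v))
    return e / e.sum()

def switchable_norm(x, lam_mean, lam_var, gamma=1.0, beta=0.0, eps=1e-5):
    """Blend channel (IN), layer (LN), and minibatch (BN) statistics.

    x        : array of shape (N, C, H, W)
    lam_mean : 3 logits for the mean weights, ordered (IN, LN, BN)
    lam_var  : 3 logits for the variance weights, ordered (IN, LN, BN)
    All names and the softmax parameterization are illustrative assumptions.
    """
    # Channel-wise scope: one statistic per sample and channel.
    mu_in = x.mean(axis=(2, 3), keepdims=True)
    var_in = x.var(axis=(2, 3), keepdims=True)
    # Layer-wise scope: one statistic per sample.
    mu_ln = x.mean(axis=(1, 2, 3), keepdims=True)
    var_ln = x.var(axis=(1, 2, 3), keepdims=True)
    # Minibatch-wise scope: one statistic per channel.
    mu_bn = x.mean(axis=(0, 2, 3), keepdims=True)
    var_bn = x.var(axis=(0, 2, 3), keepdims=True)

    # Importance weights: non-negative and summing to 1 within each set.
    w_mean = softmax(np.asarray(lam_mean, dtype=float))
    w_var = softmax(np.asarray(lam_var, dtype=float))

    # Broadcasting expands every statistic to shape (N, C, 1, 1).
    mu = w_mean[0] * mu_in + w_mean[1] * mu_ln + w_mean[2] * mu_bn
    var = w_var[0] * var_in + w_var[1] * var_ln + w_var[2] * var_bn
    return gamma * (x - mu) / np.sqrt(var + eps) + beta

# Usage: equal logits give an equal blend of the three scopes.
x = np.random.default_rng(0).normal(size=(2, 3, 4, 4))
y = switchable_norm(x, np.zeros(3), np.zeros(3))
print(y.shape)
```

In a real network the two logit vectors would be trainable parameters updated by back-propagation together with the other weights; here they are plain inputs so that the blending step is easy to inspect.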

If this is right

  • SN maintains accuracy when minibatch size drops to two images per GPU, unlike standard batch normalization.
  • SN removes the need to search over the number of groups required by group normalization.
  • Different layers inside the same network can and do converge to different preferred normalizers.
  • SN is reported to outperform its fixed-normalizer counterparts on ImageNet, COCO, Cityscapes, ADE20K, MegaFace, and Kinetics without architecture-specific changes.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The results imply that the best normalization scope is often position-dependent within a network rather than uniform across all layers.
  • The same switching mechanism could be applied to additional normalization variants or to statistics computed over other dimensions such as spatial regions.
  • In new domains with unusual batch sizes or data statistics, the learned weights might serve as a diagnostic for which normalizer properties matter most.
  • One could measure whether the final importance weights correlate with network depth or with properties of the input distribution.

Load-bearing premise

End-to-end optimization of the importance weights will discover an effective per-layer selection rather than collapsing to an average or overfitting the training distribution.
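
One way to probe this premise is to measure how far each layer's learned weights sit from the uniform blend (1/3, 1/3, 1/3). The sketch below is a hypothetical diagnostic, not something taken from the paper: the function name `uniformity_gap` and the choice of KL divergence from the uniform distribution are assumptions made for illustration.

```python
import numpy as np

def uniformity_gap(weights):
    """KL divergence from the uniform distribution to one layer's weights.

    Returns 0 for a uniform blend and log(k) for a one-hot selection over
    k normalizers. The name and the metric are illustrative assumptions.
    """
    w = np.asarray(weights, dtype=float)
    w = w / w.sum()                    # renormalize defensively
    k = w.size
    nz = w > 0                         # treat 0 * log(0) as 0
    return float(np.sum(w[nz] * np.log(w[nz] * k)))

# Usage: compare a near-uniform layer with a layer that mostly picks one scope.
print(uniformity_gap([0.34, 0.33, 0.33]))
print(uniformity_gap([0.90, 0.05, 0.05]))
```

A gap near zero across all layers would suggest collapse to an average, while larger values indicate that a layer has committed to one scope. This diagnostic says nothing about overfitting, which would still need a held-out comparison.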

What would settle it

Train an identical architecture with SN and with a single fixed normalizer on ImageNet; if the fixed version matches or exceeds SN accuracy while using the same batch size and training schedule, the benefit of the learned switch is not supported.

Figures

Figures reproduced from arXiv: 1907.10473 by Jiamin Ren, Jingyu Li, Ping Luo, Ruimao Zhang, Zhanglin Peng.

Figure 1
[PITH_FULL_IMAGE:figures/full_fig_p001_1.png]
Figure 2
Figure 2: Geometric view of the directions and lengths of the filters in WN, IN, BN, LN, and SN. These normalizers are compared in a unified way by representing them with WN, which decomposes the optimization of filters into their directions and lengths. In this view, IN is identical to WN that sets the filter norm to 1 (i.e. ‖w1‖ = ‖w2‖ = 1) and then rescales it to γ. LN is less constrained than IN and WN to increase lear…
Figure 3
Figure 3: Comparisons of learning curves. (a) visualizes the validation curves of SN with different settings of batch size. The bracket (·, ·) denotes (#GPUs, #samples per GPU). (b) compares the top-1 train and validation curves on ImageNet of SN, BN, and GN in the batch size of (8,32). (c) compares the train and validation curves of SN and GN in the batch size of (8,2).
Figure 4
Figure 4: Importance weights vs. batch sizes. The bracket (·, ·) indicates (#GPUs, #samples per GPU). SN doesn’t have BN in (8, 1). Furthermore, we repeat training of ResNet50 several times on ImageNet, to show that when the network, task, batch setting and data are fixed, the importance weights of SN are not sensitive to the change of training protocols such as solver, parameter initialization, and learning rate d…
Figure 5
Figure 5: Selected operations of each SN layer in ResNet50. There are 53 SN layers. (a,b) show the importance weights for µ and σ of (8, 32), while (c,d) show those of (8, 2). The y-axis represents the importance weights that sum to 1, while the x-axis shows different residual blocks of ResNet50. The SN layers in different places are highlighted differently. For example, the SN layers that follow the 3×3 conv layers are …
Figure 6
Figure 6: Comparisons of ‘BN’, ‘SN with moving average’, and ‘SN with batch average’, when training ResNet50 on ImageNet in (8, 32). We see that SN with batch average produces more stable convergence than the other methods. Hard ratios are relatively stable, although the soft ratios vary during training. A hard ratio is a sparse vector obtained by applying the max function to λ_z^µ or λ_z^σ, such as max(λ_z^σ), that …
Figure 7
Figure 7: Ratios of (a) λ_z^µ and (b) λ_z^σ in ResNet50+SN(8,32) for each normalization layer for 100 epochs, as well as (c) their divergence D(λ_z^µ ‖ λ_z^σ). Receptive field (RF) of each layer is given (53 normalization layers in total). The last 6 subfigures at the 4th, 8th, and 12th row show results of different ranges of RF including ‘RF<49’, ‘49∼99’, ‘99∼199’, ‘199∼299’, ‘299∼427’, and ‘ALL’ (i.e. 7∼427).
Figure 8
Figure 8: Hard ratios for variance (σ) and mean (µ) including BN (green), IN (blue), and LN (red). Snapshots of ResNet50 trained after 30 (top), 60 (middle), and 90 (bottom) epochs are shown. RF is given for each layer (53 normalization layers in total). A bar with slashes denotes SN after a 3 × 3 conv layer (the others are 1 × 1 conv). A black square ‘■’ indicates SN at the shortcut. It is best viewed at 200% zoom.
Figure 9
Figure 9: Average precision (AP) curves of Faster R-CNN on the 2017 val set of COCO. (a) plots the results of finetuning pretrained networks. (b) shows training the models from scratch.

    backbone  head  AP    AP.5  AP.75  APl   APm   APs
    BN†       –     36.7  58.4  39.6   48.1  39.8  21.1
    BN†       GN    37.2  58.0  40.4   48.6  40.3  21.6
    BN†       SN    38.0  59.4  41.5   48.9  41.3  22.7
    GN        GN    38.2  58.7  41.3   49.6  41.0  22.4
    SN        SN    39.3  60.9  42.8   50.3  42.7  23.5

    TABLE 5: Fas…
Figure 12
Figure 12: Selected normalizers of each SN layer in ResNet50 for semantic image parsing in ADE20K and Cityscapes. There are 53 SN layers. (a,b) show the importance weights for µ and σ of (8, 2) in ADE20K, while (c,d) show those of (8, 2) in Cityscapes. The y-axis represents the importance weights that sum to 1, while the x-axis shows different residual blocks of ResNet50. The SN layers in different places are highli…
Figure 13
Figure 13: Ratios for detection and segmentation including BN (orange), IN (green), and LN (red). We show λ_z^µ and λ_z^σ in ResNet50+SN(8,2) finetuned to (a) COCO, (b) Cityscapes, and (c) ADE20K. On one hand, this could be attributed to PSPNet providing a superior baseline compared with DeepLab. On the other hand, the spatial pyramid pooling (i.e. one type of multi-scale global pooling) may make IN and LN unstable …
Figure 14
Figure 14: Finetuning ResNet50+SN in ADE20K. controller is an LSTM whose parameters are trained by using the REINFORCE [60] algorithm to sample a cell architecture, while a child model is a CNN that stacks many sampled cell architectures and its parameters are trained by back-propagation with SGD. In [14], the LSTM controller is learned to produce an architecture with high reward, which is the classification accuracy…
Figure 15
Figure 15: Results of Image Stylization. The first column visualizes the content and the style images. The second and third columns are the results of IN and SN respectively. SN works comparably well with IN in this task. [PITH_FULL_IMAGE:figures/full_fig_p015_15.png]
Figure 16
Figure 16: (a) shows the losses of BN, IN, and SN in the task of image stylization. SN converges faster than IN and BN. As shown in Fig.1 and the supplementary material, SN adapts its importance weight to IN while producing comparable stylization results. (b) plots the accuracy on the validation set of CIFAR-10 when searching network architectures. Investigating SN facilitates the understanding of normalization appr…
Original abstract

We address a learning-to-normalize problem by proposing Switchable Normalization (SN), which learns to select different normalizers for different normalization layers of a deep neural network. SN employs three distinct scopes to compute statistics (means and variances) including a channel, a layer, and a minibatch. SN switches between them by learning their importance weights in an end-to-end manner. It has several good properties. First, it adapts to various network architectures and tasks. Second, it is robust to a wide range of batch sizes, maintaining high performance even when small minibatch is presented (e.g. 2 images/GPU). Third, SN does not have sensitive hyper-parameter, unlike group normalization that searches the number of groups as a hyper-parameter. Without bells and whistles, SN outperforms its counterparts on various challenging benchmarks, such as ImageNet, COCO, CityScapes, ADE20K, MegaFace, and Kinetics. Analyses of SN are also presented to answer the following three questions: (a) Is it useful to allow each normalization layer to select its own normalizer? (b) What impacts the choices of normalizers? (c) Do different tasks and datasets prefer different normalizers? We hope SN will help ease the usage and understand the normalization techniques in deep learning. The code of SN has been released at https://github.com/switchablenorms.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 2 minor

Summary. The manuscript proposes Switchable Normalization (SN), which learns per-layer importance weights to select among three normalization scopes (channel/instance, layer, and minibatch) in an end-to-end fashion. The central claims are that SN adapts automatically to architectures and tasks, remains robust even at very small batch sizes (e.g., 2 images/GPU), requires no sensitive hyperparameters, and outperforms standard BN/IN/LN baselines on ImageNet, COCO, CityScapes, ADE20K, MegaFace, and Kinetics. The authors also present analyses addressing (a) whether per-layer selection is useful, (b) what influences normalizer choice, and (c) whether tasks/datasets prefer different normalizers. Public code release is noted.

Significance. If the empirical results and analyses hold, SN would be a practically useful contribution by removing the need to manually choose or tune normalization methods while delivering measurable gains across detection, segmentation, recognition, and video tasks. The released code is a clear strength for reproducibility. The three targeted analyses directly engage the question of whether learned switchability adds value beyond a fixed mixture, which mitigates the primary stress-test concern.

major comments (1)
  1. [§5] §5 (Analyses): the paper reports learned weight distributions and per-layer selections that are visibly non-uniform across layers and tasks; this directly addresses the possibility that optimization collapses to uniform averaging. No further control experiment against a fixed (1/3,1/3,1/3) mixture is required for the central claim once these distributions are shown.
minor comments (2)
  1. [Table 1, Figure 3] Table 1 and Figure 3: axis labels and legend entries for the three scopes should be standardized to the same terminology used in §3 (e.g., “IN”, “LN”, “BN”) to avoid reader confusion.
  2. [§4.2] §4.2: the statement that SN “does not have sensitive hyper-parameter” should be qualified by noting that the three-scope formulation itself is a modeling choice; a brief sentence acknowledging this would improve precision.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the positive review and recommendation for minor revision. We address the single major comment below.

Point-by-point responses
  1. Referee: [§5] §5 (Analyses): the paper reports learned weight distributions and per-layer selections that are visibly non-uniform across layers and tasks; this directly addresses the possibility that optimization collapses to uniform averaging. No further control experiment against a fixed (1/3,1/3,1/3) mixture is required for the central claim once these distributions are shown.

    Authors: We appreciate the referee's assessment that the analyses in Section 5 are sufficient. The reported non-uniform weight distributions and per-layer selections across architectures and tasks demonstrate that SN does not collapse to uniform averaging, thereby supporting the value of learned switchability without requiring an additional fixed-mixture control experiment. revision: no

Circularity Check

0 steps flagged

No significant circularity in SN derivation or claims

full rationale

The paper defines Switchable Normalization via end-to-end learned importance weights over three normalization scopes (channel, layer, minibatch). Performance claims on ImageNet, COCO and other benchmarks are presented as empirical outcomes of this optimization, not as quantities forced equal to the inputs by any equation or self-citation chain. No load-bearing step reduces a prediction to a fitted constant, renames a known result, or imports uniqueness from prior author work. The method remains self-contained against external benchmarks; the reader's assessment of minor non-load-bearing self-citation aligns with a score of 0 here.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the empirical superiority of the learned switching mechanism; the only background assumptions are standard deep-learning premises that normalization stabilizes training and that end-to-end optimization can discover useful per-layer choices.

axioms (1)
  • domain assumption: Normalization layers improve training stability and generalization in deep neural networks
    Invoked implicitly as the motivation for comparing normalizers; standard premise in the field.

pith-pipeline@v0.9.0 · 5785 in / 1261 out tokens · 52633 ms · 2026-05-24T18:02:36.678868+00:00 · methodology


Reference graph

Works this paper leans on

62 extracted references · 62 canonical work pages · 12 internal anchors

  1. [1]

    Batch normalization: Accelerating deep network training by reducing internal covariate shift,

    S. Ioffe and C. Szegedy, “Batch normalization: Accelerating deep network training by reducing internal covariate shift,” in ICML, 2015

  2. [2]

    Instance Normalization: The Missing Ingredient for Fast Stylization

    D. Ulyanov, A. Vedaldi, and V. Lempitsky, “Instance normalization: The missing ingredient for fast stylization,” arXiv:1607.08022, 2016

  3. [3]

    Layer Normalization

    J. L. Ba, J. R. Kiros, and G. E. Hinton, “Layer normalization,” arXiv:1607.06450, 2016

  4. [4]

    Perceptual losses for real-time style transfer and super-resolution,

    J. Johnson, A. Alahi, and L. Fei-Fei, “Perceptual losses for real-time style transfer and super-resolution,” in ECCV, 2016

  5. [5]

    Group normalization,

    Y. Wu and K. He, “Group normalization,” in ECCV, 2018

  6. [6]

    Deep residual learning for image recognition,

    K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in CVPR, 2016

  7. [7]

    Imagenet: A large-scale hierarchical image database,

    J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei, “Imagenet: A large-scale hierarchical image database,” in CVPR, 2009

  8. [8]

    Imagenet large scale visual recognition challenge,

    O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein et al., “Imagenet large scale visual recognition challenge,” International Journal of Computer Vision, vol. 115, no. 3, pp. 211–252, 2015

  9. [9]

    Microsoft coco: Common objects in context,

    T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick, “Microsoft coco: Common objects in context,” in ECCV, 2014

  10. [10]

    The cityscapes dataset for semantic urban scene understanding,

    M. Cordts, M. Omran, S. Ramos, T. Rehfeld, M. Enzweiler, R. Benenson, U. Franke, S. Roth, and B. Schiele, “The cityscapes dataset for semantic urban scene understanding,” in CVPR, 2016

  11. [11]

    Scene parsing through ADE20K dataset,

    B. Zhou, H. Zhao, X. Puig, S. Fidler, A. Barriuso, and A. Torralba, “Scene parsing through ADE20K dataset,” in CVPR, 2017

  12. [12]

    The megaface benchmark: 1 million faces for recognition at scale,

    I. Kemelmacher-Shlizerman, S. M. Seitz, D. Miller, and E. Brossard, “The megaface benchmark: 1 million faces for recognition at scale,” in CVPR, 2016

  13. [13]

    The Kinetics Human Action Video Dataset

    W. Kay, J. Carreira, K. Simonyan, B. Zhang, C. Hillier, S. Vijayanarasimhan, F. Viola, T. Green, T. Back, P. Natsev et al., “The kinetics human action video dataset,” arXiv:1705.06950, 2017

  14. [14]

    Efficient Neural Architecture Search via Parameter Sharing

    H. Pham, M. Y. Guan, B. Zoph, Q. V. Le, and J. Dean, “Efficient neural architecture search via parameter sharing,” arXiv:1802.03268, 2018

  15. [15]

    Batch renormalization: Towards reducing minibatch dependence in batch-normalized models,

    S. Ioffe, “Batch renormalization: Towards reducing minibatch dependence in batch-normalized models,” in NIPS, 2017

  16. [16]

    Batch kalman normalization: Towards training deep neural networks with micro-batches,

    G. Wang, J. Peng, P. Luo, X. Wang, and L. Lin, “Batch kalman normalization: Towards training deep neural networks with micro-batches,” NIPS, 2018

  17. [17]

    Weight normalization: A simple reparameterization to accelerate training of deep neural networks,

    T. Salimans and D. P. Kingma, “Weight normalization: A simple reparameterization to accelerate training of deep neural networks,” in NIPS, 2016

  18. [18]

    Spectral normalization for generative adversarial networks,

    T. Miyato, T. Kataoka, M. Koyama, and Y. Yoshida, “Spectral normalization for generative adversarial networks,” in ICLR, 2018

  19. [19]

    Natural neural networks,

    G. Desjardins, K. Simonyan, R. Pascanu, and K. Kavukcuoglu, “Natural neural networks,” NIPS, 2015

  20. [20]

    Learning deep architectures via generalized whitened neural networks,

    P. Luo, “Learning deep architectures via generalized whitened neural networks,” ICML, 2017

  21. [21]

    Methods of information geometry

    S.-i. Amari and H. Nagaoka, Methods of information geometry. American Mathematical Soc., 2007, vol. 191

  22. [22]

    The Shattered Gradients Problem: If resnets are the answer, then what is the question?

    D. Balduzzi, M. Frean, L. Leary, J. Lewis, K. W.-D. Ma, and B. McWilliams, “The shattered gradients problem: If resnets are the answer, then what is the question?” arXiv preprint arXiv:1702.08591, 2017

  23. [23]

    How does batch normalization help optimization? (no, it is not about internal covariate shift),

    S. Santurkar, D. Tsipras, A. Ilyas, and A. Madry, “How does batch normalization help optimization? (no, it is not about internal covariate shift),” in NIPS, 2018

  24. [24]

    Norm matters: efficient and accurate normalization schemes in deep networks,

    E. Hoffer, R. Banner, I. Golan, and D. Soudry, “Norm matters: efficient and accurate normalization schemes in deep networks,” in NIPS, 2018

  25. [25]

    Towards understanding regularization in batch normalization,

    P. Luo, X. Wang, W. Shao, and Z. Peng, “Towards understanding regularization in batch normalization,” ICLR, 2019

  26. [26]

    A mean field theory of batch normalization,

    G. Yang, J. Pennington, V. Rao, J. Sohl-Dickstein, and S. S. Schoenholz, “A mean field theory of batch normalization,” in ICLR, 2019

  27. [27]

    Theoretical analysis of auto rate-tuning by batch normalization,

    S. Arora, Z. Li, and K. Lyu, “Theoretical analysis of auto rate-tuning by batch normalization,” in ICLR, 2019

  28. [28]

    Understanding batch normalization,

    J. Bjorck, C. Gomes, B. Selman, and K. Q. Weinberger, “Understanding batch normalization,” in NIPS, 2018

  29. [29]

    Normalizing the normalizers: Comparing and extending network normalization schemes,

    M. Ren, R. Liao, R. Urtasun, F. H. Sinz, and R. S. Zemel, “Normalizing the normalizers: Comparing and extending network normalization schemes,” in ICLR, 2016. IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE 16

  30. [30]

    Two at once: enhancing learning and generalization capacities via ibn-net,

    X. Pan, P. Luo, J. Shi, and X. Tang, “Two at once: enhancing learning and generalization capacities via ibn-net,” in ECCV, 2018

  31. [31]

    Differentiable dynamic normalization for learning deep representation,

    P. Luo, P. Zhanglin, S. Wenqi, Z. Ruimao, R. Jiamin, and W. Lingyun, “Differentiable dynamic normalization for learning deep representation,” in ICML, 2019

  32. [32]

    Streaming Normalization: Towards Simpler and More Biologically-plausible Normalizations for Online and Recurrent Learning

    Q. Liao, K. Kawaguchi, and T. Poggio, “Streaming normalization: Towards simpler and more biologically-plausible normalizations for online and recurrent learning,” arXiv preprint arXiv:1610.06160, 2016

  33. [33]

    Decorrelated batch normalization,

    L. Huang, D. Yang, B. Lang, and J. Deng, “Decorrelated batch normalization,” in CVPR, 2018

  34. [34]

    Learning Visual Reasoning Without Strong Priors

    E. Perez, H. de Vries, and F. Strub, “Learning visual reasoning without strong priors,” arXiv:1707.03017, 2017

  35. [35]

    Switchable whitening for deep representation learning,

    X. Pan, X. Zhan, J. Shi, X. Tang, and P. Luo, “Switchable whitening for deep representation learning,” arXiv, 2019

  36. [36]

    Centered weight normalization in accelerating training of deep neural networks,

    L. Huang, X. Liu, Y. Liu, B. Lang, and D. Tao, “Centered weight normalization in accelerating training of deep neural networks,” in ICCV, 2017

  37. [37]

    Generative adversarial nets,

    I. J. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. C. Courville, and Y. Bengio, “Generative adversarial nets,” in NIPS, 2014

  38. [38]

    Eigennet: Towards fast and structural learning of deep neural networks,

    P. Luo, “Eigennet: Towards fast and structural learning of deep neural networks,” IJCAI, 2017

  39. [39]

    An overview of bilevel optimization,

    B. Colson, P. Marcotte, and G. Savard, “An overview of bilevel optimization,” Annals of operations research, 2007

  40. [40]

    Gradient-based hyperparameter optimization through reversible learning,

    D. Maclaurin, D. Duvenaud, and R. Adams, “Gradient-based hyperparameter optimization through reversible learning,” ICML, 2015

  41. [41]

    DARTS: Differentiable Architecture Search

    H. Liu, K. Simonyan, and Y. Yang, “Darts: Differentiable architecture search,” arXiv:1806.09055, 2018

  42. [42]

    Arbitrary style transfer in real-time with adaptive instance normalization,

    X. Huang and S. Belongie, “Arbitrary style transfer in real-time with adaptive instance normalization,” in ICCV, 2017

  43. [43]

    Improved texture networks: Maximizing quality and diversity in feed-forward stylization and texture synthesis,

    D. Ulyanov, A. Vedaldi, and V. Lempitsky, “Improved texture networks: Maximizing quality and diversity in feed-forward stylization and texture synthesis,” in CVPR, 2017

  44. [44]

    Imagenet classification with deep convolutional neural networks,

    A. Krizhevsky, I. Sutskever, and G. E. Hinton, “Imagenet classification with deep convolutional neural networks,” in NIPS, 2012

  45. [45]

    Accurate, Large Minibatch SGD: Training ImageNet in 1 Hour

    P. Goyal, P. Dollár, R. Girshick, P. Noordhuis, L. Wesolowski, A. Kyrola, A. Tulloch, Y. Jia, and K. He, “Accurate, large minibatch sgd: Training imagenet in 1 hour,” arXiv:1706.02677, 2017

  46. [46]

    Ssn: Learning sparse switchable normalization via sparsestmax,

    W. Shao, T. Meng, J. Li, R. Zhang, Y. Li, X. Wang, and P. Luo, “Ssn: Learning sparse switchable normalization via sparsestmax,” in CVPR, 2019

  47. [47]

    Faster r-cnn: Towards real-time object detection with region proposal networks,

    S. Ren, K. He, R. Girshick, and J. Sun, “Faster r-cnn: Towards real-time object detection with region proposal networks,” in NIPS, 2015

  48. [48]

    Feature Pyramid Networks for Object Detection

    T.-Y. Lin, P. Dollár, R. Girshick, K. He, B. Hariharan, and S. Belongie, “Feature pyramid networks for object detection,” arXiv:1612.03144, 2016

  49. [49]

    Mask r-cnn,

    K. He, G. Gkioxari, P. Dollár, and R. Girshick, “Mask r-cnn,” ICCV, 2017

  50. [50]

    Detectron,

    R. Girshick, I. Radosavovic, G. Gkioxari, P. Dollár, and K. He, “Detectron,” https://github.com/facebookresearch/detectron, 2018

  51. [51]

    A faster pytorch implementation of faster r-cnn,

    J. Yang, J. Lu, D. Batra, and D. Parikh, “A faster pytorch implementation of faster r-cnn,” https://github.com/jwyang/faster-rcnn.pytorch, 2017

  52. [52]

    Megdet: A large mini-batch object detector,

    C. Peng, T. Xiao, Z. Li, Y. Jiang, X. Zhang, K. Jia, G. Yu, and J. Sun, “Megdet: A large mini-batch object detector,” in CVPR, 2018

  53. [53]

    Deeplab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected crfs,

    L. Chen, G. Papandreou, I. Kokkinos, K. Murphy, and A. L. Yuille, “Deeplab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected crfs,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 40, no. 4, pp. 834–848, 2018

  54. [54]

    Pyramid scene parsing network,

    H. Zhao, J. Shi, X. Qi, X. Wang, and J. Jia, “Pyramid scene parsing network,” in CVPR, 2017

  55. [55]

    Arcface: Additive angular margin loss for deep face recognition

    J. Deng, J. Guo, N. Xue, and S. Zafeiriou, “Arcface: Additive angular margin loss for deep face recognition,” arXiv preprint arXiv:1801.07698, 2018

  56. [56]

    Sphereface: Deep hypersphere embedding for face recognition,

    W. Liu, Y. Wen, Z. Yu, M. Li, B. Raj, and L. Song, “Sphereface: Deep hypersphere embedding for face recognition,” in CVPR, 2017

  57. [57]

    Cosface: Large margin cosine loss for deep face recognition,

    H. Wang, Y. Wang, Z. Zhou, X. Ji, D. Gong, J. Zhou, Z. Li, and W. Liu, “Cosface: Large margin cosine loss for deep face recognition,” in CVPR, 2018

  58. [58]

    Quo Vadis, Action Recognition? A New Model and the Kinetics Dataset

    J. Carreira and A. Zisserman, “Quo vadis, action recognition? a new model and the kinetics dataset,” arXiv:1705.07750, 2017

  59. [59]

    Very Deep Convolutional Networks for Large-Scale Image Recognition

    K. Simonyan and A. Zisserman, “Very deep convolutional networks for large-scale image recognition,” arXiv:1409.1556, 2014

  60. [60]

    Simple statistical gradient-following algorithms for connectionist reinforcement learning,

    R. J. Williams, “Simple statistical gradient-following algorithms for connectionist reinforcement learning,” Machine Learning, 1992

  61. [61]

    Learning multiple layers of features from tiny images,

    A. Krizhevsky, “Learning multiple layers of features from tiny images,” Technical report, 2009

  62. [62]

    Batch normalized recurrent neural networks,

    C. Laurent, G. Pereyra, P. Brakel, Y. Zhang, and Y. Bengio, “Batch normalized recurrent neural networks,” in ICASSP, 2016.

    APPENDIX A: BACK-PROPAGATION OF SN. In practice, the back-propagation (BP) stage can be computed by auto differentiation (AD). For the software without AD, we provide the backward computations of SN for single GPU and multiple GPUs as...