pith. machine review for the scientific record.

arxiv: 2605.10989 · v1 · submitted 2026-05-09 · 💻 cs.LG · cs.AI

Recognition: unknown

SURGE: Surrogate Gradient Adaptation in Binary Neural Networks

Authors on Pith: no claims yet

Pith reviewed 2026-05-13 06:34 UTC · model grok-4.3

classification 💻 cs.LG cs.AI
keywords binary neural networks · gradient mismatch · surrogate gradient · straight-through estimator · quantized neural networks · auxiliary backpropagation · gradient compensation

The pith

SURGE reduces gradient mismatch in binary neural networks by routing auxiliary full-precision gradients through a dual-path compensator and adaptive scaler.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper aims to solve the gradient mismatch that arises when training binary neural networks because the sign function used for binarization is non-differentiable. Standard straight-through estimators clip gradients to a fixed range and lose information, so the authors introduce a learnable compensation method that runs a parallel full-precision branch alongside each binarized layer. During backpropagation the dual-path compensator decomposes the output to let the full-precision path supply the missing gradient components, while an adaptive scaler dynamically balances the two paths by their norms. Experiments across image classification, object detection, and language tasks show that this yields higher accuracy than prior hand-crafted surrogate methods.
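
For orientation, here is a minimal PyTorch rendering of the clipped straight-through estimator described above; the fixed clipping range [-1, 1] is the conventional choice rather than a detail taken from this paper, and the class name is illustrative.

```python
import torch

class SignSTE(torch.autograd.Function):
    """Conventional straight-through estimator for the non-differentiable sign.

    Forward: binarize with sign(x). Backward: pass the incoming gradient
    through unchanged, but only where |x| <= 1; out-of-range gradients are
    clipped to zero. That clipping is the information loss SURGE targets.
    """

    @staticmethod
    def forward(ctx, x):
        ctx.save_for_backward(x)
        return torch.sign(x)

    @staticmethod
    def backward(ctx, grad_output):
        (x,) = ctx.saved_tensors
        return grad_output * (x.abs() <= 1).to(grad_output.dtype)
```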

Core claim

SURGE constructs, for each binarized layer, a parallel full-precision auxiliary branch whose gradients are decoupled from the binary path via output decomposition; the Dual-Path Gradient Compensator uses this branch to estimate components beyond the first-order approximation of the straight-through estimator, and the Adaptive Gradient Scaler applies a norm-based optimal scale factor to balance the two gradient streams without introducing new bias.
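
The review does not reproduce the paper's actual decomposition, so the following PyTorch sketch only illustrates the general shape of a dual-path layer under stated assumptions: the binary main path fixes the forward value, a detach-style decomposition routes training-time gradients through a full-precision auxiliary branch, and the branch is dropped at inference. The class name DualPathLinear and every implementation detail are hypothetical, not the authors' code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DualPathLinear(nn.Module):
    """Hypothetical DPGC-style layer (illustrative, not the paper's design)."""

    def __init__(self, in_features: int, out_features: int):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(out_features, in_features) * 0.01)
        self.aux = nn.Linear(in_features, out_features, bias=False)  # auxiliary branch

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # STE-style binarization: sign in the forward pass, identity gradient in backward.
        w_bin = (torch.sign(self.weight) - self.weight).detach() + self.weight
        y_main = F.linear(x, w_bin)          # binary main path
        if not self.training:
            return y_main                    # auxiliary branch is training-time only
        y_aux = self.aux(x)                  # full-precision auxiliary path
        # Output decomposition: the forward value equals y_main, but backprop
        # also routes gradients into the auxiliary branch via the y_aux term.
        return y_main + (y_aux - y_aux.detach())
```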

What carries the argument

Dual-Path Gradient Compensator (DPGC) paired with Adaptive Gradient Scaler (AGS): the DPGC adds a full-precision auxiliary branch per binarized layer and decouples gradients through output decomposition; the AGS computes a dynamic norm-based scale to balance branch contributions.
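
The optimal scale factor itself (the paper's Theorem 5.3) is not stated in the review, so the sketch below shows only the plainest norm-based balancing rule one could apply between the two gradient streams; the function name and the exact ratio are assumptions, not the AGS formula.

```python
import torch

def norm_balanced_aux_grad(grad_main: torch.Tensor,
                           grad_aux: torch.Tensor,
                           eps: float = 1e-12) -> torch.Tensor:
    """Illustrative norm-based balancing in the spirit of the AGS.

    Rescales the auxiliary gradient so its norm matches the binary-path
    gradient's norm, so neither stream dominates or vanishes. The paper's
    optimal factor (Theorem 5.3) may differ from this simple ratio.
    """
    scale = grad_main.norm() / (grad_aux.norm() + eps)
    return scale * grad_aux
```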

If this is right

  • Binary neural networks reach higher top-1 accuracy on ImageNet-scale classification than previous surrogate-gradient methods.
  • Object detectors and language models quantized to binary weights converge faster and achieve better task metrics when trained with the dual-path compensator.
  • Training stability improves because the adaptive scaler prevents the auxiliary gradient from dominating or vanishing relative to the binary path.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • The same auxiliary-branch idea could be applied to other non-differentiable operations such as hard thresholding or low-bit quantization beyond binary weights.
  • Because the compensator is added only during training, inference cost remains identical to a standard binary network.
  • The norm-based scaling rule might generalize to other multi-path gradient flows that currently rely on fixed weighting.

Load-bearing premise

The full-precision auxiliary branch can estimate the gradient components missed by the straight-through estimator without itself introducing bias or training instability.
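
One half of this premise, that attaching the auxiliary path leaves the forward computation untouched, is easy to check for the detach-style decomposition assumed in the sketches above; the snippet below verifies that property for the assumed decomposition only, not for the paper's actual construction.

```python
import torch

x = torch.randn(4, 8)
w = torch.randn(8, 8, requires_grad=True)       # binary-path weights
w_aux = torch.randn(8, 8, requires_grad=True)   # auxiliary full-precision weights

y_main = x @ torch.sign(w)                      # binary path output
y_aux = x @ w_aux                               # auxiliary path output
y = y_main + (y_aux - y_aux.detach())           # assumed output decomposition

assert torch.allclose(y, y_main)                # forward value is unchanged
y.sum().backward()
print(w_aux.grad is not None)                   # yet gradients reach the auxiliary branch
```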

What would settle it

A controlled ablation in which the auxiliary branch is removed or the scale factor is frozen shows no accuracy gain over a plain straight-through estimator on the same tasks and architectures.

Figures

Figures reproduced from arXiv: 2605.10989 by Baochang Zhang, Boyu Liu, Canyu Chen, Haoyu Huang, Linlin Yang, Xuhui Liu, Yanjing Li, Yuguang Yang, Zhongqian Fu.

Figure 1: (a-b) Activation gradient patterns without/with SURGE (left/right); (c) Gradient distribution comparison; (d) Cumulative probability of gradients. STE provides a first-order approximation for the sign function's gradient and clips out-of-range activation gradients, while SURGE compensates them with a Dual-Path Gradient Compensator (a-b). SURGE also right-shifts gradient distributions of activations (c-d), …

Figure 2: Overall architecture of SURGE. (a) Integration into common backbones (left: convolution block; right: transformer block). (b) Component details. DPGC constructs a parallel full-precision parameterized branch (auxiliary branch, shown with red arrows for forward pass and blue arrows for backpropagation) for each binarized layer (main branch, represented by black arrows in forward pass and green arrows for ba…

Figure 3: Ablation study on parameter scaling strategies. (a) is fixed scaling with constant factors across training iterations. (b) is adaptive scaling via parameter η that dynamically adjusts the compensation strength (Eq. 7). driven design (Theorem 5.3) successfully balances gradient compensation and training stability. Ablation on Gradient Compensation Scope of DPGC. We ablate the gradient compensation scope on …
Original abstract

The training of Binary Neural Networks (BNNs) is fundamentally based on gradient approximation for non-differentiable binarization operations (e.g., sign function). However, prevailing methods including the Straight-Through Estimator (STE) and its improved variants, rely on hand-crafted designs that suffer from gradient mismatch problem and information loss induced by fixed-range gradient clipping. To address this, we propose SURrogate GradiEnt Adaptation (SURGE), a novel learnable gradient compensation framework with theoretical grounding. SURGE mitigates gradient mismatch through auxiliary backpropagation. Specifically, we design a Dual-Path Gradient Compensator (DPGC) that constructs a parallel full-precision auxiliary branch for each binarized layer, decoupling gradient flow via output decomposition during backpropagation. DPGC enables bias-reduced gradient estimation by leveraging the full-precision branch to estimate components beyond STE's first-order approximation. To further enhance training stability, we introduce an Adaptive Gradient Scaler (AGS) based on an optimal scale factor to dynamically balance inter-branch gradient contributions via norm-based scaling. Experiments on image classification, object detection, and language understanding tasks demonstrate that SURGE performs best over state-of-the-art methods.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 2 invented entities

The method rests on new architectural components whose validity depends on unproven assumptions about gradient decomposition accuracy and scale balancing.

free parameters (1)
  • optimal scale factor
    Used in the AGS to balance inter-branch gradients; appears to be chosen dynamically or fitted during optimization.
axioms (1)
  • domain assumption Output decomposition during backpropagation decouples gradient flow and enables unbiased estimation beyond first-order STE approximation.
    Invoked directly in the DPGC design to justify auxiliary branch utility.
invented entities (2)
  • Dual-Path Gradient Compensator (DPGC) no independent evidence
    purpose: Constructs parallel full-precision auxiliary branch per binarized layer for gradient compensation
    Newly introduced component to mitigate mismatch.
  • Adaptive Gradient Scaler (AGS) no independent evidence
    purpose: Dynamically balances gradient contributions via norm-based scaling
    New mechanism for training stability.

pith-pipeline@v0.9.0 · 5527 in / 1324 out tokens · 85547 ms · 2026-05-13T06:34:36.751549+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

129 extracted references · 129 canonical work pages · 5 internal anchors
