SURGE: Surrogate Gradient Adaptation in Binary Neural Networks
Pith reviewed 2026-05-13 06:34 UTC · model grok-4.3
The pith
SURGE reduces gradient mismatch in binary neural networks by routing auxiliary full-precision gradients through a dual-path compensator and adaptive scaler.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
SURGE constructs, for each binarized layer, a parallel full-precision auxiliary branch whose gradients are decoupled from the binary path via output decomposition. The Dual-Path Gradient Compensator uses this branch to estimate components beyond the first-order approximation of the straight-through estimator, and the Adaptive Gradient Scaler applies a norm-based optimal scale factor to balance the two gradient streams without introducing new bias.
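The paper's exact decomposition is not reproduced here, but a minimal PyTorch sketch of the general shape, assuming the binary path uses a clipped straight-through estimator and the auxiliary branch shares the latent full-precision weights, could look like this (DualPathLinear and alpha are illustrative names, not the paper's):

```python
import torch
import torch.nn as nn

class BinarizeSTE(torch.autograd.Function):
    """sign() in the forward pass; straight-through gradient with the
    usual |w| <= 1 clip in the backward pass."""
    @staticmethod
    def forward(ctx, w):
        ctx.save_for_backward(w)
        return torch.sign(w)

    @staticmethod
    def backward(ctx, grad_out):
        (w,) = ctx.saved_tensors
        return grad_out * (w.abs() <= 1).to(grad_out.dtype)

class DualPathLinear(nn.Module):
    """Sketch of a dual-path layer: a binary main path plus a full-precision
    auxiliary path sharing the same latent weights. alpha stands in for the
    adaptive scale and is assumed to be updated externally."""
    def __init__(self, in_f, out_f):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(out_f, in_f) * 0.01)
        self.register_buffer("alpha", torch.tensor(1.0))

    def forward(self, x):
        wb = BinarizeSTE.apply(self.weight)
        y_bin = nn.functional.linear(x, wb)           # binary path
        y_aux = nn.functional.linear(x, self.weight)  # full-precision path
        # Output decomposition: the forward value equals the binary output
        # (the auxiliary term is numerically zero), but the detach trick
        # routes an extra full-precision gradient to the latent weights.
        return y_bin + self.alpha * (y_aux - y_aux.detach())
```

Because the auxiliary term is identically zero in the forward pass, the deployed network stays purely binary, consistent with the training-only claim further down.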
What carries the argument
Dual-Path Gradient Compensator (DPGC) paired with Adaptive Gradient Scaler (AGS): the DPGC adds a full-precision auxiliary branch per binarized layer and decouples gradients through output decomposition; the AGS computes a dynamic norm-based scale to balance branch contributions.
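The abstract says only that the scale is dynamic and norm-based. One plausible reading, assuming the scale equalizes the norms of the two gradient streams, is sketched here; ags_scale is a hypothetical name, not the paper's:

```python
import torch

def ags_scale(grad_binary: torch.Tensor, grad_aux: torch.Tensor,
              eps: float = 1e-12) -> torch.Tensor:
    """Hypothetical norm-based scale: resize the auxiliary gradient so its
    norm matches the binary-path gradient, keeping either stream from
    dominating or vanishing. The paper's 'optimal scale factor' may differ."""
    return grad_binary.norm() / (grad_aux.norm() + eps)

# Toy usage: rebalance two gradient streams before combining them.
g_bin = torch.randn(256) * 0.01   # small clipped-STE gradients
g_aux = torch.randn(256)          # much larger auxiliary gradients
g = g_bin + ags_scale(g_bin, g_aux) * g_aux
assert torch.isfinite(g).all()
```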
If this is right
- Binary neural networks reach higher top-1 accuracy on ImageNet-scale classification than previous surrogate-gradient methods.
- Object detectors and language models quantized to binary weights converge faster and achieve better task metrics when trained with the dual-path compensator.
- Training stability improves because the adaptive scaler prevents the auxiliary gradient from dominating or vanishing relative to the binary path.
Where Pith is reading between the lines
- The same auxiliary-branch idea could be applied to other non-differentiable operations such as hard thresholding or low-bit quantization beyond binary weights; see the sketch after this list.
- Because the compensator is added only during training, inference cost remains identical to that of a standard binary network.
- The norm-based scaling rule might generalize to other multi-path gradient flows that currently rely on fixed weighting.
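On the first point, the detach trick from the dual-path sketch above extends directly to any non-differentiable rounding. A minimal illustration for a uniform k-bit quantizer, again hypothetical rather than from the paper:

```python
import torch

def quantize_ste(x: torch.Tensor, bits: int = 2) -> torch.Tensor:
    """Uniform quantizer with a straight-through backward pass: the forward
    value is quantized, but round() is bypassed in the gradient. An auxiliary
    full-precision branch could be attached exactly as in the dual-path
    sketch above."""
    levels = 2 ** bits - 1
    xc = x.clamp(0, 1)
    xq = torch.round(xc * levels) / levels
    return xc + (xq - xc).detach()  # forward: xq; backward: gradient of xc
```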
Load-bearing premise
The full-precision auxiliary branch can estimate the gradient components missed by the straight-through estimator without itself introducing bias or training instability.
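To make the premise concrete, here is a quick numerical probe of what STE leaves behind, using a tanh relaxation as a rough stand-in for the information a full-precision branch sees (an illustration, not the paper's estimator):

```python
import torch

w = torch.linspace(-2, 2, 9, requires_grad=True)

# Clipped-STE gradient of sum(sign(w)): 1 inside |w| <= 1, 0 outside.
g_ste = (w.abs() <= 1).float()

# Gradient of the smooth relaxation sum(tanh(5w)).
torch.tanh(5 * w).sum().backward()

# The gap is the component STE misses; it is largest near the sign flip.
print((w.grad - g_ste).abs())
```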
What would settle it
A controlled ablation in which the auxiliary branch is removed or the scale factor is frozen shows no accuracy gain over a plain straight-through estimator on the same tasks and architectures.
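A minimal harness for that ablation, assuming an experimenter-supplied train_and_eval function; everything here is hypothetical scaffolding:

```python
# Three arms: if neither SURGE arm beats plain STE beyond seed noise,
# the load-bearing premise fails.
ARMS = {
    "plain_ste":    dict(aux_branch=False, adaptive_scale=False),
    "frozen_scale": dict(aux_branch=True,  adaptive_scale=False),
    "full_surge":   dict(aux_branch=True,  adaptive_scale=True),
}

def run_ablation(train_and_eval, seeds=(0, 1, 2)):
    """train_and_eval(seed=..., **config) -> top-1 accuracy, user-supplied."""
    return {name: [train_and_eval(seed=s, **cfg) for s in seeds]
            for name, cfg in ARMS.items()}
```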
Original abstract
The training of Binary Neural Networks (BNNs) is fundamentally based on gradient approximation for non-differentiable binarization operations (e.g., the sign function). However, prevailing methods, including the Straight-Through Estimator (STE) and its improved variants, rely on hand-crafted designs that suffer from the gradient mismatch problem and from information loss induced by fixed-range gradient clipping. To address this, we propose SURrogate GradiEnt Adaptation (SURGE), a novel learnable gradient compensation framework with theoretical grounding. SURGE mitigates gradient mismatch through auxiliary backpropagation. Specifically, we design a Dual-Path Gradient Compensator (DPGC) that constructs a parallel full-precision auxiliary branch for each binarized layer, decoupling gradient flow via output decomposition during backpropagation. DPGC enables bias-reduced gradient estimation by leveraging the full-precision branch to estimate components beyond STE's first-order approximation. To further enhance training stability, we introduce an Adaptive Gradient Scaler (AGS) based on an optimal scale factor to dynamically balance inter-branch gradient contributions via norm-based scaling. Experiments on image classification, object detection, and language understanding tasks demonstrate that SURGE outperforms state-of-the-art methods.
Editorial analysis
A structured set of objections, weighed in public.
Axiom & Free-Parameter Ledger
free parameters (1)
- optimal scale factor
axioms (1)
- domain assumption: Output decomposition during backpropagation decouples gradient flow and enables bias-reduced estimation beyond STE's first-order approximation.
invented entities (2)
- Dual-Path Gradient Compensator (DPGC): no independent evidence
- Adaptive Gradient Scaler (AGS): no independent evidence