hub
DoReFa-Net: Training Low Bitwidth Convolutional Neural Networks with Low Bitwidth Gradients
17 Pith papers cite this work. Polarity classification is still indexing.
abstract
We propose DoReFa-Net, a method to train convolutional neural networks that have low bitwidth weights and activations using low bitwidth parameter gradients. In particular, during backward pass, parameter gradients are stochastically quantized to low bitwidth numbers before being propagated to convolutional layers. As convolutions during forward/backward passes can now operate on low bitwidth weights and activations/gradients respectively, DoReFa-Net can use bit convolution kernels to accelerate both training and inference. Moreover, as bit convolutions can be efficiently implemented on CPU, FPGA, ASIC and GPU, DoReFa-Net opens the way to accelerate training of low bitwidth neural network on these hardware. Our experiments on SVHN and ImageNet datasets prove that DoReFa-Net can achieve comparable prediction accuracy as 32-bit counterparts. For example, a DoReFa-Net derived from AlexNet that has 1-bit weights, 2-bit activations, can be trained from scratch using 6-bit gradients to get 46.1% top-1 accuracy on ImageNet validation set. The DoReFa-Net AlexNet model is released publicly.
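The abstract's idea of stochastically quantizing gradients to a low bitwidth can be illustrated with a minimal sketch. This is not the paper's released code: the function names `quantize_k` and `stochastic_quantize`, the scaling by the largest magnitude, and the uniform noise of at most half a quantization step are all assumptions made here for illustration.

```python
import random


def quantize_k(x, k):
    """Round x in [0, 1] to the nearest of 2**k evenly spaced levels."""
    levels = (1 << k) - 1
    return round(x * levels) / levels


def stochastic_quantize(values, k, rng):
    """Quantize signed values to k bits, adding uniform noise before rounding.

    Illustrative assumptions: values are scaled by their largest magnitude,
    and the noise spans at most half a quantization step on either side.
    """
    scale = max(abs(v) for v in values)
    if scale == 0.0:
        return list(values)
    levels = (1 << k) - 1
    out = []
    for v in values:
        unit = v / (2.0 * scale) + 0.5            # map [-scale, scale] to [0, 1]
        noise = rng.uniform(-0.5, 0.5) / levels   # random dither before rounding
        unit = min(1.0, max(0.0, unit + noise))   # stay inside [0, 1]
        out.append(2.0 * scale * (quantize_k(unit, k) - 0.5))
    return out


rng = random.Random(0)
grads = [0.8, -0.3, 0.05, -1.2]   # hypothetical gradient values
q = stochastic_quantize(grads, k=6, rng=rng)
```

Under these assumptions each quantized value lands on one of 2**k grid points and differs from its input by at most one quantization step, which is `2 * scale / (2**k - 1)`.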
hub tools
citation-role summary
citation-polarity summary
roles
method 1
polarities
use method 1
citing papers explorer
- Layerwise Progressive Freezing: A Training Scaffold for Depth-Scalable Binary Networks
  StoMPP progressively binarizes BNN layers layerwise from input to output via stochastic masks, delivering depth-scalable accuracy gains in a fully STE-free regime by controlling activation-induced gradient blockades.
- FTerViT: Fully Ternary Vision Transformer
  FTerViT introduces fully ternary Vision Transformers with TernaryBitConv2d and TernaryLayerNorm operators, achieving 82.43% ImageNet top-1 at 6.09 MB with 15x compression.
- DPQuant: Efficient and Differentially-Private Model Training via Dynamic Quantization Scheduling
  DPQuant uses epoch-wise probabilistic layer rotation and DP loss sensitivity to quantize only a changing subset of layers, reducing accuracy degradation from quantization noise in DP-SGD and delivering up to 2.21x throughput gains with under 2% accuracy drop.
- Reclaiming Residual Knowledge: A Novel Paradigm to Low-Bit Quantization
  CoRa reclaims quantization residuals in pre-trained ConvNets by searching low-rank adapter architectures instead of weights, matching SOTA accuracy on ImageNet in 3-4 bit settings with under 250 iterations on 1600 images.
- Training single-electron and single-photon stochastic physical neural networks
  Single-electron and single-photon stochastic physical neural networks achieve over 97% MNIST test accuracy when trained with empirical outputs in the backward pass using few trials per layer.
- Mixed Precision Training
  Mixed precision training uses FP16 for most computations, FP32 master weights for accumulation, and loss scaling to enable accurate training of large DNNs with halved memory usage.
- QuantSR+: Pushing the Limit of Quantized Image Super-Resolution Networks
  QuantSR+ introduces RBD, QSA, and SFD techniques to achieve state-of-the-art accuracy-efficiency trade-offs in 2-4 bit quantized image super-resolution networks, with reported PSNR gains like 0.29 dB on Urban100 for SwinIR-S.
- Plug-and-Play Spiking Operators: Breaking the Nonlinearity Bottleneck in Spiking Transformers
  A modular framework decomposes Transformer nonlinearities into spike-compatible primitives realized via LIF population coding and bit-shift scaling, supporting Softmax, SiLU, and normalization with under 1% accuracy drop in LLMs.
- FP8 Formats for Deep Learning
  FP8 formats E4M3 and E5M2 match 16-bit training accuracy on CNNs, RNNs, and Transformers up to 175B parameters without hyperparameter changes.
- DiBA: Diagonal and Binary Matrix Approximation for Neural Network Weight Compression
  DiBA factors weight matrices into diagonal-binary-diagonal-binary-diagonal form to cut matrix-vector multiplies from mn to m+k+n operations and improves accuracy on DistilBERT and audio transformer tasks after replacement.
- JacQuant: STE-Free Quantization-Aware Training via Learned Jacobian Surrogates
  JacQuant learns a diagonal or block-diagonal Jacobian surrogate to replace STE in QAT, with convergence proofs and higher accuracy than STE at ≤2 bits on LLMs.
- Multibit neural inference in a N-ary crossbar architecture
  Simulation of 4-state MTJ crossbars achieves 94.48% MNIST accuracy for neural inference, close to 97.56% software baseline, with analysis showing quantization as primary error and an optimal number of states per cell.
- Weight Normalization based Quantization for Deep Neural Network Compression
  WNQ uses weight normalization to reshape weight distributions and reduce quantization error, outperforming baselines on CIFAR-100 and ImageNet.
- Design and Implementation of BNN-Based Object Detection on FPGA
  A BNN-based YOLOv3-tiny-like object detector with 1-bit weights and 8-bit activations is implemented in Verilog on FPGA, achieving 39.6% mAP50 on VOC and 0.999964 correlation with the ONNX model in RTL simulation.
- Carbon-Taxed Transformers: A Green Compression Pipeline for Overgrown Language Models
  CTT is a compression pipeline for LLMs that achieves up to 49x memory reduction, 10x faster inference, 81% lower CO2 emissions, and retains 68-98% accuracy on code clone detection, summarization, and generation tasks.
- Prune-Quantize-Distill: An Ordered Pipeline for Efficient Neural Network Compression
  The prune-quantize-distill ordering produces a better accuracy-size-latency frontier on CIFAR-10/100 than any single technique or other orderings, with INT8 QAT providing the main runtime gain.
- SURGE: Surrogate Gradient Adaptation in Binary Neural Networks