pith. machine review for the scientific record.

arxiv: 2604.03957 · v1 · submitted 2026-04-05 · 💻 cs.LG · cs.CL

Recognition: 2 theorem links · Lean Theorem

BWTA: Accurate and Efficient Binarized Transformer by Algorithm-Hardware Co-design

Authors on Pith: no claims yet

Pith reviewed 2026-05-13 17:22 UTC · model grok-4.3

classification 💻 cs.LG · cs.CL
keywords ultra-low-bit quantization · binarized transformer · ternary activation · algorithm-hardware co-design · BERT · large language models · CUDA kernel · quantization-aware training

The pith

Binary weights and ternary activations keep Transformers within about 3.5 percent of full-precision accuracy while their matrix multiplications run 16 to 24 times faster than FP16 on GPUs.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces BWTA, a quantization approach that sets weights to binary values and activations to ternary values by projecting small-magnitude values to zero. It pairs this with a Smooth Multi-Stage Quantization training routine that uses levelwise degradation and magnitude alignment to keep convergence stable. Custom CUDA kernels then handle the resulting binary and ternary matrix multiplications for both linear layers and attention, delivering large speedups at modest accuracy cost on BERT and language models. A sympathetic reader would care because the method shows a concrete path to low-memory, low-latency inference without requiring new hardware or major model redesigns.
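
To make the projection step concrete, here is a minimal sketch of a binary-weight, ternary-activation quantizer in the spirit of the scheme described above. The threshold rule, the per-tensor scales, and the straight-through gradient are illustrative assumptions, not the paper's exact formulation.

```python
import torch

def binarize_weights(w: torch.Tensor) -> torch.Tensor:
    """Binarize weights to {-alpha, +alpha} with a per-tensor scale (assumed scheme)."""
    alpha = w.abs().mean()              # simple magnitude-preserving scale
    return alpha * torch.sign(w)

def ternarize_activations(x: torch.Tensor, threshold_ratio: float = 0.05) -> torch.Tensor:
    """Ternarize activations to {-beta, 0, +beta}, projecting small-magnitude values to zero.

    `threshold_ratio` is a hypothetical hyper-parameter; the paper's actual
    projection rule is not specified in the material above.
    """
    tau = threshold_ratio * x.abs().max()
    mask = (x.abs() > tau).to(x.dtype)                      # zero-projection of tiny values
    beta = (x.abs() * mask).sum() / mask.sum().clamp(min=1)
    return beta * torch.sign(x) * mask

class STEQuant(torch.autograd.Function):
    """Straight-through estimator so gradients flow through the quantizer during QAT."""
    @staticmethod
    def forward(ctx, x, quantize_fn):
        return quantize_fn(x)
    @staticmethod
    def backward(ctx, grad_out):
        return grad_out, None

# Usage: x_q = STEQuant.apply(x, ternarize_activations)
```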

Core claim

BWTA projects tiny values to zero during the binarization of weights and ternarization of activations, trains the model with Smooth Multi-Stage Quantization, which combines a Levelwise Degradation Strategy with a Magnitude-Alignment Projection Factor, and supplies a BWTA MatMul CUDA kernel that uses bit-packing; this combination keeps the average GLUE drop at 3.5 percent for BERT and maintains comparable perplexity for LLMs while delivering a 16-24x kernel-level speedup over FP16.

What carries the argument

The BWTA scheme that binarizes weights and ternarizes activations by projecting tiny values to zero, supported by Smooth Multi-Stage Quantization training and a custom instruction-level-parallel CUDA MatMul kernel.
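
The Smooth Multi-Stage Quantization routine is only characterized at a high level here. As one illustration of the general idea of gradually tightening the quantization grid over training, a hypothetical levelwise schedule might look like the sketch below; the stage boundaries and level counts are assumptions, not the paper's actual Levelwise Degradation Strategy.

```python
import torch

def multistage_levels(step: int, total_steps: int) -> int:
    """Hypothetical levelwise schedule: shrink the activation grid in stages,
    ending at the ternary grid {-1, 0, +1}. Boundaries and counts are assumptions."""
    stages = [(0.25, 15), (0.50, 7), (0.75, 5), (1.00, 3)]  # (training fraction, #levels)
    progress = step / max(1, total_steps)
    for frac, levels in stages:
        if progress <= frac:
            return levels
    return 3

def quantize_to_levels(x: torch.Tensor, levels: int) -> torch.Tensor:
    """Uniform symmetric fake-quantization onto an odd number of levels."""
    half = max((levels - 1) // 2, 1)
    scale = (x.abs().max() / half).clamp(min=1e-8)
    return torch.round(x / scale).clamp(-half, half) * scale
```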

If this is right

  • BERT models under BWTA show an average 3.5 percent GLUE drop and less than 2 percent drop on five additional tasks.
  • Large language models quantized with BWTA retain comparable perplexity and task accuracy to their full-precision versions.
  • The custom CUDA kernel delivers 16 to 24 times speedup over FP16 at the matrix-multiplication level and 216 to 330 tokens per second end-to-end prefill on LLMs.
  • Memory footprint shrinks because weights and activations occupy only one or two bits per value (see the packing sketch below).
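
The memory claim follows from simple bit arithmetic once weights are packed. A minimal host-side sketch of packing sign bits into bytes, and the resulting footprint relative to FP16, is shown below; the layout is an illustrative assumption, not the kernel's actual packing format (ternary activations would need two bits per value rather than one).

```python
import numpy as np

def pack_signs(w: np.ndarray) -> np.ndarray:
    """Pack weight sign bits at 1 bit per weight (8 weights per byte).
    Illustrative host-side layout only; the BWTA kernel packs on the GPU."""
    bits = (w.ravel() >= 0).astype(np.uint8)
    return np.packbits(bits)

w = np.random.randn(4096, 4096).astype(np.float16)
packed = pack_signs(w)
print(f"FP16: {w.nbytes / 2**20:.1f} MiB, packed binary: {packed.nbytes / 2**20:.1f} MiB, "
      f"{w.nbytes / packed.nbytes:.0f}x smaller")
```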

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same projection-and-kernel pattern could be tested on other attention-based architectures such as vision transformers to check whether the accuracy preservation holds beyond language.
  • If the zero-projection rule generalizes, similar custom kernels might be written for additional low-bit formats on the same GPU hardware without waiting for new instruction sets.
  • The reported token-per-second numbers suggest BWTA could be used to serve larger models on existing server GPUs before new accelerator hardware arrives.

Load-bearing premise

Projecting tiny values to zero together with the Smooth Multi-Stage Quantization procedure will preserve accuracy across Transformer models and tasks without needing architecture-specific retuning.

What would settle it

Running the published BWTA procedure on a standard BERT-base model and observing an average GLUE score drop larger than 5 percent compared with the full-precision baseline would falsify the near-full-precision claim.
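
The check itself is a one-line computation over per-task deltas. A small sketch with placeholder scores (not the paper's reported numbers):

```python
def average_glue_drop(fp32: dict, quantized: dict) -> float:
    """Average drop in GLUE score, in percentage points, across tasks."""
    return sum(fp32[t] - quantized[t] for t in fp32) / len(fp32)

# Placeholder scores for illustration only; not results from the paper.
fp32 = {"MNLI": 84.0, "QQP": 91.0, "SST-2": 93.0, "CoLA": 60.0}
bwta = {"MNLI": 80.0, "QQP": 88.0, "SST-2": 91.0, "CoLA": 54.0}

drop = average_glue_drop(fp32, bwta)
print(f"average drop: {drop:.2f} points; falsifies the claim: {drop > 5.0}")
```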

Figures

Figures reproduced from arXiv: 2604.03957 by Jinyang Guo, Jiwen Lu, Shenghao Jin, Xianglong Liu, Yifu Ding.

Figure 2: Histograms for binary/ternary activation in Self-Attention. [PITH_FULL_IMAGE:figures/full_fig_p003_2.png]
Figure 4: (a) The illustration of the bit/levelwise multi-stage … [PITH_FULL_IMAGE:figures/full_fig_p004_4.png]
Figure 6: Instruction-level parallel bitpack from 32 FP16 (Half) values … [PITH_FULL_IMAGE:figures/full_fig_p006_6.png]
Figure 10: Comparison of different re-initialization strategies … [PITH_FULL_IMAGE:figures/full_fig_p007_10.png]
Figure 11: Overall time comparisons for each step in the GEMM kernels with four typical shapes. [PITH_FULL_IMAGE:figures/full_fig_p009_11.png]
Figure 13: Task loss surface of quantized models and the full … [PITH_FULL_IMAGE:figures/full_fig_p010_13.png]
Figure 12: The quantized activation of (a) bitwise and (b) levelwise … [PITH_FULL_IMAGE:figures/full_fig_p010_12.png]
Figure 14: The curves of scaling factors and their gradients. [PITH_FULL_IMAGE:figures/full_fig_p015_14.png]
Original abstract

Ultra low-bit quantization brings substantial efficiency for Transformer-based models, but the accuracy degradation and limited GPU support hinder its wide usage. In this paper, we analyze zero-point distortion in binarization and propose a Binary Weights & Ternary Activations (BWTA) quantization scheme, which projects tiny values to zero and preserves the accuracy of extremely low-bit models. For training, we propose Smooth Multi-Stage Quantization, combining a Levelwise Degradation Strategy and a Magnitude-Alignment Projection Factor to enable stable and fast convergence. For inference, we develop a BWTA MatMul CUDA kernel with instruction-level parallel bit-packing and comprehensive binary/ternary MatMul implementations for both linear and attention operators, allowing seamless integration across Transformer architectures. Experiments show that BWTA approaches full-precision performance for BERT, with an average 3.5% drop on GLUE and less than 2% drop on five tasks, and achieves comparable perplexity and accuracy for LLMs. In efficiency, it delivers 16 to 24 times kernel-level speedup over FP16 on NVIDIA GPUs, and 216 to 330 tokens/s end-to-end prefill speedup with lower memory footprint on LLMs. As an algorithm-hardware co-design, BWTA demonstrates practical, low-latency ultra-low-bit inference without sacrificing model quality.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper proposes Binary Weights & Ternary Activations (BWTA) quantization for Transformer models, which projects tiny values to zero to mitigate zero-point distortion in binarization. Training uses Smooth Multi-Stage Quantization combining Levelwise Degradation Strategy and Magnitude-Alignment Projection Factor for stable convergence. Inference relies on a custom BWTA MatMul CUDA kernel with bit-packing for linear and attention operators. Experiments claim near full-precision results: 3.5% average GLUE drop for BERT-base, <2% on five tasks, comparable LLM perplexity/accuracy, plus 16-24x kernel speedup over FP16 and 216-330 tokens/s end-to-end prefill.

Significance. If the empirical claims hold with proper validation, the work would demonstrate a practical algorithm-hardware co-design for ultra-low-bit Transformer inference that achieves substantial efficiency gains while preserving model quality, addressing key barriers to deploying binarized/ternary models on GPUs.

major comments (3)
  1. [Experiments] Experimental section: the headline accuracy claims (3.5% GLUE drop, <2% on five tasks, comparable LLM metrics) are presented without error bars, standard deviations across runs, or ablation studies isolating the zero-projection step versus the Smooth Multi-Stage Quantization components; this makes it impossible to assess whether the reported drops are statistically stable or architecture-specific.
  2. [§3.2] §3.2 (Smooth Multi-Stage Quantization): the Magnitude-Alignment Projection Factor is introduced as a free hyper-parameter without a scaling analysis or bound showing when the induced distortion remains negligible as model depth or width increases; the assumption that it generalizes across BERT and LLMs without per-architecture retuning is load-bearing for the central accuracy claim but unsupported by the provided evidence.
  3. [Inference kernel] Kernel implementation (BWTA MatMul CUDA kernel): the 16-24x speedup and seamless integration claims rest on an unverified custom kernel; no micro-benchmark tables compare against cuBLAS/FP16 baselines under identical batch/size conditions, nor is there confirmation that the bit-packing preserves numerical equivalence to the quantized forward pass.
minor comments (2)
  1. [Abstract] Abstract and §4: the phrase 'approaches full-precision performance' is used without a precise definition (e.g., within X% of FP32 on all tasks); a table summarizing per-task deltas would improve clarity.
  2. [§3] Notation: the Levelwise Degradation Strategy and Magnitude-Alignment Projection Factor are described at a high level; explicit pseudocode or equations for the projection threshold and degradation schedule would aid reproducibility.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive feedback on experimental rigor, the analysis of the projection factor, and kernel verification. We address each major comment below and will incorporate revisions to strengthen the manuscript.

Point-by-point responses
  1. Referee: [Experiments] Experimental section: the headline accuracy claims (3.5% GLUE drop, <2% on five tasks, comparable LLM metrics) are presented without error bars, standard deviations across runs, or ablation studies isolating the zero-projection step versus the Smooth Multi-Stage Quantization components; this makes it impossible to assess whether the reported drops are statistically stable or architecture-specific.

    Authors: We agree that error bars and component ablations would improve statistical assessment. In the revised manuscript we will report all GLUE and LLM metrics as means over at least three random seeds with standard deviations. We will also add a dedicated ablation subsection that isolates the zero-projection step from the Levelwise Degradation Strategy and Magnitude-Alignment Projection Factor, quantifying their individual and joint contributions to final accuracy. revision: yes

  2. Referee: [§3.2] §3.2 (Smooth Multi-Stage Quantization): the Magnitude-Alignment Projection Factor is introduced as a free hyper-parameter without a scaling analysis or bound showing when the induced distortion remains negligible as model depth or width increases; the assumption that it generalizes across BERT and LLMs without per-architecture retuning is load-bearing for the central accuracy claim but unsupported by the provided evidence.

    Authors: The factor is fixed at 0.1 for all models without retuning. While we do not derive a closed-form bound on distortion scaling, we empirically validate the choice across BERT-base/large and LLMs up to 7B parameters. In revision we will add a sensitivity study varying the factor on models of different widths and depths, together with an empirical analysis of activation magnitude distributions showing why the induced distortion stays small. revision: partial

  3. Referee: [Inference kernel] Kernel implementation (BWTA MatMul CUDA kernel): the 16-24x speedup and seamless integration claims rest on an unverified custom kernel; no micro-benchmark tables compare against cuBLAS/FP16 baselines under identical batch/size conditions, nor is there confirmation that the bit-packing preserves numerical equivalence to the quantized forward pass.

    Authors: We will add a micro-benchmark table comparing the BWTA MatMul kernel against cuBLAS FP16 for the exact matrix shapes and batch/sequence configurations used in the BERT and LLM experiments. We will also include a numerical equivalence verification section showing that the bit-packed kernel matches a reference quantized implementation to within 1e-6 maximum absolute error. revision: yes
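
To make the proposed equivalence check concrete, here is a minimal host-side sketch: it compares an integer reference matmul over quantized operands against a second execution path standing in for the bit-packed kernel. A real verification would call the CUDA kernel on GPU-resident tensors; everything here is an assumption about the test harness, not the authors' code.

```python
import numpy as np

rng = np.random.default_rng(0)

# Quantized operands: binary weights in {-1, +1}, ternary activations in {-1, 0, +1}.
W = rng.choice([-1, 1], size=(256, 256)).astype(np.int8)
A = rng.choice([-1, 0, 1], size=(64, 256)).astype(np.int8)

# Reference path: plain integer matmul over the quantized values.
ref = A.astype(np.int32) @ W.astype(np.int32).T

# Stand-in for the bit-packed kernel output (a real check would invoke the CUDA kernel here).
kernel_out = (A.astype(np.float32) @ W.astype(np.float32).T).astype(np.int32)

max_abs_err = np.abs(ref - kernel_out).max()
print("max abs error:", max_abs_err)  # expect exact agreement for integer arithmetic
assert max_abs_err == 0
```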

Circularity Check

0 steps flagged

No circularity; empirical co-design validated by experiments

Full rationale

The paper presents BWTA as an algorithm-hardware co-design consisting of a zero-projection binarization scheme, Smooth Multi-Stage Quantization training (Levelwise Degradation + Magnitude-Alignment Projection Factor), and a custom CUDA MatMul kernel. No equations, derivations, or predictions are shown that reduce by construction to fitted parameters, self-defined quantities, or self-citation chains. Performance numbers (3.5% GLUE drop, 16-24x speedup) are reported as direct experimental outcomes on BERT and LLMs rather than outputs forced from the method's own inputs.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 1 invented entity

The central claims rest on the domain assumption that zero-point projection in binarization can be compensated by magnitude alignment during training, plus standard assumptions about GPU instruction behavior for bit operations.

free parameters (1)
  • Magnitude-Alignment Projection Factor
    Introduced to stabilize convergence in the multi-stage quantization process; its value is not specified in the abstract.
axioms (1)
  • domain assumption: Gradual introduction of quantization constraints via levelwise degradation enables stable training of low-bit models.
    Core premise of the Smooth Multi-Stage Quantization strategy described in the abstract.
invented entities (1)
  • BWTA MatMul CUDA kernel (no independent evidence)
    purpose: Provides instruction-level parallel bit-packing for binary/ternary matrix multiplications in linear and attention layers.
    New implementation artifact required for the claimed speedups; no independent evidence of correctness beyond the abstract's performance numbers.

pith-pipeline@v0.9.0 · 5545 in / 1403 out tokens · 34772 ms · 2026-05-13T17:22:39.541516+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
  • matches: The paper's claim is directly supported by a theorem in the formal canon.
  • supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
  • extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
  • uses: The paper appears to rely on the theorem as machinery.
  • contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
  • unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

64 extracted references · 64 canonical work pages · 4 internal anchors

  1. [1]

    Analyzing the Structure of Attention in a Transformer Language Model

    J. Vig and Y. Belinkov, “Analyzing the structure of attention in a transformer language model,”arXiv preprint arXiv:1906.04284, 2019

  2. [2]

    A survey on semi-supervised learning,

    J. E. Van Engelen and H. H. Hoos, “A survey on semi-supervised learning,”Machine learning, vol. 109, no. 2, pp. 373–440, 2020

  3. [3]

    Multimodal learning with transformers: A survey,

    P . Xu, X. Zhu, and D. A. Clifton, “Multimodal learning with transformers: A survey,”IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 45, no. 10, pp. 12 113–12 132, 2023

  4. [4]

    Crossformer++: A versatile vision transformer hinging on cross-scale attention,

    W. Wang, W. Chen, Q. Qiu, L. Chen, B. Wu, B. Lin, X. He, and W. Liu, “Crossformer++: A versatile vision transformer hinging on cross-scale attention,”IEEE Transactions on Pattern Analysis and Machine Intelligence, 2023

  5. [5]

    A practical survey on faster and lighter transformers,

    Q. Fournier, G. M. Caron, and D. Aloise, “A practical survey on faster and lighter transformers,”ACM Computing Surveys, vol. 55, no. 14s, pp. 1–40, 2023

  6. [6]

    Towards accurate and compact architectures via neural architecture transformer,

    Y. Guo, Y. Zheng, M. Tan, Q. Chen, Z. Li, J. Chen, P . Zhao, and J. Huang, “Towards accurate and compact architectures via neural architecture transformer,”IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 44, no. 10, pp. 6501–6516, 2021

  7. [7]

    Q-vit: Accurate and fully quantized low-bit vision transformer,

    Y. Li, S. Xu, B. Zhang, X. Cao, P . Gao, and G. Guo, “Q-vit: Accurate and fully quantized low-bit vision transformer,”Advances in neural information processing systems, vol. 35, pp. 34 451–34 463, 2022

  8. [8]

    SQuant: On-the-fly data-free quantization via diagonal hessian approximation,

    C. Guo, Y. Qiu, J. Leng, X. Gao, C. Zhang, Y. Liu, F. Yang, Y. Zhu, and M. Guo, “SQuant: On-the-fly data-free quantization via diagonal hessian approximation,” inInternational Conference on Learning Representations, 2022

  9. [9]

    Quantization and training of neural networks for efficient integer-arithmetic-only inference,

    B. Jacob, S. Kligys, B. Chen, M. Zhu, M. Tang, A. Howard, H. Adam, and D. Kalenichenko, “Quantization and training of neural networks for efficient integer-arithmetic-only inference,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 2704–2713

  10. [11]

    Knowledge distillation via the target-aware transformer,

    S. Lin, H. Xie, B. Wang, K. Yu, X. Chang, X. Liang, and G. Wang, “Knowledge distillation via the target-aware transformer,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 10 915–10 924

  11. [12]

    Dearkd: data-efficient early knowledge distillation for vision transformers,

    X. Chen, Q. Cao, Y. Zhong, J. Zhang, S. Gao, and D. Tao, “Dearkd: data-efficient early knowledge distillation for vision transformers,” inProceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2022, pp. 12 052–12 062

  12. [13]

    Knowledge distillation and student-teacher learning for visual intelligence: A review and new outlooks,

    L. Wang and K.-J. Yoon, “Knowledge distillation and student-teacher learning for visual intelligence: A review and new outlooks,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 44, no. 6, pp. 3048–3068, 2021

  13. [14]

    Tprune: Efficient transformer pruning for mobile devices,

    J. Mao, H. Yang, A. Li, H. Li, and Y. Chen, “Tprune: Efficient transformer pruning for mobile devices,”ACM Transactions on Cyber-Physical Systems, vol. 5, no. 3, pp. 1–22, 2021

  14. [15]

    Width & depth pruning for vision transformers,

    F. Yu, K. Huang, M. Wang, Y. Cheng, W. Chu, and L. Cui, “Width & depth pruning for vision transformers,” inProceedings of the AAAI Conference on Artificial Intelligence, vol. 36, no. 3, 2022, pp. 3143–3151

  15. [16]

    A survey on deep neural network pruning: Taxonomy, comparison, analysis, and recommendations,

    H. Cheng, M. Zhang, and J. Q. Shi, “A survey on deep neural network pruning: Taxonomy, comparison, analysis, and recommendations,” IEEE Transactions on Pattern Analysis and Machine Intelligence, 2024

  16. [17]

    Shapeshifter: a parameter-efficient transformer using factorized reshaped matrices,

    A. Panahi, S. Saeedi, and T. Arodz, “Shapeshifter: a parameter-efficient transformer using factorized reshaped matrices,” Advances in Neural Information Processing Systems, vol. 34, pp. 1337–1350, 2021

  17. [18]

    Subformer: Exploring weight sharing for parameter efficiency in generative transformers,

    M. Reid, E. Marrese-Taylor, and Y. Matsuo, “Subformer: Exploring weight sharing for parameter efficiency in generative transformers,” arXiv preprint arXiv:2101.00234, 2021

  18. [19]

    Qlora: Efficient finetuning of quantized llms,

    T. Dettmers, A. Pagnoni, A. Holtzman, and L. Zettlemoyer, “Qlora: Efficient finetuning of quantized llms,” Advances in Neural Information Processing Systems, vol. 36, 2024

  19. [20]

    Lite-mono: A lightweight cnn and transformer architecture for self-supervised monocular depth estimation,

    N. Zhang, F. Nex, G. Vosselman, and N. Kerle, “Lite-mono: A lightweight cnn and transformer architecture for self-supervised monocular depth estimation,” inProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 18 537–18 546

  20. [21]

    Towards lightweight transformer via group-wise transformation for vision-and-language tasks,

    G. Luo, Y. Zhou, X. Sun, Y. Wang, L. Cao, Y. Wu, F. Huang, and R. Ji, “Towards lightweight transformer via group-wise transformation for vision-and-language tasks,” IEEE Transactions on Image Processing, vol. 31, pp. 3386–3398, 2022

  21. [22]

    Bibert: Accurate fully binarized bert,

    H. Qin, Y. Ding, M. Zhang, Q. Yan, A. Liu, Q. Dang, Z. Liu, and X. Liu, “Bibert: Accurate fully binarized bert,”arXiv preprint arXiv:2203.06390, 2022

  22. [23]

    Learning efficient binarized object detectors with information compression,

    Z. Wang, J. Lu, Z. Wu, and J. Zhou, “Learning efficient binarized object detectors with information compression,”IEEE Transactions on Pattern Analysis and Machine Intelligence, pp. 1–1, 2021

  23. [24]

    Learning channel-wise interactions for binary convolutional neural networks,

    Z. Wang, J. Lu, and J. Zhou, “Learning channel-wise interactions for binary convolutional neural networks,”IEEE Transactions on Pattern Analysis and Machine Intelligence, pp. 1–1, 2020

  24. [25]

    Hierarchical binary cnns for landmark localization with limited resources,

    A. Bulat and G. Tzimiropoulos, “Hierarchical binary cnns for landmark localization with limited resources,”IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 42, no. 2, pp. 343–356, 2020

  25. [26]

    Binaryformer: A hierarchical-adaptive binary vision transformer (vit) for efficient computing,

    M. Wang, Z. Xu, B. Zheng, and W. Xie, “Binaryformer: A hierarchical-adaptive binary vision transformer (vit) for efficient computing,”IEEE Transactions on Industrial Informatics, 2024

  26. [27]

    Bitnet: Scaling 1- bit transformers for large language models.arXiv preprint arXiv:2310.11453,

    H. Wang, S. Ma, L. Dong, S. Huang, H. Wang, L. Ma, F. Yang, R. Wang, Y. Wu, and F. Wei, “Bitnet: Scaling 1-bit transformers for large language models,”arXiv preprint arXiv:2310.11453, 2023

  27. [28]

    Scalable matmul-free language modeling,

    R.-J. Zhu, Y. Zhang, S. Abreu, E. Sifferman, T. Sheaves, Y. Wang, D. Richmond, S. B. Shrestha, P . Zhou, and J. K. Eshraghian, “Scalable matmul-free language modeling,”arXiv preprint arXiv:2406.02528, 2024

  28. [29]

    BinaryBERT: Pushing the limit of BERT quantization,

    H. Bai, W. Zhang, L. Hou, L. Shang, J. Jin, X. Jiang, Q. Liu, M. Lyu, and I. King, “BinaryBERT: Pushing the limit of BERT quantization,” inProceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), C. Zong, F. Xia, W. Li, and R....

  29. [30]

    Bit: Robustly binarized multi-distilled transformer,

    Z. Liu, B. Oguz, A. Pappu, L. Xiao, S. Yih, M. Li, R. Krishnamoorthi, and Y. Mehdad, “Bit: Robustly binarized multi-distilled transformer,” Advances in Neural Information Processing Systems, vol. 35, pp. 14303–14316, 2022

  30. [31]

    Mlbert: Multi-level fully binarized bert,

    M. M. Nasab, M. Fakhire, M. E. Salehi, and M. Modarresi, “Mlbert: Multi-level fully binarized bert,” in2024 1st International Confer- ence on Innovative Engineering Sciences and Technological Research (ICIESTR), 2024, pp. 1–6

  31. [32]

    Bipft: Binary pre-trained foundation transformer with low-rank estimation of binarization residual polynomials,

    X. Xing, L. Du, X. Wang, X. Zeng, Y. Wang, Z. Zhang, and J. Zhang, “Bipft: Binary pre-trained foundation transformer with low-rank estimation of binarization residual polynomials,” inProceedings of the AAAI Conference on Artificial Intelligence, vol. 38, no. 14, 2024, pp. 16 094–16 102

  32. [33]

    Bebert: Efficient and robust binary ensemble bert,

    J. Tian, C. Fang, H. Wang, and Z. Wang, “Bebert: Efficient and robust binary ensemble bert,” inICASSP 2023 - 2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2023, pp. 1–5

  33. [34]

    Binary and ternary natural language generation,

    Z. Liu, B. Oguz, A. Pappu, Y. Shi, and R. Krishnamoorthi, “Binary and ternary natural language generation,” inProceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), A. Rogers, J. Boyd-Graber, and N. Okazaki, Eds. Toronto, Canada: Association for Computational Linguistics, Jul. 2023, pp. 65–77. [Onl...

  34. [35]

    Binaryvit: Pushing binary vision transformers towards convolutional models,

    P.-H. C. Le and X. Li, “Binaryvit: Pushing binary vision transformers towards convolutional models,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, June 2023, pp. 4664–4673

  35. [36]

    Bivit: Extremely compressed binary vision transformers,

    Y. He, Z. Lou, L. Zhang, J. Liu, W. Wu, H. Zhou, and B. Zhuang, “Bivit: Extremely compressed binary vision transformers,” in2023 IEEE/CVF International Conference on Computer Vision (ICCV), 2023, pp. 5628–5640

  36. [37]

    Db-llm: Accurate dual-binarization for efficient llms,

    H. Chen, C. Lv, L. Ding, H. Qin, X. Zhou, Y. Ding, X. Liu, M. Zhang, J. Guo, X. Liu, and D. Tao, “Db-llm: Accurate dual-binarization for efficient llms,” inAnnual Meeting of the Association for Computational Linguistics, 2024

  37. [38]

    Pb-llm: Partially binarized large language models,

    Y. Shang, Z. Yuan, Q. Wu, and Z. Dong, “Pb-llm: Partially binarized large language models,”arXiv preprint arXiv:2310.00034, 2023

  38. [39]

    Billm: Pushing the limit of post-training quantization for llms,

    W. Huang, Y. Liu, H. Qin, Y. Li, S. Zhang, X. Liu, M. Magno, and X. Qi, “Billm: Pushing the limit of post-training quantization for llms,” inInternational Conference on Machine Learning, 2024

  39. [40]

    Overcoming oscillations in quantization-aware training,

    M. Nagel, M. Fournarakis, Y. Bondarenko, and T. Blankevoort, “Overcoming oscillations in quantization-aware training,” in Proceedings of the 39th International Conference on Machine Learning, ser. Proceedings of Machine Learning Research, vol. 162, 17–23 Jul 2022, pp. 16318–16330

  40. [41]

    Differentiable soft quantization: Bridging full-precision and low-bit neural networks,

    R. Gong, X. Liu, S. Jiang, T. Li, P. Hu, J. Lin, F. Yu, and J. Yan, “Differentiable soft quantization: Bridging full-precision and low-bit neural networks,” in International Conference on Computer Vision (ICCV), 2019

  41. [42]

    Atom: Low-bit quantization for efficient and accurate llm serving,

    Y. Zhao, C.-Y. Lin, K. Zhu, Z. Ye, L. Chen, S. Zheng, L. Ceze, A. Krishnamurthy, T. Chen, and B. Kasikci, “Atom: Low-bit quantization for efficient and accurate llm serving,” inProceedings of Machine Learning and Systems, P . Gibbons, G. Pekhimenko, and C. D. Sa, Eds., vol. 6, 2024, pp. 196–209. [Online]. Avail- able: https://proceedings.mlsys.org/paper f...

  42. [43]

    Biqgemm: Matrix multiplication with lookup table for binary-coding-based quantized dnns,

    Y. Jeon, B. Park, S. J. Kwon, B. Kim, J. Yun, and D. Lee, “Biqgemm: Matrix multiplication with lookup table for binary-coding-based quantized dnns,” in SC20: International Conference for High Performance Computing, Networking, Storage and Analysis, 2020, pp. 1–14

  43. [44]

    Bit-slicing fpga accelerator for quantized neural networks,

    O. Bilaniuk, S. Wagner, Y. Savaria, and J.-P . David, “Bit-slicing fpga accelerator for quantized neural networks,” in2019 IEEE International Symposium on Circuits and Systems (ISCAS). IEEE, 2019, pp. 1–5

  44. [45]

    Efficient approaches for gemm acceleration on leading ai-optimized fpgas,

    E. Taka, D. Gourounas, A. Gerstlauer, D. Marculescu, and A. Arora, “Efficient approaches for gemm acceleration on leading ai-optimized fpgas,”arXiv preprint arXiv:2404.11066, 2024

  45. [46]

    FINN: A framework for fast, scalable binarized neural network inference,

    Y. Umuroglu, N. J. Fraser, G. Gambardella, M. Blott, P . H. W. Leong, M. Jahre, and K. A. Vissers, “FINN: A framework for fast, scalable binarized neural network inference,”CoRR, vol. abs/1612.07119, 2016

  46. [47]

    Softmap: Software-hardware co-design for integer-only softmax on associative processors,

    M. Rakka, J. Li, G. Dai, A. Eltawil, M. Fouda, and F. J. Kurdahi, “Softmap: Software-hardware co-design for integer-only softmax on associative processors,” inarXiv.org, 2024

  47. [48]

    Sole: Hardware-software co-design of softmax and layernorm for efficient transformer inference,

    W. Wang, S. Zhou, W. Sun, P. Sun, and Y. Liu, “Sole: Hardware-software co-design of softmax and layernorm for efficient transformer inference,” in 2023 IEEE/ACM International Conference on Computer Aided Design (ICCAD), 2023, pp. 1–9

  48. [49]

    Quant-LLM: Accelerating the serving of large language models via FP6-Centric Algorithm-System Co-Design on modern GPUs,

    H. Xia, Z. Zheng, X. Wu, S. Chen, Z. Yao, S. Youn, A. Bakhtiari, M. Wyatt, D. Zhuang, Z. Zhou, O. Ruwase, Y. He, and S. L. Song, “Quant-LLM: Accelerating the serving of large language models via FP6-Centric Algorithm-System Co-Design on modern GPUs,” in2024 USENIX Annual Technical Conference (USENIX ATC 24), Jul. 2024, pp. 699–713

  49. [50]

    GLUE: A Multi-Task Benchmark and Analysis Platform for Natural Language Understanding

    A. Wang, A. Singh, J. Michael, F. Hill, O. Levy, and S. R. Bowman, “Glue: A multi-task benchmark and analysis platform for natural language understanding,”arXiv preprint arXiv:1804.07461, 2018

  50. [51]

    Dynabert: Dynamic bert with adaptive width and depth,

    L. Hou, Z. Huang, L. Shang, X. Jiang, X. Chen, and Q. Liu, “Dynabert: Dynamic bert with adaptive width and depth,”Advances in Neural Information Processing Systems, vol. 33, pp. 9782–9793, 2020

  51. [52]

    Q8bert: Quantized 8bit bert,

    O. Zafrir, G. Boudoukh, P . Izsak, and M. Wasserblat, “Q8bert: Quantized 8bit bert,” in2019 Fifth Workshop on Energy Efficient Machine Learning and Cognitive Computing-NeurIPS Edition (EMC2- NIPS). IEEE, 2019, pp. 36–39

  52. [53]

    Ternarybert: Distillation-aware ultra-low bit bert,

    W. Zhang, L. Hou, Y. Yin, L. Shang, X. Chen, X. Jiang, and Q. Liu, “Ternarybert: Distillation-aware ultra-low bit bert,”arXiv preprint arXiv:2009.12812, 2020

  53. [54]

    Ptq4vit: Post-training quantization framework for vision transformers with twin uniform quantization,

    Z. Yuan, C. Xue, Y. Chen, Q. Wu, and G. Sun, “Ptq4vit: Post-training quantization framework for vision transformers with twin uniform quantization,”arXiv preprint arXiv:2111.12293, 2021

  54. [55]

    Estimating or Propagating Gradients Through Stochastic Neurons for Conditional Computation

    Y. Bengio, N. Leonard, and A. Courville, “Estimating or propagating gradients through stochastic neurons for conditional computation,” arXiv preprint arXiv:1308.3432, 2013

  55. [56]

    Learned step size quantization,

    S. K. Esser, J. L. McKinstry, D. Bablani, R. Appuswamy, and D. S. Modha, “Learned step size quantization,”arXiv preprint arXiv:1902.08153, 2019

  56. [57]

    Training binary neural networks with real-to-binary convolutions,

    B. Martinez, J. Yang, A. Bulat, and G. Tzimiropoulos, “Training binary neural networks with real-to-binary convolutions,”arXiv preprint arXiv:2003.11535, 2020

  57. [58]

    Awq: Activation-aware weight quantization for on-device llm compression and acceleration,

    J. Lin, J. Tang, H. Tang, S. Yang, W.-M. Chen, W.-C. Wang, G. Xiao, X. Dang, C. Gan, and S. Han, “Awq: Activation-aware weight quantization for on-device llm compression and acceleration,” Proceedings of machine learning and systems, vol. 6, pp. 87–100, 2024

  58. [59]

    GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers

    E. Frantar, S. Ashkboos, T. Hoefler, and D. Alistarh, “Gptq: Accurate post-training quantization for generative pre-trained transformers,” arXiv preprint arXiv:2210.17323, 2022

  59. [60]

    Omniquant: Omnidirectionally calibrated quantization for large language models,

    W. Shao, M. Chen, Z. Zhang, P. Xu, L. Zhao, Z. Li, K. Zhang, P. Gao, Y. Qiao, and P. Luo, “Omniquant: Omnidirectionally calibrated quantization for large language models,” arXiv preprint arXiv:2308.13137, 2023

  60. [61]

    LLaMA: Open and Efficient Foundation Language Models

    H. Touvron, T. Lavril, G. Izacard, X. Martinet, M.-A. Lachaux, T. Lacroix, B. Rozière, N. Goyal, E. Hambro, F. Azhar et al., “Llama: Open and efficient foundation language models,” arXiv preprint arXiv:2302.13971, 2023

  61. [62]

    Relevance & comparability. GPUs are the de-facto platform for LLMs; evaluating on a single commodity platform enables fair, reproducible comparisons with existing low-bit baselines and real workloads (prefill/decode), without confounds from heterogeneous boards or toolchains

  62. [63]

    Our work co-optimizes BWTA with GPU realities (MMA tile geometry, register/shared-memory limits, low-bit instruction throughput, packing/layout)

    Co-design within real constraints. Our work co-optimizes BWTA with GPU realities (MMA tile geometry, register/shared-memory limits, low-bit instruction throughput, packing/layout). This is not software-only; the algorithmic choices were made because they map efficiently to GPU bitwise execution paths

  63. [64]

    GPU kernels can be immediately adopted by most frameworks and toolkits, benefiting a wide range of models and inference pipelines

    Community choice. GPU kernels can be immediately adopted by most frameworks and toolkits, benefiting a wide range of models and inference pipelines. We agree that FPGA/ASIC can further improve energy efficiency. However, our GPU-oriented kernels and layouts are not a 1:1 drop-in for FPGA/ASIC due to different instruction sets, memory hierarchies, an...

  64. [65]

    https://docs.nvidia.com/cuda/cublas