pith. machine review for the scientific record.

arxiv: 2604.03957 · v1 · submitted 2026-04-05 · 💻 cs.LG · cs.CL

Recognition: 2 theorem links · Lean Theorem

BWTA: Accurate and Efficient Binarized Transformer by Algorithm-Hardware Co-design

Authors on Pith: no claims yet

Pith reviewed 2026-05-13 17:22 UTC · model grok-4.3

classification 💻 cs.LG · cs.CL
keywords ultra-low-bit quantization · binarized transformer · ternary activation · algorithm-hardware co-design · BERT · large language models · CUDA kernel · quantization-aware training

The pith

Binary weights and ternary activations keep Transformers within about 3.5 percent of full-precision accuracy while their matrix multiplications run 16 to 24 times faster than FP16 on GPUs.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces BWTA, a quantization approach that sets weights to binary values and activations to ternary values by projecting small-magnitude values to zero. It pairs this with a Smooth Multi-Stage Quantization training routine that uses levelwise degradation and magnitude alignment to keep convergence stable. Custom CUDA kernels then handle the resulting binary and ternary matrix multiplications for both linear layers and attention, delivering large speedups at modest accuracy cost on BERT and language models. A sympathetic reader would care because the method shows a concrete path to low-memory, low-latency inference without requiring new hardware or major model redesigns.
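
To make the projection step concrete, here is a minimal sketch of a binary-weight, ternary-activation quantizer in the spirit of the scheme described above. The threshold rule, the per-tensor scales, and the straight-through gradient are illustrative assumptions, not the paper's exact formulation.

```python
import torch

def binarize_weights(w: torch.Tensor) -> torch.Tensor:
    """Binarize weights to {-alpha, +alpha} with a per-tensor scale (assumed scheme)."""
    alpha = w.abs().mean()              # simple magnitude-preserving scale
    return alpha * torch.sign(w)

def ternarize_activations(x: torch.Tensor, threshold_ratio: float = 0.05) -> torch.Tensor:
    """Ternarize activations to {-beta, 0, +beta}, projecting small-magnitude values to zero.

    `threshold_ratio` is a hypothetical hyper-parameter; the paper's actual
    projection rule is not specified in the material above.
    """
    tau = threshold_ratio * x.abs().max()
    mask = (x.abs() > tau).to(x.dtype)                      # zero-projection of tiny values
    beta = (x.abs() * mask).sum() / mask.sum().clamp(min=1)
    return beta * torch.sign(x) * mask

class STEQuant(torch.autograd.Function):
    """Straight-through estimator so gradients flow through the quantizer during QAT."""
    @staticmethod
    def forward(ctx, x, quantize_fn):
        return quantize_fn(x)
    @staticmethod
    def backward(ctx, grad_out):
        return grad_out, None

# Usage: x_q = STEQuant.apply(x, ternarize_activations)
```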

Core claim

BWTA projects tiny values to zero during the binarization of weights and ternarization of activations, trains the model with Smooth Multi-Stage Quantization, which combines a Levelwise Degradation Strategy with a Magnitude-Alignment Projection Factor, and supplies a BWTA MatMul CUDA kernel that uses bit-packing; this combination keeps the average GLUE drop at 3.5 percent for BERT and maintains comparable perplexity for LLMs while delivering a 16-24x kernel-level speedup over FP16.

What carries the argument

The BWTA scheme that binarizes weights and ternarizes activations by projecting tiny values to zero, supported by Smooth Multi-Stage Quantization training and a custom instruction-level-parallel CUDA MatMul kernel.
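
The Smooth Multi-Stage Quantization routine is only characterized at a high level here. As one illustration of the general idea of gradually tightening the quantization grid over training, a hypothetical levelwise schedule might look like the sketch below; the stage boundaries and level counts are assumptions, not the paper's actual Levelwise Degradation Strategy.

```python
import torch

def multistage_levels(step: int, total_steps: int) -> int:
    """Hypothetical levelwise schedule: shrink the activation grid in stages,
    ending at the ternary grid {-1, 0, +1}. Boundaries and counts are assumptions."""
    stages = [(0.25, 15), (0.50, 7), (0.75, 5), (1.00, 3)]  # (training fraction, #levels)
    progress = step / max(1, total_steps)
    for frac, levels in stages:
        if progress <= frac:
            return levels
    return 3

def quantize_to_levels(x: torch.Tensor, levels: int) -> torch.Tensor:
    """Uniform symmetric fake-quantization onto an odd number of levels."""
    half = max((levels - 1) // 2, 1)
    scale = (x.abs().max() / half).clamp(min=1e-8)
    return torch.round(x / scale).clamp(-half, half) * scale
```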

If this is right

  • BERT models under BWTA show an average 3.5 percent GLUE drop and less than 2 percent drop on five additional tasks.
  • Large language models quantized with BWTA retain comparable perplexity and task accuracy to their full-precision versions.
  • The custom CUDA kernel delivers 16 to 24 times speedup over FP16 at the matrix-multiplication level and 216 to 330 tokens per second end-to-end prefill on LLMs.
  • Memory footprint shrinks because weights and activations occupy only one or two bits per value (see the packing sketch below).
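
The memory claim follows from simple bit arithmetic once weights are packed. A minimal host-side sketch of packing sign bits into bytes, and the resulting footprint relative to FP16, is shown below; the layout is an illustrative assumption, not the kernel's actual packing format (ternary activations would need two bits per value rather than one).

```python
import numpy as np

def pack_signs(w: np.ndarray) -> np.ndarray:
    """Pack weight sign bits at 1 bit per weight (8 weights per byte).
    Illustrative host-side layout only; the BWTA kernel packs on the GPU."""
    bits = (w.ravel() >= 0).astype(np.uint8)
    return np.packbits(bits)

w = np.random.randn(4096, 4096).astype(np.float16)
packed = pack_signs(w)
print(f"FP16: {w.nbytes / 2**20:.1f} MiB, packed binary: {packed.nbytes / 2**20:.1f} MiB, "
      f"{w.nbytes / packed.nbytes:.0f}x smaller")
```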

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same projection-and-kernel pattern could be tested on other attention-based architectures such as vision transformers to check whether the accuracy preservation holds beyond language.
  • If the zero-projection rule generalizes, similar custom kernels might be written for additional low-bit formats on the same GPU hardware without waiting for new instruction sets.
  • The reported token-per-second numbers suggest BWTA could be used to serve larger models on existing server GPUs before new accelerator hardware arrives.

Load-bearing premise

Projecting tiny values to zero together with the Smooth Multi-Stage Quantization procedure will preserve accuracy across Transformer models and tasks without needing architecture-specific retuning.

What would settle it

Running the published BWTA procedure on a standard BERT-base model and observing an average GLUE score drop larger than 5 percent compared with the full-precision baseline would falsify the near-full-precision claim.
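
The check itself is a one-line computation over per-task deltas. A small sketch with placeholder scores (not the paper's reported numbers):

```python
def average_glue_drop(fp32: dict, quantized: dict) -> float:
    """Average drop in GLUE score, in percentage points, across tasks."""
    return sum(fp32[t] - quantized[t] for t in fp32) / len(fp32)

# Placeholder scores for illustration only; not results from the paper.
fp32 = {"MNLI": 84.0, "QQP": 91.0, "SST-2": 93.0, "CoLA": 60.0}
bwta = {"MNLI": 80.0, "QQP": 88.0, "SST-2": 91.0, "CoLA": 54.0}

drop = average_glue_drop(fp32, bwta)
print(f"average drop: {drop:.2f} points; falsifies the claim: {drop > 5.0}")
```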

Figures

Figures reproduced from arXiv: 2604.03957 by Jinyang Guo, Jiwen Lu, Shenghao Jin, Xianglong Liu, Yifu Ding.

Figure 2: Histograms for binary/ternary activation in Self-Attention. [PITH_FULL_IMAGE:figures/full_fig_p003_2.png]
Figure 4: (a) The illustration of the bit/levelwise multi-stage … [PITH_FULL_IMAGE:figures/full_fig_p004_4.png]
Figure 6: Instruction-level parallel bitpack from 32 FP16 (Half) values … [PITH_FULL_IMAGE:figures/full_fig_p006_6.png]
Figure 10: Comparison of different re-initialization strategies … [PITH_FULL_IMAGE:figures/full_fig_p007_10.png]
Figure 11: Overall time comparisons for each step in the GEMM kernels with four typical shapes. [PITH_FULL_IMAGE:figures/full_fig_p009_11.png]
Figure 13: Task loss surface of quantized models and the full … [PITH_FULL_IMAGE:figures/full_fig_p010_13.png]
Figure 12: The quantized activation of (a) bitwise and (b) levelwise … [PITH_FULL_IMAGE:figures/full_fig_p010_12.png]
Figure 14: The curves of scaling factors and their gradients. [PITH_FULL_IMAGE:figures/full_fig_p015_14.png]
Original abstract

Ultra low-bit quantization brings substantial efficiency for Transformer-based models, but the accuracy degradation and limited GPU support hinder its wide usage. In this paper, we analyze zero-point distortion in binarization and propose a Binary Weights & Ternary Activations (BWTA) quantization scheme, which projects tiny values to zero and preserves the accuracy of extremely low-bit models. For training, we propose Smooth Multi-Stage Quantization, combining a Levelwise Degradation Strategy and a Magnitude-Alignment Projection Factor to enable stable and fast convergence. For inference, we develop a BWTA MatMul CUDA kernel with instruction-level parallel bit-packing and comprehensive binary/ternary MatMul implementations for both linear and attention operators, allowing seamless integration across Transformer architectures. Experiments show that BWTA approaches full-precision performance for BERT, with an average 3.5% drop on GLUE and less than 2% drop on five tasks, and achieves comparable perplexity and accuracy for LLMs. In efficiency, it delivers 16 to 24 times kernel-level speedup over FP16 on NVIDIA GPUs, and 216 to 330 tokens/s end-to-end prefill speedup with lower memory footprint on LLMs. As an algorithm-hardware co-design, BWTA demonstrates practical, low-latency ultra-low-bit inference without sacrificing model quality.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper proposes Binary Weights & Ternary Activations (BWTA) quantization for Transformer models, which projects tiny values to zero to mitigate zero-point distortion in binarization. Training uses Smooth Multi-Stage Quantization combining Levelwise Degradation Strategy and Magnitude-Alignment Projection Factor for stable convergence. Inference relies on a custom BWTA MatMul CUDA kernel with bit-packing for linear and attention operators. Experiments claim near full-precision results: 3.5% average GLUE drop for BERT-base, <2% on five tasks, comparable LLM perplexity/accuracy, plus 16-24x kernel speedup over FP16 and 216-330 tokens/s end-to-end prefill.

Significance. If the empirical claims hold with proper validation, the work would demonstrate a practical algorithm-hardware co-design for ultra-low-bit Transformer inference that achieves substantial efficiency gains while preserving model quality, addressing key barriers to deploying binarized/ternary models on GPUs.

major comments (3)
  1. [Experiments] Experimental section: the headline accuracy claims (3.5% GLUE drop, <2% on five tasks, comparable LLM metrics) are presented without error bars, standard deviations across runs, or ablation studies isolating the zero-projection step versus the Smooth Multi-Stage Quantization components; this makes it impossible to assess whether the reported drops are statistically stable or architecture-specific.
  2. [§3.2] §3.2 (Smooth Multi-Stage Quantization): the Magnitude-Alignment Projection Factor is introduced as a free hyper-parameter without a scaling analysis or bound showing when the induced distortion remains negligible as model depth or width increases; the assumption that it generalizes across BERT and LLMs without per-architecture retuning is load-bearing for the central accuracy claim but unsupported by the provided evidence.
  3. [Inference kernel] Kernel implementation (BWTA MatMul CUDA kernel): the 16-24x speedup and seamless integration claims rest on an unverified custom kernel; no micro-benchmark tables compare against cuBLAS/FP16 baselines under identical batch/size conditions, nor is there confirmation that the bit-packing preserves numerical equivalence to the quantized forward pass.
minor comments (2)
  1. [Abstract] Abstract and §4: the phrase 'approaches full-precision performance' is used without a precise definition (e.g., within X% of FP32 on all tasks); a table summarizing per-task deltas would improve clarity.
  2. [§3] Notation: the Levelwise Degradation Strategy and Magnitude-Alignment Projection Factor are described at a high level; explicit pseudocode or equations for the projection threshold and degradation schedule would aid reproducibility.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive feedback on experimental rigor, the analysis of the projection factor, and kernel verification. We address each major comment below and will incorporate revisions to strengthen the manuscript.

Point-by-point responses
  1. Referee: [Experiments] Experimental section: the headline accuracy claims (3.5% GLUE drop, <2% on five tasks, comparable LLM metrics) are presented without error bars, standard deviations across runs, or ablation studies isolating the zero-projection step versus the Smooth Multi-Stage Quantization components; this makes it impossible to assess whether the reported drops are statistically stable or architecture-specific.

    Authors: We agree that error bars and component ablations would improve statistical assessment. In the revised manuscript we will report all GLUE and LLM metrics as means over at least three random seeds with standard deviations. We will also add a dedicated ablation subsection that isolates the zero-projection step from the Levelwise Degradation Strategy and Magnitude-Alignment Projection Factor, quantifying their individual and joint contributions to final accuracy. revision: yes

  2. Referee: [§3.2] §3.2 (Smooth Multi-Stage Quantization): the Magnitude-Alignment Projection Factor is introduced as a free hyper-parameter without a scaling analysis or bound showing when the induced distortion remains negligible as model depth or width increases; the assumption that it generalizes across BERT and LLMs without per-architecture retuning is load-bearing for the central accuracy claim but unsupported by the provided evidence.

    Authors: The factor is fixed at 0.1 for all models without retuning. While we do not derive a closed-form bound on distortion scaling, we empirically validate the choice across BERT-base/large and LLMs up to 7B parameters. In revision we will add a sensitivity study varying the factor on models of different widths and depths, together with an empirical analysis of activation magnitude distributions showing why the induced distortion stays small. revision: partial

  3. Referee: [Inference kernel] Kernel implementation (BWTA MatMul CUDA kernel): the 16-24x speedup and seamless integration claims rest on an unverified custom kernel; no micro-benchmark tables compare against cuBLAS/FP16 baselines under identical batch/size conditions, nor is there confirmation that the bit-packing preserves numerical equivalence to the quantized forward pass.

    Authors: We will add a micro-benchmark table comparing the BWTA MatMul kernel against cuBLAS FP16 for the exact matrix shapes and batch/sequence configurations used in the BERT and LLM experiments. We will also include a numerical equivalence verification section showing that the bit-packed kernel matches a reference quantized implementation to within 1e-6 maximum absolute error. revision: yes
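
To make the proposed equivalence check concrete, here is a minimal host-side sketch: it compares an integer reference matmul over quantized operands against a second execution path standing in for the bit-packed kernel. A real verification would call the CUDA kernel on GPU-resident tensors; everything here is an assumption about the test harness, not the authors' code.

```python
import numpy as np

rng = np.random.default_rng(0)

# Quantized operands: binary weights in {-1, +1}, ternary activations in {-1, 0, +1}.
W = rng.choice([-1, 1], size=(256, 256)).astype(np.int8)
A = rng.choice([-1, 0, 1], size=(64, 256)).astype(np.int8)

# Reference path: plain integer matmul over the quantized values.
ref = A.astype(np.int32) @ W.astype(np.int32).T

# Stand-in for the bit-packed kernel output (a real check would invoke the CUDA kernel here).
kernel_out = (A.astype(np.float32) @ W.astype(np.float32).T).astype(np.int32)

max_abs_err = np.abs(ref - kernel_out).max()
print("max abs error:", max_abs_err)  # expect exact agreement for integer arithmetic
assert max_abs_err == 0
```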

Circularity Check

0 steps flagged

No circularity; empirical co-design validated by experiments

Full rationale

The paper presents BWTA as an algorithm-hardware co-design consisting of a zero-projection binarization scheme, Smooth Multi-Stage Quantization training (Levelwise Degradation + Magnitude-Alignment Projection Factor), and a custom CUDA MatMul kernel. No equations, derivations, or predictions are shown that reduce by construction to fitted parameters, self-defined quantities, or self-citation chains. Performance numbers (3.5% GLUE drop, 16-24x speedup) are reported as direct experimental outcomes on BERT and LLMs rather than outputs forced from the method's own inputs.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 1 invented entity

The central claims rest on the domain assumption that zero-point projection in binarization can be compensated by magnitude alignment during training, plus standard assumptions about GPU instruction behavior for bit operations.

free parameters (1)
  • Magnitude-Alignment Projection Factor
    Introduced to stabilize convergence in the multi-stage quantization process; its value is not specified in the abstract.
axioms (1)
  • domain assumption: Gradual introduction of quantization constraints via levelwise degradation enables stable training of low-bit models.
    Core premise of the Smooth Multi-Stage Quantization strategy described in the abstract.
invented entities (1)
  • BWTA MatMul CUDA kernel (no independent evidence)
    purpose: Provides instruction-level parallel bit-packing for binary/ternary matrix multiplications in linear and attention layers.
    New implementation artifact required for the claimed speedups; no independent evidence of correctness beyond the abstract's performance numbers.

pith-pipeline@v0.9.0 · 5545 in / 1403 out tokens · 34772 ms · 2026-05-13T17:22:39.541516+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
  • matches: The paper's claim is directly supported by a theorem in the formal canon.
  • supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
  • extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
  • uses: The paper appears to rely on the theorem as machinery.
  • contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
  • unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

64 extracted references · 64 canonical work pages · 4 internal anchors

  1. [1]

    Analyzing the Structure of Attention in a Transformer Language Model

    J. Vig and Y. Belinkov, “Analyzing the structure of attention in a transformer language model,”arXiv preprint arXiv:1906.04284, 2019

  2. [2]

    A survey on semi-supervised learning,

    J. E. Van Engelen and H. H. Hoos, “A survey on semi-supervised learning,”Machine learning, vol. 109, no. 2, pp. 373–440, 2020

  3. [3]

    Multimodal learning with transformers: A survey,

    P . Xu, X. Zhu, and D. A. Clifton, “Multimodal learning with transformers: A survey,”IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 45, no. 10, pp. 12 113–12 132, 2023

  4. [4]

    Crossformer++: A versatile vision transformer hinging on cross-scale attention,

    W. Wang, W. Chen, Q. Qiu, L. Chen, B. Wu, B. Lin, X. He, and W. Liu, “Crossformer++: A versatile vision transformer hinging on cross-scale attention,”IEEE Transactions on Pattern Analysis and Machine Intelligence, 2023

  5. [5]

    A practical survey on faster and lighter transformers,

    Q. Fournier, G. M. Caron, and D. Aloise, “A practical survey on faster and lighter transformers,”ACM Computing Surveys, vol. 55, no. 14s, pp. 1–40, 2023

  6. [6]

    Towards accurate and compact architectures via neural architecture transformer,

    Y. Guo, Y. Zheng, M. Tan, Q. Chen, Z. Li, J. Chen, P . Zhao, and J. Huang, “Towards accurate and compact architectures via neural architecture transformer,”IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 44, no. 10, pp. 6501–6516, 2021

  7. [7]

    Q-vit: Accurate and fully quantized low-bit vision transformer,

    Y. Li, S. Xu, B. Zhang, X. Cao, P . Gao, and G. Guo, “Q-vit: Accurate and fully quantized low-bit vision transformer,”Advances in neural information processing systems, vol. 35, pp. 34 451–34 463, 2022

  8. [8]

    SQuant: On-the-fly data-free quantization via diagonal hessian approximation,

    C. Guo, Y. Qiu, J. Leng, X. Gao, C. Zhang, Y. Liu, F. Yang, Y. Zhu, and M. Guo, “SQuant: On-the-fly data-free quantization via diagonal hessian approximation,” inInternational Conference on Learning Representations, 2022

  9. [9]

    Quantization and training of neural networks for efficient integer-arithmetic-only inference,

    B. Jacob, S. Kligys, B. Chen, M. Zhu, M. Tang, A. Howard, H. Adam, and D. Kalenichenko, “Quantization and training of neural networks for efficient integer-arithmetic-only inference,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 2704–2713

  10. [11]

    Knowledge distillation via the target-aware transformer,

    S. Lin, H. Xie, B. Wang, K. Yu, X. Chang, X. Liang, and G. Wang, “Knowledge distillation via the target-aware transformer,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 10 915–10 924

  11. [12]

    Dearkd: data-efficient early knowledge distillation for vision transformers,

    X. Chen, Q. Cao, Y. Zhong, J. Zhang, S. Gao, and D. Tao, “Dearkd: data-efficient early knowledge distillation for vision transformers,” inProceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2022, pp. 12 052–12 062

  12. [13]

    Knowledge distillation and student-teacher learning for visual intelligence: A review and new outlooks,

    L. Wang and K.-J. Yoon, “Knowledge distillation and student-teacher learning for visual intelligence: A review and new outlooks,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 44, no. 6, pp. 3048–3068, 2021

  13. [14]

    Tprune: Efficient transformer pruning for mobile devices,

    J. Mao, H. Yang, A. Li, H. Li, and Y. Chen, “Tprune: Efficient transformer pruning for mobile devices,”ACM Transactions on Cyber-Physical Systems, vol. 5, no. 3, pp. 1–22, 2021

  14. [15]

    Width & depth pruning for vision transformers,

    F. Yu, K. Huang, M. Wang, Y. Cheng, W. Chu, and L. Cui, “Width & depth pruning for vision transformers,” inProceedings of the AAAI Conference on Artificial Intelligence, vol. 36, no. 3, 2022, pp. 3143–3151

  15. [16]

    A survey on deep neural network pruning: Taxonomy, comparison, analysis, and recommendations,

    H. Cheng, M. Zhang, and J. Q. Shi, “A survey on deep neural network pruning: Taxonomy, comparison, analysis, and recommendations,” IEEE Transactions on Pattern Analysis and Machine Intelligence, 2024

  16. [17]

    Shapeshifter: a parameter-efficient transformer using factorized reshaped matrices,

    A. Panahi, S. Saeedi, and T. Arodz, “Shapeshifter: a parameter-efficient transformer using factorized reshaped matrices,” Advances in Neural Information Processing Systems, vol. 34, pp. 1337–1350, 2021

  17. [18]

    Subformer: Exploring weight sharing for parameter efficiency in generative transformers,

    M. Reid, E. Marrese-Taylor, and Y. Matsuo, “Subformer: Exploring weight sharing for parameter efficiency in generative transformers,” arXiv preprint arXiv:2101.00234, 2021

  18. [19]

    Qlora: Efficient finetuning of quantized llms,

    T. Dettmers, A. Pagnoni, A. Holtzman, and L. Zettlemoyer, “Qlora: Efficient finetuning of quantized llms,” Advances in Neural Information Processing Systems, vol. 36, 2024

  19. [20]

    Lite-mono: A lightweight cnn and transformer architecture for self-supervised monocular depth estimation,

    N. Zhang, F. Nex, G. Vosselman, and N. Kerle, “Lite-mono: A lightweight cnn and transformer architecture for self-supervised monocular depth estimation,” inProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 18 537–18 546

  20. [21]

    Towards lightweight transformer via group-wise transformation for vision-and-language tasks,

    G. Luo, Y. Zhou, X. Sun, Y. Wang, L. Cao, Y. Wu, F. Huang, and R. Ji, “Towards lightweight transformer via group-wise transformation for vision-and-language tasks,” IEEE Transactions on Image Processing, vol. 31, pp. 3386–3398, 2022

  21. [22]

    Bibert: Accurate fully binarized bert,

    H. Qin, Y. Ding, M. Zhang, Q. Yan, A. Liu, Q. Dang, Z. Liu, and X. Liu, “Bibert: Accurate fully binarized bert,”arXiv preprint arXiv:2203.06390, 2022

  22. [23]

    Learning efficient binarized object detectors with information compression,

    Z. Wang, J. Lu, Z. Wu, and J. Zhou, “Learning efficient binarized object detectors with information compression,”IEEE Transactions on Pattern Analysis and Machine Intelligence, pp. 1–1, 2021

  23. [24]

    Learning channel-wise interactions for binary convolutional neural networks,

    Z. Wang, J. Lu, and J. Zhou, “Learning channel-wise interactions for binary convolutional neural networks,”IEEE Transactions on Pattern Analysis and Machine Intelligence, pp. 1–1, 2020

  24. [25]

    Hierarchical binary cnns for landmark localization with limited resources,

    A. Bulat and G. Tzimiropoulos, “Hierarchical binary cnns for landmark localization with limited resources,”IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 42, no. 2, pp. 343–356, 2020

  25. [26]

    Binaryformer: A hierarchical-adaptive binary vision transformer (vit) for efficient computing,

    M. Wang, Z. Xu, B. Zheng, and W. Xie, “Binaryformer: A hierarchical-adaptive binary vision transformer (vit) for efficient computing,”IEEE Transactions on Industrial Informatics, 2024

  26. [27]

    Bitnet: Scaling 1- bit transformers for large language models.arXiv preprint arXiv:2310.11453,

    H. Wang, S. Ma, L. Dong, S. Huang, H. Wang, L. Ma, F. Yang, R. Wang, Y. Wu, and F. Wei, “Bitnet: Scaling 1-bit transformers for large language models,”arXiv preprint arXiv:2310.11453, 2023

  27. [28]

    Scalable matmul-free language modeling,

    R.-J. Zhu, Y. Zhang, S. Abreu, E. Sifferman, T. Sheaves, Y. Wang, D. Richmond, S. B. Shrestha, P . Zhou, and J. K. Eshraghian, “Scalable matmul-free language modeling,”arXiv preprint arXiv:2406.02528, 2024

  28. [29]

    BinaryBERT: Pushing the limit of BERT quantization,

    H. Bai, W. Zhang, L. Hou, L. Shang, J. Jin, X. Jiang, Q. Liu, M. Lyu, and I. King, “BinaryBERT: Pushing the limit of BERT quantization,” inProceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), C. Zong, F. Xia, W. Li, and R....

  29. [30]

    Bit: Robustly binarized multi-distilled transformer,

    Z. Liu, B. Oguz, A. Pappu, L. Xiao, S. Yih, M. Li, R. Krishnamoorthi, and Y. Mehdad, “Bit: Robustly binarized multi-distilled transformer,” Advances in Neural Information Processing Systems, vol. 35, pp. 14303–14316, 2022

  30. [31]

    Mlbert: Multi-level fully binarized bert,

    M. M. Nasab, M. Fakhire, M. E. Salehi, and M. Modarresi, “Mlbert: Multi-level fully binarized bert,” in2024 1st International Confer- ence on Innovative Engineering Sciences and Technological Research (ICIESTR), 2024, pp. 1–6

  31. [32]

    Bipft: Binary pre-trained foundation transformer with low-rank estimation of binarization residual polynomials,

    X. Xing, L. Du, X. Wang, X. Zeng, Y. Wang, Z. Zhang, and J. Zhang, “Bipft: Binary pre-trained foundation transformer with low-rank estimation of binarization residual polynomials,” inProceedings of the AAAI Conference on Artificial Intelligence, vol. 38, no. 14, 2024, pp. 16 094–16 102

  32. [33]

    Bebert: Efficient and robust binary ensemble bert,

    J. Tian, C. Fang, H. Wang, and Z. Wang, “Bebert: Efficient and robust binary ensemble bert,” inICASSP 2023 - 2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2023, pp. 1–5

  33. [34]

    Binary and ternary natural language generation,

    Z. Liu, B. Oguz, A. Pappu, Y. Shi, and R. Krishnamoorthi, “Binary and ternary natural language generation,” inProceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), A. Rogers, J. Boyd-Graber, and N. Okazaki, Eds. Toronto, Canada: Association for Computational Linguistics, Jul. 2023, pp. 65–77. [Onl...

  34. [35]

    Binaryvit: Pushing binary vision transformers towards convolutional models,

    P.-H. C. Le and X. Li, “Binaryvit: Pushing binary vision transformers towards convolutional models,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, June 2023, pp. 4664–4673

  35. [36]

    Bivit: Extremely compressed binary vision transformers,

    Y. He, Z. Lou, L. Zhang, J. Liu, W. Wu, H. Zhou, and B. Zhuang, “Bivit: Extremely compressed binary vision transformers,” in2023 IEEE/CVF International Conference on Computer Vision (ICCV), 2023, pp. 5628–5640

  36. [37]

    Db-llm: Accurate dual-binarization for efficient llms,

    H. Chen, C. Lv, L. Ding, H. Qin, X. Zhou, Y. Ding, X. Liu, M. Zhang, J. Guo, X. Liu, and D. Tao, “Db-llm: Accurate dual-binarization for efficient llms,” inAnnual Meeting of the Association for Computational Linguistics, 2024

  37. [38]

    Pb-llm: Partially binarized large language models,

    Y. Shang, Z. Yuan, Q. Wu, and Z. Dong, “Pb-llm: Partially binarized large language models,”arXiv preprint arXiv:2310.00034, 2023

  38. [39]

    Billm: Pushing the limit of post-training quantization for llms,

    W. Huang, Y. Liu, H. Qin, Y. Li, S. Zhang, X. Liu, M. Magno, and X. Qi, “Billm: Pushing the limit of post-training quantization for llms,” inInternational Conference on Machine Learning, 2024

  39. [40]

    Overcoming oscillations in quantization-aware training,

    M. Nagel, M. Fournarakis, Y. Bondarenko, and T. Blankevoort, “Overcoming oscillations in quantization-aware training,” in Proceedings of the 39th International Conference on Machine Learning, ser. Proceedings of Machine Learning Research, vol. 162, 17–23 Jul 2022, pp. 16318–16330

  40. [41]

    Differentiable soft quantization: Bridging full-precision and low-bit neural networks,

    R. Gong, X. Liu, S. Jiang, T. Li, P. Hu, J. Lin, F. Yu, and J. Yan, “Differentiable soft quantization: Bridging full-precision and low-bit neural networks,” in International Conference on Computer Vision (ICCV), 2019

  41. [42]

    Atom: Low-bit quantization for efficient and accurate llm serving,

    Y. Zhao, C.-Y. Lin, K. Zhu, Z. Ye, L. Chen, S. Zheng, L. Ceze, A. Krishnamurthy, T. Chen, and B. Kasikci, “Atom: Low-bit quantization for efficient and accurate llm serving,” inProceedings of Machine Learning and Systems, P . Gibbons, G. Pekhimenko, and C. D. Sa, Eds., vol. 6, 2024, pp. 196–209. [Online]. Avail- able: https://proceedings.mlsys.org/paper f...

  42. [43]

    Biqgemm: Matrix multiplication with lookup table for binary-coding-based quantized dnns,

    Y. Jeon, B. Park, S. J. Kwon, B. Kim, J. Yun, and D. Lee, “Biqgemm: Matrix multiplication with lookup table for binary-coding-based quantized dnns,” in SC20: International Conference for High Performance Computing, Networking, Storage and Analysis, 2020, pp. 1–14

  43. [44]

    Bit-slicing fpga accelerator for quantized neural networks,

    O. Bilaniuk, S. Wagner, Y. Savaria, and J.-P . David, “Bit-slicing fpga accelerator for quantized neural networks,” in2019 IEEE International Symposium on Circuits and Systems (ISCAS). IEEE, 2019, pp. 1–5

  44. [45]

    Efficient approaches for gemm acceleration on leading ai-optimized fpgas,

    E. Taka, D. Gourounas, A. Gerstlauer, D. Marculescu, and A. Arora, “Efficient approaches for gemm acceleration on leading ai-optimized fpgas,”arXiv preprint arXiv:2404.11066, 2024

  45. [46]

    FINN: A framework for fast, scalable binarized neural network inference,

    Y. Umuroglu, N. J. Fraser, G. Gambardella, M. Blott, P . H. W. Leong, M. Jahre, and K. A. Vissers, “FINN: A framework for fast, scalable binarized neural network inference,”CoRR, vol. abs/1612.07119, 2016

  46. [47]

    Softmap: Software-hardware co-design for integer-only softmax on associative processors,

    M. Rakka, J. Li, G. Dai, A. Eltawil, M. Fouda, and F. J. Kurdahi, “Softmap: Software-hardware co-design for integer-only softmax on associative processors,” inarXiv.org, 2024

  47. [48]

    Sole: Hardware-software co-design of softmax and layernorm for efficient transformer inference,

    W. Wang, S. Zhou, W. Sun, P. Sun, and Y. Liu, “Sole: Hardware-software co-design of softmax and layernorm for efficient transformer inference,” in 2023 IEEE/ACM International Conference on Computer Aided Design (ICCAD), 2023, pp. 1–9

  48. [49]

    Quant-LLM: Accelerating the serving of large language models via FP6-Centric Algorithm-System Co-Design on modern GPUs,

    H. Xia, Z. Zheng, X. Wu, S. Chen, Z. Yao, S. Youn, A. Bakhtiari, M. Wyatt, D. Zhuang, Z. Zhou, O. Ruwase, Y. He, and S. L. Song, “Quant-LLM: Accelerating the serving of large language models via FP6-Centric Algorithm-System Co-Design on modern GPUs,” in2024 USENIX Annual Technical Conference (USENIX ATC 24), Jul. 2024, pp. 699–713

  49. [50]

    GLUE: A Multi-Task Benchmark and Analysis Platform for Natural Language Understanding

    A. Wang, A. Singh, J. Michael, F. Hill, O. Levy, and S. R. Bowman, “Glue: A multi-task benchmark and analysis platform for natural language understanding,”arXiv preprint arXiv:1804.07461, 2018

  50. [51]

    Dynabert: Dynamic bert with adaptive width and depth,

    L. Hou, Z. Huang, L. Shang, X. Jiang, X. Chen, and Q. Liu, “Dynabert: Dynamic bert with adaptive width and depth,”Advances in Neural Information Processing Systems, vol. 33, pp. 9782–9793, 2020

  51. [52]

    Q8bert: Quantized 8bit bert,

    O. Zafrir, G. Boudoukh, P . Izsak, and M. Wasserblat, “Q8bert: Quantized 8bit bert,” in2019 Fifth Workshop on Energy Efficient Machine Learning and Cognitive Computing-NeurIPS Edition (EMC2- NIPS). IEEE, 2019, pp. 36–39

  52. [53]

    Ternarybert: Distillation-aware ultra-low bit bert,

    W. Zhang, L. Hou, Y. Yin, L. Shang, X. Chen, X. Jiang, and Q. Liu, “Ternarybert: Distillation-aware ultra-low bit bert,”arXiv preprint arXiv:2009.12812, 2020

  53. [54]

    Ptq4vit: Post-training quantization framework for vision transformers with twin uniform quantization,

    Z. Yuan, C. Xue, Y. Chen, Q. Wu, and G. Sun, “Ptq4vit: Post-training quantization framework for vision transformers with twin uniform quantization,”arXiv preprint arXiv:2111.12293, 2021

  54. [55]

    Estimating or Propagating Gradients Through Stochastic Neurons for Conditional Computation

    Y. Bengio, N. Leonard, and A. Courville, “Estimating or propagating gradients through stochastic neurons for conditional computation,” arXiv preprint arXiv:1308.3432, 2013

  55. [56]

    Learned step size quantization,

    S. K. Esser, J. L. McKinstry, D. Bablani, R. Appuswamy, and D. S. Modha, “Learned step size quantization,”arXiv preprint arXiv:1902.08153, 2019

  56. [57]

    Training binary neural networks with real-to-binary convolutions,

    B. Martinez, J. Yang, A. Bulat, and G. Tzimiropoulos, “Training binary neural networks with real-to-binary convolutions,”arXiv preprint arXiv:2003.11535, 2020

  57. [58]

    Awq: Activation-aware weight quantization for on-device llm compression and acceleration,

    J. Lin, J. Tang, H. Tang, S. Yang, W.-M. Chen, W.-C. Wang, G. Xiao, X. Dang, C. Gan, and S. Han, “Awq: Activation-aware weight quantization for on-device llm compression and acceleration,” Proceedings of machine learning and systems, vol. 6, pp. 87–100, 2024

  58. [59]

    GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers

    E. Frantar, S. Ashkboos, T. Hoefler, and D. Alistarh, “Gptq: Accurate post-training quantization for generative pre-trained transformers,” arXiv preprint arXiv:2210.17323, 2022

  59. [60]

    Omniquant: Omnidirectionally calibrated quantization for large language models,

    W. Shao, M. Chen, Z. Zhang, P. Xu, L. Zhao, Z. Li, K. Zhang, P. Gao, Y. Qiao, and P. Luo, “Omniquant: Omnidirectionally calibrated quantization for large language models,” arXiv preprint arXiv:2308.13137, 2023

  60. [61]

    LLaMA: Open and Efficient Foundation Language Models

    H. Touvron, T. Lavril, G. Izacard, X. Martinet, M.-A. Lachaux, T. Lacroix, B. Rozière, N. Goyal, E. Hambro, F. Azhar et al., “Llama: Open and efficient foundation language models,” arXiv preprint arXiv:2302.13971, 2023

  61. [62]

    Relevance & comparability. GPUs are the de-facto platform for LLMs; evaluating on a single commodity platform enables fair, reproducible comparisons with existing low-bit baselines and real workloads (prefill/decode), without confounds from heterogeneous boards or toolchains

  62. [63]

    Our work co-optimizes BWTA with GPU realities (MMA tile geometry, register/shared-memory limits, low-bit instruction throughput, packing/layout)

    Co-design within real constraints. Our work co-optimizes BWTA with GPU realities (MMA tile geometry, register/shared-memory limits, low-bit instruction throughput, packing/layout). This is not software-only; the algorithmic choices were made because they map efficiently to GPU bitwise execution paths

  63. [64]

    GPU kernels can be immediately adopted by most frameworks and toolkits, benefiting a wide range of models and inference pipelines

    Community choice. GPU kernels can be immediately adopted by most frameworks and toolkits, benefiting a wide range of models and inference pipelines. We agree that FPGA/ASIC can further improve energy efficiency. However, our GPU-oriented kernels and layouts are not a 1:1 drop-in for FPGA/ASIC due to different instruction sets, memory hierarchies, an...

  64. [65]

    https://docs.nvidia.com/cuda/cublas