pith. machine review for the scientific record. sign in

arxiv: 2604.19167 · v1 · submitted 2026-04-21 · 💻 cs.LG · cs.AI

Recognition: unknown

LBLLM: Lightweight Binarization of Large Language Models via Three-Stage Distillation

Chuang Wang, Siqing Song, Xu-Yao Zhang, Yi Yang, Yong Lang

Authors on Pith no claims yet

Pith reviewed 2026-05-10 02:59 UTC · model grok-4.3

classification 💻 cs.LG cs.AI
keywords binarizationquantizationlarge language modelsdistillationmodel compressionefficient inferencelow-bit models
0
0 comments X

The pith

A three-stage distillation process enables effective binarization of large language models to W(1+1)A4 precision using only 0.016 billion tokens on a single GPU.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces LBLLM, a framework that binarizes LLMs through three sequential stages: post-training quantization to initialize a high-quality model, layer-wise distillation to binarize weights while keeping activations in full precision, and training of learnable factors to quantize activations to 4 bits. This decoupled sequence reduces interference between weight and activation changes, producing more stable training and higher final accuracy than joint quantization. The method requires far less data and compute than prior approaches yet exceeds state-of-the-art binarization results on language modeling, commonsense question answering, and language understanding benchmarks. It achieves these gains without extra high-precision channels or rotational matrices, pointing to a practical route for running large models on limited hardware.

Core claim

LBLLM achieves superior W(1+1)A4 quantization performance by initializing via PTQ, then applying layer-wise distillation to binarize weights and quantization parameters with full-precision activations, and finally learning activation quantization factors; the separation of stages mitigates interference and delivers better stability and accuracy than existing binarization techniques while using only 0.016B tokens and a single GPU.

What carries the argument

The three-stage quantization strategy that first initializes with PTQ, then distills binarized weights layer-wise with full-precision activations, and finally trains learnable activation quantization factors.

If this is right

  • Extreme low-bit quantization of LLMs becomes feasible without auxiliary high-precision structures or matrices.
  • Training budgets for effective binarization drop to a fraction of those required by prior methods.
  • Inference on resource-constrained devices improves because the quantized model maintains higher task accuracy.
  • The same decoupled schedule can be applied to other bit-width combinations beyond W(1+1)A4.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The stability gained from decoupling may transfer to mixed-precision or non-uniform quantization schemes in other model families.
  • With such low data needs, the method could support repeated on-device adaptation of already-quantized models.
  • Similar staging might reduce the search space when jointly optimizing quantization and pruning or distillation objectives.

Load-bearing premise

Separating weight binarization from activation quantization into distinct stages is enough to prevent their interference and produce more stable training than joint optimization.

What would settle it

An experiment showing that a single joint-training run for both weight binarization and activation quantization, using the same 0.016B tokens and single GPU, reaches equal or higher accuracy and stability than the three-stage LBLLM pipeline.

Figures

Figures reproduced from arXiv: 2604.19167 by Chuang Wang, Siqing Song, Xu-Yao Zhang, Yi Yang, Yong Lang.

Figure 1
Figure 1. Figure 1: Left: Comparison of post-quantization perplexity between LBLLM and other methods across different [PITH_FULL_IMAGE:figures/full_fig_p002_1.png] view at source ↗
Figure 2
Figure 2. Figure 2: Illustration of the three-stage quantization strategy in LBLLM: Stage 1 uses a binarized PTQ method to [PITH_FULL_IMAGE:figures/full_fig_p003_2.png] view at source ↗
Figure 3
Figure 3. Figure 3: Illustration of the hierarchical distillation [PITH_FULL_IMAGE:figures/full_fig_p005_3.png] view at source ↗
Figure 4
Figure 4. Figure 4: Illustration of activation distribution. The [PITH_FULL_IMAGE:figures/full_fig_p011_4.png] view at source ↗
Figure 5
Figure 5. Figure 5: Layer-wise reconstruction error. Error Coupling between Weights and Activa￾tions We observe that in joint weight and ac￾tivation quantization, quantization errors are not merely additive. Instead, they are continuously 11 [PITH_FULL_IMAGE:figures/full_fig_p011_5.png] view at source ↗
read the original abstract

Deploying large language models (LLMs) in resource-constrained environments is hindered by heavy computational and memory requirements. We present LBLLM, a lightweight binarization framework that achieves effective W(1+1)A4 quantization through a novel three-stage quantization strategy. The framework proceeds as follows: (1) initialize a high-quality quantized model via PTQ; (2) quantize binarized weights, group-wise bitmaps, and quantization parameters through layer-wise distillation while keeping activations in full precision; and (3) training learnable activation quantization factors to dynamically quantize activations to 4 bits. This decoupled design mitigates interference between weight and activation quantization, yielding greater training stability and better inference accuracy. LBLLM, trained only using 0.016B tokens with a single GPU, surpasses existing state-of-the-art binarization methods on W2A4 quantization settings across tasks of language modeling, commonsense QA, and language understanding. These results demonstrate that extreme low-bit quantization of LLMs can be both practical and highly effective without introducing any extra high-precision channels or rotational matrices commonly used in recent PTQ-based works, offering a promising path toward efficient LLM deployment in resource-limited situations.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 0 minor

Summary. The paper introduces LBLLM, a three-stage distillation framework for binarizing LLMs to W(1+1)A4 quantization. Stage 1 uses PTQ initialization, stage 2 performs layer-wise distillation of binarized weights and parameters with full-precision activations, and stage 3 trains learnable activation quantization factors. The central claim is that this decoupled approach, trained on only 0.016B tokens with a single GPU, surpasses prior SOTA binarization methods on language modeling, commonsense QA, and language understanding tasks without extra high-precision channels or rotational matrices.

Significance. If the performance claims are substantiated, the work would be significant for enabling practical extreme quantization of LLMs under severe resource constraints, as the minimal data and compute requirements contrast with typical PTQ methods that rely on larger calibration sets or auxiliary structures. The emphasis on decoupling weight and activation quantization could inform future low-bit training strategies if the stability benefit is demonstrated.

major comments (2)
  1. [Abstract] Abstract: the claim that LBLLM 'surpasses existing state-of-the-art binarization methods on W2A4 quantization settings' supplies no quantitative results, tables, baseline comparisons, or error analysis, so the central empirical claim cannot be evaluated from the manuscript as presented.
  2. [Abstract] Abstract: the assertion that the three-stage design 'mitigates interference between weight and activation quantization, yielding greater training stability and better inference accuracy' than joint approaches is presented without any ablation study, joint-quantization baseline, loss-curve comparison, or stability metric, leaving the necessity of the decoupled pipeline unsupported.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. We address each major comment point by point below and will incorporate revisions to better substantiate the claims in the abstract.

read point-by-point responses
  1. Referee: [Abstract] Abstract: the claim that LBLLM 'surpasses existing state-of-the-art binarization methods on W2A4 quantization settings' supplies no quantitative results, tables, baseline comparisons, or error analysis, so the central empirical claim cannot be evaluated from the manuscript as presented.

    Authors: We agree that the abstract would benefit from including key quantitative highlights to make the central claim immediately evaluable. The full manuscript contains comprehensive experimental results, including tables comparing LBLLM to prior binarization methods (such as BiLLM and others) on language modeling (perplexity), commonsense QA, and language understanding tasks, demonstrating consistent improvements with only 0.016B tokens and a single GPU. We will revise the abstract to incorporate specific metrics, for example noting average accuracy gains on QA tasks and perplexity reductions, while maintaining conciseness. This addresses the concern without misrepresenting the work. revision: yes

  2. Referee: [Abstract] Abstract: the assertion that the three-stage design 'mitigates interference between weight and activation quantization, yielding greater training stability and better inference accuracy' than joint approaches is presented without any ablation study, joint-quantization baseline, loss-curve comparison, or stability metric, leaving the necessity of the decoupled pipeline unsupported.

    Authors: The manuscript grounds this assertion in the design rationale and the empirical outcomes: the decoupled three-stage process enables effective binarization with minimal data and compute, outperforming prior joint quantization methods that often require larger calibration sets or auxiliary structures. However, we acknowledge that a direct ablation would provide stronger support for the stability and interference-mitigation benefits. In the revised version, we will add an ablation study comparing the three-stage approach against a joint weight-activation quantization baseline, including training loss curves and stability metrics to demonstrate the advantages. revision: yes

Circularity Check

0 steps flagged

No circularity: empirical method with no algebraic derivations or self-referential reductions

full rationale

The paper describes an empirical three-stage training procedure for binarization (PTQ initialization, layer-wise weight distillation with full-precision activations, then activation quantization). No equations, fitted parameters, or derivations are presented that reduce to the inputs by construction. Claims of improved stability and accuracy rest on reported benchmark results rather than tautological definitions or self-citation chains. The decoupling premise is a methodological hypothesis tested experimentally, not an algebraic identity. This is a standard non-circular empirical contribution.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

No explicit mathematical axioms, free parameters, or invented entities are described in the abstract; the approach relies on standard post-training quantization and knowledge distillation techniques whose details are not supplied.

pith-pipeline@v0.9.0 · 5524 in / 988 out tokens · 54665 ms · 2026-05-10T02:59:26.089135+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

103 extracted references · 44 canonical work pages · 9 internal anchors

  1. [1]

    Aho and Jeffrey D

    Alfred V. Aho and Jeffrey D. Ullman , title =. 1972

  2. [2]

    Publications Manual , year = "1983", publisher =

  3. [3]

    Chandra and Dexter C

    Ashok K. Chandra and Dexter C. Kozen and Larry J. Stockmeyer , year = "1981", title =. doi:10.1145/322234.322243

  4. [4]

    Scalable training of

    Andrew, Galen and Gao, Jianfeng , booktitle=. Scalable training of

  5. [5]

    Dan Gusfield , title =. 1997

  6. [6]

    Tetreault , title =

    Mohammad Sadegh Rasooli and Joel R. Tetreault , title =. Computing Research Repository , volume =. 2015 , url =

  7. [7]

    A Framework for Learning Predictive Structures from Multiple Tasks and Unlabeled Data , Volume =

    Ando, Rie Kubota and Zhang, Tong , Issn =. A Framework for Learning Predictive Structures from Multiple Tasks and Unlabeled Data , Volume =. Journal of Machine Learning Research , Month = dec, Numpages =

  8. [8]

    Langley , title =

    P. Langley , title =. Proceedings of the 17th International Conference on Machine Learning (ICML 2000) , address =. 2000 , pages =

  9. [9]

    T. M. Mitchell. The Need for Biases in Learning Generalizations. 1980

  10. [10]

    M. J. Kearns , title =

  11. [11]

    Machine Learning: An Artificial Intelligence Approach, Vol. I. 1983

  12. [12]

    R. O. Duda and P. E. Hart and D. G. Stork. Pattern Classification. 2000

  13. [13]

    Suppressed for Anonymity , author=

  14. [14]

    Newell and P

    A. Newell and P. S. Rosenbloom. Mechanisms of Skill Acquisition and the Law of Practice. Cognitive Skills and Their Acquisition. 1981

  15. [15]

    A. L. Samuel. Some Studies in Machine Learning Using the Game of Checkers. IBM Journal of Research and Development. 1959

  16. [16]

    Proceedings of the 41st International Conference on Machine Learning , pages=

    BiLLM: pushing the limit of post-training quantization for LLMs , author=. Proceedings of the 41st International Conference on Machine Learning , pages=

  17. [17]

    int8 (): 8-bit matrix multiplication for transformers at scale , author=

    Gpt3. int8 (): 8-bit matrix multiplication for transformers at scale , author=. Advances in Neural Information Processing Systems , volume=

  18. [18]

    The Thirty-eighth Annual Conference on Neural Information Processing Systems , year=

    QBB: Quantization with Binary Bases for LLMs , author=. The Thirty-eighth Annual Conference on Neural Information Processing Systems , year=

  19. [19]

    arXiv preprint arXiv:2409.17066 , year=

    Vptq: Extreme low-bit vector post-training quantization for large language models , author=. arXiv preprint arXiv:2409.17066 , year=

  20. [20]

    arXiv preprint arXiv:2410.05265 , year=

    Prefixquant: Static quantization beats dynamic through prefixed outliers in llms , author=. arXiv preprint arXiv:2410.05265 , year=

  21. [21]

    Bitnet: Scaling 1-bit transformers for large language models,

    Bitnet: Scaling 1-bit transformers for large language models , author=. arXiv preprint arXiv:2310.11453 , year=

  22. [22]

    8: 4-bit Activations for 1-bit LLMs , author=

    BitNet a4. 8: 4-bit Activations for 1-bit LLMs , author=. arXiv preprint arXiv:2411.04965 , year=

  23. [23]

    Proceedings of the 38th International Conference on Neural Information Processing Systems , year=

    STBLLM: Breaking the 1-Bit Barrier with Structured Binary LLMs , author=. Proceedings of the 38th International Conference on Neural Information Processing Systems , year=

  24. [24]

    Proceedings of the 38th International Conference on Neural Information Processing Systems , year=

    SpinQuant--LLM quantization with learned rotations , author=. Proceedings of the 38th International Conference on Neural Information Processing Systems , year=

  25. [25]

    arXiv preprint arXiv:2408.08554 , year=

    Abq-llm: Arbitrary-bit quantized inference acceleration for large language models , author=. arXiv preprint arXiv:2408.08554 , year=

  26. [26]

    Proceedings of the 38th International Conference on Neural Information Processing Systems , pages=

    QuaRot: outlier-free 4-bit inference in rotated LLMs , author=. Proceedings of the 38th International Conference on Neural Information Processing Systems , pages=

  27. [27]

    Rptq: Reorder-based post- training quantization for large language models.arXiv preprint arXiv:2304.01089,

    Rptq: Reorder-based post-training quantization for large language models , author=. arXiv preprint arXiv:2304.01089 , year=

  28. [28]

    Advances in Neural Information Processing Systems , volume=

    Outlier suppression: Pushing the limit of low-bit transformer language models , author=. Advances in Neural Information Processing Systems , volume=

  29. [29]

    arXiv preprint arXiv:2304.09145 , year=

    Outlier suppression+: Accurate quantization of large language models by equivalent and optimal shifting and scaling , author=. arXiv preprint arXiv:2304.09145 , year=

  30. [30]

    Proceedings of Machine Learning and Systems , volume=

    Atom: Low-bit quantization for efficient and accurate llm serving , author=. Proceedings of Machine Learning and Systems , volume=

  31. [31]

    GPTVQ: The Blessing of Dimensionality for LLM Quantization,

    Gptvq: The blessing of dimensionality for llm quantization , author=. arXiv preprint arXiv:2402.15319 , year=

  32. [32]

    The Thirty-eighth Annual Conference on Neural Information Processing Systems , year=

    Duquant: Distributing outliers via dual transformation makes stronger quantized llms , author=. The Thirty-eighth Annual Conference on Neural Information Processing Systems , year=

  33. [33]

    arXiv preprint arXiv:2310.16836 , year=

    Llm-fp4: 4-bit floating-point quantized transformers , author=. arXiv preprint arXiv:2310.16836 , year=

  34. [34]

    arXiv preprint arXiv:2407.07093 , year=

    Fbi-llm: Scaling up fully binarized llms from scratch via autoregressive distillation , author=. arXiv preprint arXiv:2407.07093 , year=

  35. [35]

    Advances in Neural Information Processing Systems , volume=

    Qlora: Efficient finetuning of quantized llms , author=. Advances in Neural Information Processing Systems , volume=

  36. [36]

    The Eleventh International Conference on Learning Representations , year=

    OPTQ: Accurate quantization for generative pre-trained transformers , author=. The Eleventh International Conference on Learning Representations , year=

  37. [37]

    I-llm: Efficient integer-only inference for fully- quantized low-bit large language models.arXiv preprint arXiv:2405.17849,

    I-llm: Efficient integer-only inference for fully-quantized low-bit large language models , author=. arXiv preprint arXiv:2405.17849 , year=

  38. [38]

    Proceedings of the AAAI Conference on Artificial Intelligence , volume=

    Norm tweaking: High-performance low-bit quantization of large language models , author=. Proceedings of the AAAI Conference on Artificial Intelligence , volume=

  39. [39]

    Extreme compression of large language models via additive quantization,

    Extreme compression of large language models via additive quantization , author=. arXiv preprint arXiv:2401.06118 , year=

  40. [40]

    arXiv preprint arXiv:2405.14917 , year=

    SliM-LLM: Salience-Driven Mixed-Precision Quantization for Large Language Models , author=. arXiv preprint arXiv:2405.14917 , year=

  41. [41]

    Advances in Neural Information Processing Systems , volume=

    Understanding neural network binarization with forward and backward proximal quantizers , author=. Advances in Neural Information Processing Systems , volume=

  42. [42]

    arXiv preprint arXiv:2405.16339 , year=

    BOLD: Boolean Logic Deep Learning , author=. arXiv preprint arXiv:2405.16339 , year=

  43. [43]

    and Bengio, Y

    Binarized neural networks: Training neural networks with weights and activations constrained to+ 1 or- 1. arXiv , author=. arXiv preprint arXiv:1602.02830 , year=

  44. [44]

    Proceedings of Machine Learning and Systems , pages=

    AWQ: Activation-aware Weight Quantization for On-Device LLM Compression and Acceleration , author=. Proceedings of Machine Learning and Systems , pages=

  45. [45]

    Bibert: Accurate fully binarized bert,

    Bibert: Accurate fully binarized bert , author=. arXiv preprint arXiv:2203.06390 , year=

  46. [46]

    arXiv preprint arXiv:2402.11960 , year=

    DB-LLM: Accurate dual-binarization for efficient LLMs , author=. arXiv preprint arXiv:2402.11960 , year=

  47. [47]

    Pb-llm: Partially binarized large language models,

    Pb-llm: Partially binarized large language models , author=. arXiv preprint arXiv:2310.00034 , year=

  48. [48]

    GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers

    Gptq: Accurate post-training quantization for generative pre-trained transformers , author=. arXiv preprint arXiv:2210.17323 , year=

  49. [49]

    Findings of the Association for Computational Linguistics: ACL 2024 , pages=

    Llm-qat: Data-free quantization aware training for large language models , author=. Findings of the Association for Computational Linguistics: ACL 2024 , pages=

  50. [50]

    Advances in Neural Information Processing Systems , volume=

    Onebit: Towards extremely low-bit large language models , author=. Advances in Neural Information Processing Systems , volume=

  51. [51]

    Advances in Neural Information Processing Systems , year=

    Attention is all you need , author=. Advances in Neural Information Processing Systems , year=

  52. [52]

    LLaMA: Open and Efficient Foundation Language Models

    Llama: Open and efficient foundation language models , author=. arXiv preprint arXiv:2302.13971 , year=

  53. [53]

    Llama 2: Open Foundation and Fine-Tuned Chat Models

    Llama 2: Open foundation and fine-tuned chat models , author=. arXiv preprint arXiv:2307.09288 , year=

  54. [54]

    See https://vicuna

    Vicuna: An open-source chatbot impressing gpt-4 with 90\ author=. See https://vicuna. lmsys. org (accessed 14 April 2023) , volume=

  55. [55]

    Proceedings of the 30th International Conference on Neural Information Processing Systems , year=

    Pointer sentinel mixture models , author=. Proceedings of the 30th International Conference on Neural Information Processing Systems , year=

  56. [56]

    Human Language Technology: Proceedings of a Workshop held at Plainsboro, New Jersey, March 8-11, 1994 , year=

    The penn treebank: Annotating predicate argument structure , author=. Human Language Technology: Proceedings of a Workshop held at Plainsboro, New Jersey, March 8-11, 1994 , year=

  57. [57]

    Journal of machine learning research , volume=

    Exploring the limits of transfer learning with a unified text-to-text transformer , author=. Journal of machine learning research , volume=

  58. [58]

    Proceedings of the AAAI conference on artificial intelligence , volume=

    Piqa: Reasoning about physical commonsense in natural language , author=. Proceedings of the AAAI conference on artificial intelligence , volume=

  59. [59]

    Proceedings of NAACL-HLT , pages=

    BoolQ: Exploring the Surprising Difficulty of Natural Yes/No Questions , author=. Proceedings of NAACL-HLT , pages=

  60. [60]

    Communications of the ACM , volume=

    Winogrande: An adversarial winograd schema challenge at scale , author=. Communications of the ACM , volume=. 2021 , publisher=

  61. [61]

    HellaSwag: Can a Machine Really Finish Your Sentence?

    Hellaswag: Can a machine really finish your sentence? , author=. arXiv preprint arXiv:1905.07830 , year=

  62. [62]

    Think you have Solved Question Answering? Try ARC, the AI2 Reasoning Challenge

    Think you have solved question answering? try arc, the ai2 reasoning challenge , author=. arXiv preprint arXiv:1803.05457 , year=

  63. [63]

    International Conference on Machine Learning , pages=

    Smoothquant: Accurate and efficient post-training quantization for large language models , author=. International Conference on Machine Learning , pages=. 2023 , organization=

  64. [64]

    Advances in neural information processing systems , volume=

    Pytorch: An imperative style, high-performance deep learning library , author=. Advances in neural information processing systems , volume=

  65. [65]

    arXiv e-prints , pages=

    How good are low-bit quantized llama3 models? an empirical study , author=. arXiv e-prints , pages=

  66. [66]

    Spqr: A sparse-quantized rep- resentation for near-lossless llm weight compression.arXiv preprint arXiv:2306.03078, 2023

    Spqr: A sparse-quantized representation for near-lossless llm weight compression , author=. arXiv preprint arXiv:2306.03078 , year=

  67. [67]

    A white paper on neural network quantization.arXiv preprint arXiv:2106.08295,

    A white paper on neural network quantization , author=. arXiv preprint arXiv:2106.08295 , year=

  68. [68]

    Second order derivatives for network pruning: Optimal Brain Surgeon , url =

    Hassibi, Babak and Stork, David , booktitle =. Second order derivatives for network pruning: Optimal Brain Surgeon , url =

  69. [69]

    Advances in Neural Information Processing Systems , volume=

    Optimal brain compression: A framework for accurate post-training quantization and pruning , author=. Advances in Neural Information Processing Systems , volume=

  70. [70]

    arXiv preprint arXiv:2409.01179 , year=

    Recoverable compression: A multimodal vision token recovery mechanism guided by text information , author=. arXiv preprint arXiv:2409.01179 , year=

  71. [71]

    arXiv preprint arXiv:2409.01162 , year=

    Balancing performance and efficiency: A multimodal large language model pruning method based image text interaction , author=. arXiv preprint arXiv:2409.01162 , year=

  72. [72]

    Measuring Massive Multitask Language Understanding

    Measuring massive multitask language understanding , author=. arXiv preprint arXiv:2009.03300 , year=

  73. [73]

    Proceedings of the 38th International Conference on Neural Information Processing Systems , year=

    Cbq: Cross-block quantization for large language models , author=. Proceedings of the 38th International Conference on Neural Information Processing Systems , year=

  74. [74]

    Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) , pages=

    Bitdistiller: Unleashing the potential of sub-4-bit llms via self-distillation , author=. Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) , pages=

  75. [75]

    E fficient QAT : Efficient Quantization-Aware Training for Large Language Models

    Chen, Mengzhao and Shao, Wenqi and Xu, Peng and Wang, Jiahao and Gao, Peng and Zhang, Kaipeng and Luo, Ping. E fficient QAT : Efficient Quantization-Aware Training for Large Language Models. Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics. 2025

  76. [76]

    Proceedings of the 30th International Conference on Neural Information Processing Systems , year=

    Omniquant: Omnidirectionally calibrated quantization for large language models , author=. Proceedings of the 30th International Conference on Neural Information Processing Systems , year=

  77. [77]

    Findings of the Association for Computational Linguistics: ACL 2025 , pages=

    Achieving binary weight and activation for LLMs using Post-Training Quantization , author=. Findings of the Association for Computational Linguistics: ACL 2025 , pages=

  78. [78]

    RedPajama: an Open Dataset for Training Large Language Models , author =

  79. [79]

    2016 , eprint=

    XNOR-Net: ImageNet Classification Using Binary Convolutional Neural Networks , author=. 2016 , eprint=

  80. [80]

    2023 , eprint=

    BiViT: Extremely Compressed Binary Vision Transformer , author=. 2023 , eprint=

Showing first 80 references.