LBLLM: Lightweight Binarization of Large Language Models via Three-Stage Distillation
Pith reviewed 2026-05-10 02:59 UTC · model grok-4.3
The pith
A three-stage distillation process enables effective binarization of large language models to W(1+1)A4 precision using only 0.016B (16 million) tokens on a single GPU.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
LBLLM achieves superior W(1+1)A4 quantization performance by initializing via PTQ, then applying layer-wise distillation to binarize weights and quantization parameters with full-precision activations, and finally learning activation quantization factors; the separation of stages mitigates interference and delivers better stability and accuracy than existing binarization techniques while using only 0.016B tokens and a single GPU.
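The 1-bit weight component of this claim can be made concrete. A standard binarizer (XNOR-Net style; the abstract does not state which estimator LBLLM's PTQ initialization actually uses, so this is an assumption) approximates each weight group as α·sign(w), where α = mean(|w|) is the closed-form scale minimizing L2 reconstruction error for a fixed sign pattern:

```python
import numpy as np

def binarize_group(w):
    """Binarize one weight group as alpha * sign(w).
    alpha = mean(|w|) is the L2-optimal scale for a fixed
    sign pattern (XNOR-Net-style estimator; LBLLM's actual
    choice is an assumption here, not stated in the abstract)."""
    alpha = np.abs(w).mean()
    return alpha * np.sign(w), alpha

w = np.array([0.4, -0.2, 0.1, -0.3])
w_bin, alpha = binarize_group(w)
# alpha = (0.4 + 0.2 + 0.1 + 0.3) / 4 = 0.25
```

Stage 2 of the pipeline would then distill these binarized groups, together with their bitmaps and scales, against the full-precision teacher layer by layer.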
What carries the argument
The three-stage quantization strategy that first initializes with PTQ, then distills binarized weights layer-wise with full-precision activations, and finally trains learnable activation quantization factors.
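Read as a training schedule, the decoupling amounts to freezing different parameter groups in each stage, so weight and activation quantization parameters never receive gradients at the same time. A minimal sketch, with group names inferred from the abstract rather than taken from the paper:

```python
def trainable_params(stage):
    """Parameter groups each LBLLM stage updates, as inferred
    from the abstract; names are illustrative, not the paper's.
    Stage 1 is pure PTQ initialization, so nothing is trained
    by gradient descent."""
    schedule = {
        1: [],
        2: ["binary_weights", "group_bitmaps", "weight_quant_params"],
        3: ["activation_scale_factors"],
    }
    return schedule[stage]

# Weight-side and activation-side parameters never share a stage,
# which is the claimed source of the training stability.
```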
If this is right
- Extreme low-bit quantization of LLMs becomes feasible without auxiliary high-precision structures or matrices.
- Training budgets for effective binarization drop to a fraction of those required by prior methods.
- Inference on resource-constrained devices improves because the quantized model maintains higher task accuracy.
- The same decoupled schedule can be applied to other bit-width combinations beyond W(1+1)A4.
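The "(1+1)" in W(1+1)A4 suggests two bits of per-weight storage (one sign bit plus one group-wise bitmap bit), which is consistent with the abstract comparing against W2A4 settings. A back-of-envelope storage cost, where the group size and scale precision are assumptions:

```python
def weight_bits_per_param(sign_bits=1, bitmap_bits=1,
                          group_size=128, scale_bits=16):
    """Approximate storage per weight for a W(1+1) layout:
    one sign bit and one bitmap bit per weight, plus one
    higher-precision scale amortized over each group.
    group_size=128 and scale_bits=16 are assumptions, not
    values stated in the abstract."""
    return sign_bits + bitmap_bits + scale_bits / group_size

bits = weight_bits_per_param()
# 1 + 1 + 16/128 = 2.125 bits per weight, roughly 7.5x
# smaller than FP16 storage
```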
Where Pith is reading between the lines
- The stability gained from decoupling may transfer to mixed-precision or non-uniform quantization schemes in other model families.
- With such low data needs, the method could support repeated on-device adaptation of already-quantized models.
- Similar staging might reduce the search space when jointly optimizing quantization and pruning or distillation objectives.
Load-bearing premise
Separating weight binarization from activation quantization into distinct stages is enough to prevent their interference and produce more stable training than joint optimization.
What would settle it
An experiment showing that a single joint-training run for both weight binarization and activation quantization, using the same 0.016B tokens and single GPU, reaches equal or higher accuracy and stability than the three-stage LBLLM pipeline.
Original abstract
Deploying large language models (LLMs) in resource-constrained environments is hindered by heavy computational and memory requirements. We present LBLLM, a lightweight binarization framework that achieves effective W(1+1)A4 quantization through a novel three-stage quantization strategy. The framework proceeds as follows: (1) initialize a high-quality quantized model via PTQ; (2) learn binarized weights, group-wise bitmaps, and quantization parameters through layer-wise distillation while keeping activations in full precision; and (3) train learnable activation quantization factors to dynamically quantize activations to 4 bits. This decoupled design mitigates interference between weight and activation quantization, yielding greater training stability and better inference accuracy. Trained using only 0.016B tokens on a single GPU, LBLLM surpasses existing state-of-the-art binarization methods in W2A4 quantization settings across language modeling, commonsense QA, and language understanding tasks. These results demonstrate that extreme low-bit quantization of LLMs can be both practical and highly effective without introducing the extra high-precision channels or rotational matrices common in recent PTQ-based works, offering a promising path toward efficient LLM deployment in resource-limited settings.
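Stage 3's learnable activation factors can be pictured as a 4-bit fake quantizer with a trainable scale. The symmetric, per-tensor layout below is an assumption; the abstract says only that the factors are learned and applied dynamically:

```python
import numpy as np

def quantize_act_4bit(x, scale):
    """Fake-quantize activations to 4 bits with a learnable scale.
    Integer levels span [-8, 7]; the symmetric per-tensor layout
    is an assumption, not a detail stated in the abstract."""
    q = np.clip(np.round(x / scale), -8, 7)
    return q * scale

x = np.array([0.05, -0.7, 1.3, 2.0])
y = quantize_act_4bit(x, scale=0.25)
# values saturate at 7 * 0.25 = 1.75, so 2.0 clips to 1.75
```

During stage 3 the `scale` parameter would be trained (typically through a straight-through estimator for the rounding step) while the binarized weights from stage 2 stay fixed.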
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces LBLLM, a three-stage distillation framework for binarizing LLMs to W(1+1)A4 quantization. Stage 1 uses PTQ initialization, stage 2 performs layer-wise distillation of binarized weights and parameters with full-precision activations, and stage 3 trains learnable activation quantization factors. The central claim is that this decoupled approach, trained on only 0.016B tokens with a single GPU, surpasses prior SOTA binarization methods on language modeling, commonsense QA, and language understanding tasks without extra high-precision channels or rotational matrices.
Significance. If the performance claims are substantiated, the work would be significant for enabling practical extreme quantization of LLMs under severe resource constraints, as the minimal data and compute requirements contrast with typical PTQ methods that rely on larger calibration sets or auxiliary structures. The emphasis on decoupling weight and activation quantization could inform future low-bit training strategies if the stability benefit is demonstrated.
major comments (2)
- [Abstract] The claim that LBLLM 'surpasses existing state-of-the-art binarization methods on W2A4 quantization settings' is backed by no quantitative results, tables, baseline comparisons, or error analysis, so the central empirical claim cannot be evaluated from the manuscript as presented.
- [Abstract] The assertion that the three-stage design 'mitigates interference between weight and activation quantization, yielding greater training stability and better inference accuracy' relative to joint approaches is presented without any ablation study, joint-quantization baseline, loss-curve comparison, or stability metric, leaving the necessity of the decoupled pipeline unsupported.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback on our manuscript. We address each major comment point by point below and will incorporate revisions to better substantiate the claims in the abstract.
Point-by-point responses
- Referee: [Abstract] The claim that LBLLM 'surpasses existing state-of-the-art binarization methods on W2A4 quantization settings' is backed by no quantitative results, tables, baseline comparisons, or error analysis, so the central empirical claim cannot be evaluated from the manuscript as presented.
Authors: We agree that the abstract would benefit from key quantitative highlights so that the central claim is immediately evaluable. The full manuscript contains comprehensive experimental results, including tables comparing LBLLM to prior binarization methods (such as BiLLM) on language modeling (perplexity), commonsense QA, and language understanding tasks, demonstrating consistent improvements with only 0.016B tokens and a single GPU. We will revise the abstract to cite specific metrics, for example average accuracy gains on QA tasks and perplexity reductions, while maintaining conciseness. revision: yes
- Referee: [Abstract] The assertion that the three-stage design 'mitigates interference between weight and activation quantization, yielding greater training stability and better inference accuracy' relative to joint approaches is presented without any ablation study, joint-quantization baseline, loss-curve comparison, or stability metric, leaving the necessity of the decoupled pipeline unsupported.
Authors: The manuscript grounds this assertion in the design rationale and the empirical outcomes: the decoupled three-stage process enables effective binarization with minimal data and compute, outperforming prior joint quantization methods that often require larger calibration sets or auxiliary structures. However, we acknowledge that a direct ablation would provide stronger support for the stability and interference-mitigation benefits. In the revised version, we will add an ablation study comparing the three-stage approach against a joint weight-activation quantization baseline, including training loss curves and stability metrics to demonstrate the advantages. revision: yes
Circularity Check
No circularity: empirical method with no algebraic derivations or self-referential reductions
Full rationale
The paper describes an empirical three-stage training procedure for binarization (PTQ initialization, layer-wise weight distillation with full-precision activations, then activation quantization). No equations, fitted parameters, or derivations are presented that reduce to the inputs by construction. Claims of improved stability and accuracy rest on reported benchmark results rather than tautological definitions or self-citation chains. The decoupling premise is a methodological hypothesis tested experimentally, not an algebraic identity. This is a standard non-circular empirical contribution.
Reference graph
Works this paper leans on
- [1] Alfred V. Aho and Jeffrey D. Ullman. 1972.
- [2] Publications Manual. 1983.
- [3] Ashok K. Chandra, Dexter C. Kozen, and Larry J. Stockmeyer. 1981. doi:10.1145/322234.322243
- [4] Galen Andrew and Jianfeng Gao. Scalable training of.
- [5] Dan Gusfield. 1997.
- [6] Mohammad Sadegh Rasooli and Joel R. Tetreault. 2015. Computing Research Repository.
- [7] Rie Kubota Ando and Tong Zhang. A Framework for Learning Predictive Structures from Multiple Tasks and Unlabeled Data. Journal of Machine Learning Research.
- [8] P. Langley. 2000. Proceedings of the 17th International Conference on Machine Learning (ICML 2000).
- [9] T. M. Mitchell. 1980. The Need for Biases in Learning Generalizations.
- [10] M. J. Kearns.
- [11] Machine Learning: An Artificial Intelligence Approach, Vol. I. 1983.
- [12] R. O. Duda, P. E. Hart, and D. G. Stork. 2000. Pattern Classification.
- [13] Suppressed for Anonymity.
- [14] A. Newell and P. S. Rosenbloom. 1981. Mechanisms of Skill Acquisition and the Law of Practice. In Cognitive Skills and Their Acquisition.
- [15] A. L. Samuel. 1959. Some Studies in Machine Learning Using the Game of Checkers. IBM Journal of Research and Development.
- [16] BiLLM: Pushing the Limit of Post-Training Quantization for LLMs. Proceedings of the 41st International Conference on Machine Learning.
- [17] GPT3.int8(): 8-bit Matrix Multiplication for Transformers at Scale. Advances in Neural Information Processing Systems.
- [18] QBB: Quantization with Binary Bases for LLMs. The Thirty-Eighth Annual Conference on Neural Information Processing Systems.
- [19] VPTQ: Extreme Low-Bit Vector Post-Training Quantization for Large Language Models. arXiv preprint arXiv:2409.17066.
- [20] PrefixQuant: Static Quantization Beats Dynamic through Prefixed Outliers in LLMs. arXiv preprint arXiv:2410.05265.
- [21] BitNet: Scaling 1-bit Transformers for Large Language Models. arXiv preprint arXiv:2310.11453.
- [22] BitNet a4.8: 4-bit Activations for 1-bit LLMs. arXiv preprint arXiv:2411.04965.
- [23] STBLLM: Breaking the 1-Bit Barrier with Structured Binary LLMs. Proceedings of the 38th International Conference on Neural Information Processing Systems.
- [24] SpinQuant: LLM Quantization with Learned Rotations. Proceedings of the 38th International Conference on Neural Information Processing Systems.
- [25] ABQ-LLM: Arbitrary-Bit Quantized Inference Acceleration for Large Language Models. arXiv preprint arXiv:2408.08554.
- [26] QuaRot: Outlier-Free 4-Bit Inference in Rotated LLMs. Proceedings of the 38th International Conference on Neural Information Processing Systems.
- [27] RPTQ: Reorder-Based Post-Training Quantization for Large Language Models. arXiv preprint arXiv:2304.01089.
- [28] Outlier Suppression: Pushing the Limit of Low-Bit Transformer Language Models. Advances in Neural Information Processing Systems.
- [29] Outlier Suppression+: Accurate Quantization of Large Language Models by Equivalent and Optimal Shifting and Scaling. arXiv preprint arXiv:2304.09145.
- [30] Atom: Low-Bit Quantization for Efficient and Accurate LLM Serving. Proceedings of Machine Learning and Systems.
- [31] GPTVQ: The Blessing of Dimensionality for LLM Quantization. arXiv preprint arXiv:2402.15319.
- [32] DuQuant: Distributing Outliers via Dual Transformation Makes Stronger Quantized LLMs. The Thirty-Eighth Annual Conference on Neural Information Processing Systems.
- [33] LLM-FP4: 4-Bit Floating-Point Quantized Transformers. arXiv preprint arXiv:2310.16836.
- [34] FBI-LLM: Scaling Up Fully Binarized LLMs from Scratch via Autoregressive Distillation. arXiv preprint arXiv:2407.07093.
- [35] QLoRA: Efficient Finetuning of Quantized LLMs. Advances in Neural Information Processing Systems.
- [36] OPTQ: Accurate Quantization for Generative Pre-trained Transformers. The Eleventh International Conference on Learning Representations.
- [37] I-LLM: Efficient Integer-Only Inference for Fully-Quantized Low-Bit Large Language Models. arXiv preprint arXiv:2405.17849.
- [38] Norm Tweaking: High-Performance Low-Bit Quantization of Large Language Models. Proceedings of the AAAI Conference on Artificial Intelligence.
- [39] Extreme Compression of Large Language Models via Additive Quantization. arXiv preprint arXiv:2401.06118.
- [40] SliM-LLM: Salience-Driven Mixed-Precision Quantization for Large Language Models. arXiv preprint arXiv:2405.14917.
- [41] Understanding Neural Network Binarization with Forward and Backward Proximal Quantizers. Advances in Neural Information Processing Systems.
- [42] BOLD: Boolean Logic Deep Learning. arXiv preprint arXiv:2405.16339.
- [43] Binarized Neural Networks: Training Neural Networks with Weights and Activations Constrained to +1 or -1. arXiv preprint arXiv:1602.02830.
- [44] AWQ: Activation-Aware Weight Quantization for On-Device LLM Compression and Acceleration. Proceedings of Machine Learning and Systems.
- [45] BiBERT: Accurate Fully Binarized BERT. arXiv preprint arXiv:2203.06390.
- [46] DB-LLM: Accurate Dual-Binarization for Efficient LLMs. arXiv preprint arXiv:2402.11960.
- [47] PB-LLM: Partially Binarized Large Language Models. arXiv preprint arXiv:2310.00034.
- [48] GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers. arXiv preprint arXiv:2210.17323.
- [49] LLM-QAT: Data-Free Quantization Aware Training for Large Language Models. Findings of the Association for Computational Linguistics: ACL 2024.
- [50] OneBit: Towards Extremely Low-Bit Large Language Models. Advances in Neural Information Processing Systems.
- [51] Attention Is All You Need. Advances in Neural Information Processing Systems.
- [52] LLaMA: Open and Efficient Foundation Language Models. arXiv preprint arXiv:2302.13971.
- [53] Llama 2: Open Foundation and Fine-Tuned Chat Models. arXiv preprint arXiv:2307.09288.
- [54] Vicuna: An Open-Source Chatbot Impressing GPT-4 with 90%. See https://vicuna.lmsys.org (accessed 14 April 2023).
- [55] Pointer Sentinel Mixture Models. Proceedings of the 30th International Conference on Neural Information Processing Systems.
- [56] The Penn Treebank: Annotating Predicate Argument Structure. Human Language Technology: Proceedings of a Workshop held at Plainsboro, New Jersey, March 8-11, 1994.
- [57] Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer. Journal of Machine Learning Research.
- [58] PIQA: Reasoning about Physical Commonsense in Natural Language. Proceedings of the AAAI Conference on Artificial Intelligence.
- [59] BoolQ: Exploring the Surprising Difficulty of Natural Yes/No Questions. Proceedings of NAACL-HLT.
- [60] WinoGrande: An Adversarial Winograd Schema Challenge at Scale. Communications of the ACM. 2021.
- [61] HellaSwag: Can a Machine Really Finish Your Sentence? arXiv preprint arXiv:1905.07830.
- [62] Think You Have Solved Question Answering? Try ARC, the AI2 Reasoning Challenge. arXiv preprint arXiv:1803.05457.
- [63] SmoothQuant: Accurate and Efficient Post-Training Quantization for Large Language Models. International Conference on Machine Learning. 2023.
- [64] PyTorch: An Imperative Style, High-Performance Deep Learning Library. Advances in Neural Information Processing Systems.
- [65] How Good Are Low-Bit Quantized LLaMA3 Models? An Empirical Study. arXiv e-prints.
- [66] SpQR: A Sparse-Quantized Representation for Near-Lossless LLM Weight Compression. arXiv preprint arXiv:2306.03078.
- [67] A White Paper on Neural Network Quantization. arXiv preprint arXiv:2106.08295.
- [68] Babak Hassibi and David Stork. Second Order Derivatives for Network Pruning: Optimal Brain Surgeon.
- [69] Optimal Brain Compression: A Framework for Accurate Post-Training Quantization and Pruning. Advances in Neural Information Processing Systems.
- [70] Recoverable Compression: A Multimodal Vision Token Recovery Mechanism Guided by Text Information. arXiv preprint arXiv:2409.01179.
- [71] Balancing Performance and Efficiency: A Multimodal Large Language Model Pruning Method Based on Image-Text Interaction. arXiv preprint arXiv:2409.01162.
- [72] Measuring Massive Multitask Language Understanding. arXiv preprint arXiv:2009.03300.
- [73] CBQ: Cross-Block Quantization for Large Language Models. Proceedings of the 38th International Conference on Neural Information Processing Systems.
- [74] BitDistiller: Unleashing the Potential of Sub-4-Bit LLMs via Self-Distillation. Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers).
- [75] Mengzhao Chen, Wenqi Shao, Peng Xu, Jiahao Wang, Peng Gao, Kaipeng Zhang, and Ping Luo. 2025. EfficientQAT: Efficient Quantization-Aware Training for Large Language Models. Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics.
- [76] OmniQuant: Omnidirectionally Calibrated Quantization for Large Language Models. Proceedings of the 30th International Conference on Neural Information Processing Systems.
- [77] Achieving Binary Weight and Activation for LLMs Using Post-Training Quantization. Findings of the Association for Computational Linguistics: ACL 2025.
- [78] RedPajama: An Open Dataset for Training Large Language Models.
- [79] XNOR-Net: ImageNet Classification Using Binary Convolutional Neural Networks. 2016.
- [80] BiViT: Extremely Compressed Binary Vision Transformer. 2023.