pith. sign in

arxiv: 2508.06974 · v2 · pith:42K3IEQFnew · submitted 2025-08-09 · 💻 cs.CL

Rethinking 1-bit Optimization Leveraging Pre-trained Large Language Models

Pith reviewed 2026-05-21 23:40 UTC · model grok-4.3

classification 💻 cs.CL
keywords 1-bit quantizationlarge language modelsprogressive trainingmodel compressionbinary weightspre-trained modelsquantization-aware trainingefficient inference
0
0 comments X

The pith

Pre-trained large language models can be adapted into high-performance 1-bit versions using progressive training instead of training from scratch.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper aims to show that effective 1-bit quantized LLMs can be obtained by adapting existing pre-trained full-precision models rather than building binarized models from scratch. It addresses the accuracy loss and high costs of prior methods by using consistent progressive training to gradually close the gap between full-precision and 1-bit weights during both forward and backward passes. Supporting steps include binary-aware initialization and dual-scaling compensation to ease the conversion. If the approach holds, it would let practitioners turn current pre-trained models into storage-efficient and compute-light versions without repeating expensive initial training. Results across model sizes indicate better performance than existing 1-bit techniques.

Core claim

We identify that the large gap between full precision and 1-bit representations makes naive adaptation difficult. In this paper, we introduce a consistent progressive training for both forward and backward, smoothly converting the full-precision weights into the binarized ones. Additionally, we incorporate binary-aware initialization and dual-scaling compensation to reduce the difficulty of progressive training and improve the performance. Experimental results on LLMs of various sizes demonstrate that our method outperforms existing approaches. Our results show that high-performance 1-bit LLMs can be achieved using pre-trained models, eliminating the need for expensive training from scratch.

What carries the argument

Consistent progressive training applied to both forward and backward passes, together with binary-aware initialization and dual-scaling compensation, that gradually converts full-precision weights to binary form.

If this is right

  • High-performance 1-bit LLMs become reachable directly from pre-trained checkpoints.
  • Training costs drop because full from-scratch binarized training is no longer required.
  • Accuracy degradation typical in 1-bit quantization is reduced across models of different sizes.
  • Storage and compute savings are realized while retaining competitive task performance.
  • Existing pre-trained assets can be reused for efficient quantized deployment.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same gradual-adaptation idea could be tested on other low-bit or mixed-precision schemes beyond strict 1-bit.
  • Practitioners might combine this conversion step with existing fine-tuning pipelines to produce task-specific 1-bit models more quickly.
  • If the progressive schedule proves robust, it may encourage re-examination of other large representation-gap problems in model compression.
  • Deployment on edge hardware could become more routine once pre-trained models are routinely turned into 1-bit versions.

Load-bearing premise

The large gap between full-precision and 1-bit representations can be bridged smoothly by consistent progressive training in forward and backward passes without irrecoverable accuracy loss.

What would settle it

If side-by-side tests on standard LLM benchmarks show that the resulting 1-bit models still lose substantial accuracy relative to full-precision versions or to from-scratch 1-bit baselines, the claim of successful smooth conversion would not hold.

Figures

Figures reproduced from arXiv: 2508.06974 by Chuanjian Liu, Hanting Chen, Jian Li, Jie Hu, Siqi Liu, Yuanyuan Xi, Yunhe Wang, Zhijun Tu.

Figure 1
Figure 1. Figure 1: An overview of the BinaryLLM framework. We apply 1-bit quantization to all the linear [PITH_FULL_IMAGE:figures/full_fig_p002_1.png] view at source ↗
Figure 2
Figure 2. Figure 2: Quantization error and initial loss. where L represents the loss function. In a linear layer, the floating-point multiplications are replaced by efficient bit-wise operations, and memory storage can be reduced by up to 16× compared to FP16 precision. This is particularly beneficial for reducing the inference cost of LLMs. Training large language models (LLMs) from scratch is highly resource-intensive, requ… view at source ↗
Figure 3
Figure 3. Figure 3: Comparison of different progressive training in binary neural networks. [PITH_FULL_IMAGE:figures/full_fig_p005_3.png] view at source ↗
Figure 4
Figure 4. Figure 4: Latency with different threads. initialization using the LLaMA-1B model. As shown in [PITH_FULL_IMAGE:figures/full_fig_p012_4.png] view at source ↗
Figure 5
Figure 5. Figure 5: Comparison of initial perplexity and loss. [PITH_FULL_IMAGE:figures/full_fig_p012_5.png] view at source ↗
Figure 6
Figure 6. Figure 6: Different progressive scheduler on t. The top and bottom of each column represents the function of t and the corresponding progressive approximation curves. We also list the perplexity and zero-shot accuracy of final 1-bit models. [2] Stella Biderman, Hailey Schoelkopf, Quentin Gregory Anthony, Herbie Bradley, Kyle O’Brien, Eric Hallahan, Mohammad Aflah Khan, Shivanshu Purohit, USVSN Sai Prashanth, Edward … view at source ↗
Figure 7
Figure 7. Figure 7: Weights distribution of different training chunks. [PITH_FULL_IMAGE:figures/full_fig_p014_7.png] view at source ↗
read the original abstract

1-bit LLM quantization offers significant advantages in reducing storage and computational costs. However, existing methods typically train 1-bit LLMs from scratch, failing to fully leverage pre-trained models. This results in high training costs and notable accuracy degradation. We identify that the large gap between full precision and 1-bit representations makes naive adaptation difficult. In this paper, we introduce a consistent progressive training for both forward and backward, smoothly converting the full-precision weights into the binarized ones. Additionally, we incorporate binary-aware initialization and dual-scaling compensation to reduce the difficulty of progressive training and improve the performance. Experimental results on LLMs of various sizes demonstrate that our method outperforms existing approaches. Our results show that high-performance 1-bit LLMs can be achieved using pre-trained models, eliminating the need for expensive training from scratch.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper claims that existing 1-bit LLM quantization methods require training from scratch and suffer from accuracy degradation due to the large gap between full-precision and binarized representations. It proposes consistent progressive training applied to both forward and backward passes, combined with binary-aware initialization and dual-scaling compensation, to smoothly convert pre-trained full-precision weights into 1-bit models. Experiments on LLMs of various sizes are reported to show outperformance over prior approaches, supporting the conclusion that high-performance 1-bit LLMs can be obtained from pre-trained checkpoints without expensive from-scratch training.

Significance. If the empirical results hold with proper controls, the work would be significant for reducing the training cost of 1-bit LLMs by reusing pre-trained models rather than starting from random initialization. This could lower barriers to deploying efficient quantized models. The approach is presented as an empirical training procedure rather than a parameter-free derivation or machine-checked proof.

major comments (2)
  1. [Abstract and method description] The central claim that consistent progressive training (forward and backward) plus binary-aware init and dual-scaling bridges the representation gap without irrecoverable loss is load-bearing, yet the manuscript provides no ablation studies or intermediate training dynamics that isolate the progressive schedule's contribution. Without these, performance gains could be attributable to the pre-trained starting point alone rather than the proposed techniques.
  2. [Abstract and experimental results] The abstract states outperformance on LLMs of various sizes, but the provided text contains no error bars, statistical significance tests, or full experimental details (e.g., training hyperparameters, dataset splits, or exact baselines). This undermines verification of the claim that the method eliminates the need for training from scratch.
minor comments (1)
  1. [Method] Notation for the dual-scaling compensation and binary-aware initialization should be defined more explicitly with equations to improve reproducibility.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the thoughtful and constructive feedback on our manuscript. We have carefully reviewed the major comments and provide detailed point-by-point responses below. We agree that certain aspects of the presentation can be strengthened and outline the revisions we will make to address them.

read point-by-point responses
  1. Referee: [Abstract and method description] The central claim that consistent progressive training (forward and backward) plus binary-aware init and dual-scaling bridges the representation gap without irrecoverable loss is load-bearing, yet the manuscript provides no ablation studies or intermediate training dynamics that isolate the progressive schedule's contribution. Without these, performance gains could be attributable to the pre-trained starting point alone rather than the proposed techniques.

    Authors: We appreciate the referee's emphasis on the need for ablations to rigorously isolate the contribution of the consistent progressive training schedule. The manuscript presents the progressive training applied to both forward and backward passes, together with binary-aware initialization and dual-scaling compensation, as the key mechanisms for smoothly bridging the full-precision to 1-bit representation gap. While the current version focuses on the overall empirical outcomes, we acknowledge that dedicated ablation studies and intermediate training dynamics (such as loss or accuracy curves across training steps) are not explicitly included. In the revised manuscript, we will add these elements, including comparisons of the full proposed method against ablated variants that omit the progressive schedule, as well as plots illustrating training dynamics. This will help demonstrate that the performance improvements stem from the proposed techniques rather than the pre-trained initialization alone. revision: yes

  2. Referee: [Abstract and experimental results] The abstract states outperformance on LLMs of various sizes, but the provided text contains no error bars, statistical significance tests, or full experimental details (e.g., training hyperparameters, dataset splits, or exact baselines). This undermines verification of the claim that the method eliminates the need for training from scratch.

    Authors: We thank the referee for highlighting the importance of detailed experimental reporting to support the claims. The manuscript reports that our method outperforms existing approaches on LLMs of various sizes and concludes that high-performance 1-bit models can be obtained from pre-trained checkpoints without from-scratch training. However, we agree that the current text does not include error bars, statistical significance tests, or exhaustive details on training hyperparameters, dataset splits, and exact baseline configurations. In the revision, we will expand the experimental section to incorporate these: reporting means and standard deviations over multiple runs, including statistical significance tests where appropriate, and providing complete specifications of hyperparameters, dataset splits, and baseline implementations. These additions will improve the reproducibility and strengthen the evidence for our central claim. revision: yes

Circularity Check

0 steps flagged

No circularity: empirical training procedure validated on external benchmarks

full rationale

The paper presents an empirical method consisting of consistent progressive training in forward and backward passes, binary-aware initialization, and dual-scaling compensation to adapt pre-trained full-precision LLMs to 1-bit representations. No equations, derivations, or parameter-fitting steps are described that reduce by construction to the method's own inputs or outputs. Performance claims rest on experimental results across LLMs of various sizes, which constitute external validation rather than internal tautologies or self-referential definitions. The derivation chain is therefore self-contained.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The approach rests on the domain assumption that gradual conversion can overcome the representation gap; no free parameters or invented entities are explicitly introduced in the abstract.

axioms (1)
  • domain assumption The large gap between full-precision and 1-bit representations makes naive adaptation difficult and can be addressed by consistent progressive training.
    Stated directly in the abstract as the identified core difficulty.

pith-pipeline@v0.9.0 · 5682 in / 1278 out tokens · 32625 ms · 2026-05-21T23:40:35.185475+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

48 extracted references · 48 canonical work pages · 14 internal anchors

  1. [1]

    Smollm - blazingly fast and remarkably powerful, 2024

    Loubna Ben Allal, Anton Lozhkov, Elie Bakouch, Leandro von Werra, and Thomas Wolf. Smollm - blazingly fast and remarkably powerful, 2024. 12 1 3 5 7 9 11 13 15 chunk 0 5 10 15 20 25 30 35The value of t Uniform Progressive 1 0 1 1.5 1.0 0.5 0.0 0.5 1.0 1.5 PPL=36.9, Zero-shot Acc.=40.5 1 3 5 7 9 11 13 15 chunk 0 5 10 15 20 25 30 35The value of t Logarithm ...

  2. [2]

    Pythia: A suite for analyzing large language models across training and scaling

    Stella Biderman, Hailey Schoelkopf, Quentin Gregory Anthony, Herbie Bradley, Kyle O’Brien, Eric Hallahan, Mohammad Aflah Khan, Shivanshu Purohit, USVSN Sai Prashanth, Edward Raff, et al. Pythia: A suite for analyzing large language models across training and scaling. In International Conference on Machine Learning, pages 2397–2430. PMLR, 2023

  3. [3]

    Piqa: Reasoning about physical common- sense in natural language

    Yonatan Bisk, Rowan Zellers, Jianfeng Gao, Yejin Choi, et al. Piqa: Reasoning about physical common- sense in natural language. In Proceedings of the AAAI conference on artificial intelligence, volume 34, pages 7432–7439, 2020

  4. [4]

    Db-llm: Accurate dual-binarization for efficient llms

    Hong Chen, Chengtao Lv, Liang Ding, Haotong Qin, Xiabin Zhou, Yifu Ding, Xuebo Liu, Min Zhang, Jinyang Guo, Xianglong Liu, et al. Db-llm: Accurate dual-binarization for efficient llms. arXiv preprint arXiv:2402.11960, 2024

  5. [5]

    BoolQ: Exploring the Surprising Difficulty of Natural Yes/No Questions

    Christopher Clark, Kenton Lee, Ming-Wei Chang, Tom Kwiatkowski, Michael Collins, and Kristina Toutanova. Boolq: Exploring the surprising difficulty of natural yes/no questions. ArXiv, abs/1905.10044, 2019

  6. [6]

    Think you have Solved Question Answering? Try ARC, the AI2 Reasoning Challenge

    Peter Clark, Isaac Cowhey, Oren Etzioni, Tushar Khot, Ashish Sabharwal, Carissa Schoenick, and Oyvind Tafjord. Think you have solved question answering? try arc, the ai2 reasoning challenge. ArXiv, abs/1803.05457, 2018

  7. [7]

    Binarized Neural Networks: Training Deep Neural Networks with Weights and Activations Constrained to +1 or -1

    Matthieu Courbariaux, Itay Hubara, Daniel Soudry, Ran El-Yaniv, and Yoshua Bengio. Binarized neural networks: Training deep neural networks with weights and activations constrained to+ 1 or-1. arXiv preprint arXiv:1602.02830, 2016

  8. [8]

    QLoRA: Efficient Finetuning of Quantized LLMs

    Tim Dettmers, Artidoro Pagnoni, Ari Holtzman, and Luke Zettlemoyer. Qlora: Efficient finetuning of quantized llms. arXiv preprint arXiv:2305.14314, 2023

  9. [9]

    Cbq: Cross-block quantization for large language models

    Xin Ding, Xiaoyu Liu, Zhijun Tu, Yun Zhang, Wei Li, Jie Hu, Hanting Chen, Yehui Tang, Zhiwei Xiong, Baoqun Yin, et al. Cbq: Cross-block quantization for large language models. arXiv preprint arXiv:2312.07950, 2023

  10. [10]

    The Llama 3 Herd of Models

    Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Amy Yang, Angela Fan, et al. The llama 3 herd of models. arXiv preprint arXiv:2407.21783, 2024

  11. [11]

    lm-evaluation-harness, 2021

    EleutherAI. lm-evaluation-harness, 2021. Accessed: 2025-01-14

  12. [12]

    GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers

    Elias Frantar, Saleh Ashkboos, Torsten Hoefler, and Dan Alistarh. Gptq: Accurate post-training quantization for generative pre-trained transformers. arXiv preprint arXiv:2210.17323, 2022. 13 Chunk 1 Chunk 2 Chunk 3 Chunk 4 Chunk 5 Chunk 6 Chunk 7 Chunk 8 Chunk 9 Chunk 10 Chunk 11 Chunk 12 Chunk 13 Chunk 14 Chunk 15 Chunk 16 Chunk 17 Chunk 18 Chunk 19 Chun...

  13. [13]

    Pt-bitnet: 1-bit large language model with post-training quantization.Available at SSRN 4987078

    Yufei Guo, Zecheng Hao, Jiahang Shao, Jie Zhou, Xiaode Liu, Xin Tong, Yuhan Zhang, Yuanpei Chen, Weihang Peng, and Zhe Ma. Pt-bitnet: 1-bit large language model with post-training quantization.Available at SSRN 4987078

  14. [14]

    Billm: Pushing the limit of post-training quantization for llms

    Wei Huang, Yangdong Liu, Haotong Qin, Ying Li, Shiming Zhang, Xianglong Liu, Michele Magno, and Xi- aojuan Qi. Billm: Pushing the limit of post-training quantization for llms. arXiv preprint arXiv:2402.04291, 2024

  15. [15]

    Loftq: Lora-fine-tuning-aware quantization for large language models

    Yixiao Li, Yifan Yu, Chen Liang, Pengcheng He, Nikos Karampatziakis, Weizhu Chen, and Tuo Zhao. Loftq: Lora-fine-tuning-aware quantization for large language models. arXiv preprint arXiv:2310.08659, 2023

  16. [16]

    Arb-llm: Alternating refined binarizations for large language models

    Zhiteng Li, Xianglong Yan, Tianao Zhang, Haotong Qin, Dong Xie, Jiang Tian, Linghe Kong, Yulun Zhang, Xiaokang Yang, et al. Arb-llm: Alternating refined binarizations for large language models. arXiv preprint arXiv:2410.03129, 2024

  17. [17]

    Awq: Activation-aware weight quantization for on-device llm compression and acceleration

    Ji Lin, Jiaming Tang, Haotian Tang, Shang Yang, Wei-Ming Chen, Wei-Chen Wang, Guangxuan Xiao, Xingyu Dang, Chuang Gan, and Song Han. Awq: Activation-aware weight quantization for on-device llm compression and acceleration. Proceedings of Machine Learning and Systems, 6:87–100, 2024

  18. [18]

    Rotated binary neural network

    Mingbao Lin, Rongrong Ji, Zihan Xu, Baochang Zhang, Yan Wang, Yongjian Wu, Feiyue Huang, and Chia-Wen Lin. Rotated binary neural network. Advances in neural information processing systems , 33:7474–7485, 2020

  19. [19]

    LLM-QAT: Data-free quantization aware training for large language models.arXiv preprint arXiv:2305.17888, 2023

    Zechun Liu, Barlas Oguz, Changsheng Zhao, Ernie Chang, Pierre Stock, Yashar Mehdad, Yangyang Shi, Raghuraman Krishnamoorthi, and Vikas Chandra. Llm-qat: Data-free quantization aware training for large language models. arXiv preprint arXiv:2305.17888, 2023. 14

  20. [20]

    Reactnet: Towards precise binary neural network with generalized activation functions

    Zechun Liu, Zhiqiang Shen, Marios Savvides, and Kwang-Ting Cheng. Reactnet: Towards precise binary neural network with generalized activation functions. In Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XIV 16, pages 143–159. Springer, 2020

  21. [21]

    Fbi-llm: Scaling up fully binarized llms from scratch via autoregressive distillation

    Liqun Ma, Mingjie Sun, and Zhiqiang Shen. Fbi-llm: Scaling up fully binarized llms from scratch via autoregressive distillation. arXiv preprint arXiv:2407.07093, 2024

  22. [22]

    Bitnet b1

    Shuming Ma, Hongyu Wang, Shaohan Huang, Xingxing Zhang, Ying Hu, Ting Song, Yan Xia, and Furu Wei. Bitnet b1. 58 2b4t technical report. arXiv preprint arXiv:2504.12285, 2025

  23. [23]

    The Era of 1-bit LLMs: All Large Language Models are in 1.58 Bits

    Shuming Ma, Hongyu Wang, Lingxiao Ma, Lei Wang, Wenhui Wang, Shaohan Huang, Li Dong, Ruiping Wang, Jilong Xue, and Furu Wei. The era of 1-bit llms: All large language models are in 1.58 bits. arXiv preprint arXiv:2402.17764, 2024

  24. [24]

    On the State of the Art of Evaluation in Neural Language Models

    Gábor Melis, Chris Dyer, and Phil Blunsom. On the state of the art of evaluation in neural language models. arXiv preprint arXiv:1707.05589, 2017

  25. [25]

    Pointer Sentinel Mixture Models

    Stephen Merity, Caiming Xiong, James Bradbury, and Richard Socher. Pointer sentinel mixture models. arXiv preprint arXiv:1609.07843, 2016

  26. [26]

    Can a suit of armor conduct electricity? a new dataset for open book question answering

    Todor Mihaylov, Peter Clark, Tushar Khot, and Ashish Sabharwal. Can a suit of armor conduct electricity? a new dataset for open book question answering. InConference on Empirical Methods in Natural Language Processing, 2018

  27. [27]

    Fine-tuning llms to 1.58bit: extreme quantization made easy, 2024

    Leandro von Werra Pedro Cuenca Omar Sanseviero Mohamed Mekkouri, Marc Sun and Thomas Wolf. Fine-tuning llms to 1.58bit: extreme quantization made easy, 2024

  28. [28]

    Low-bit quantization favors undertrained llms: Scaling laws for quantized llms with 100t training tokens

    Xu Ouyang, Tao Ge, Thomas Hartvigsen, Zhisong Zhang, Haitao Mi, and Dong Yu. Low-bit quantization favors undertrained llms: Scaling laws for quantized llms with 100t training tokens. arXiv preprint arXiv:2411.17691, 2024

  29. [29]

    Forward and backward information retention for accurate binary neural networks

    Haotong Qin, Ruihao Gong, Xianglong Liu, Mingzhu Shen, Ziran Wei, Fengwei Yu, and Jingkuan Song. Forward and backward information retention for accurate binary neural networks. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 2250–2259, 2020

  30. [30]

    Exploring the limits of transfer learning with a unified text-to-text transformer

    Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J Liu. Exploring the limits of transfer learning with a unified text-to-text transformer. Journal of machine learning research, 21(140):1–67, 2020

  31. [31]

    Xnor-net: Imagenet classifi- cation using binary convolutional neural networks

    Mohammad Rastegari, Vicente Ordonez, Joseph Redmon, and Ali Farhadi. Xnor-net: Imagenet classifi- cation using binary convolutional neural networks. In European conference on computer vision, pages 525–542. Springer, 2016

  32. [32]

    Winogrande: An adversarial winograd schema challenge at scale

    Keisuke Sakaguchi, Ronan Le Bras, Chandra Bhagavatula, and Yejin Choi. Winogrande: An adversarial winograd schema challenge at scale. Communications of the ACM, 64(9):99–106, 2021

  33. [33]

    Pb-llm: Partially binarized large language models

    Yuzhang Shang, Zhihang Yuan, Qiang Wu, and Zhen Dong. Pb-llm: Partially binarized large language models. arXiv preprint arXiv:2310.00034, 2023

  34. [34]

    Omniquant: Omnidirectionally calibrated quantization for large language models

    Wenqi Shao, Mengzhao Chen, Zhaoyang Zhang, Peng Xu, Lirui Zhao, Zhiqian Li, Kaipeng Zhang, Peng Gao, Yu Qiao, and Ping Luo. Omniquant: Omnidirectionally calibrated quantization for large language models. arXiv preprint arXiv:2308.13137, 2023

  35. [35]

    A survey on transformer compression

    Yehui Tang, Yunhe Wang, Jianyuan Guo, Zhijun Tu, Kai Han, Hailin Hu, and Dacheng Tao. A survey on transformer compression. arXiv preprint arXiv:2402.05964, 2024

  36. [36]

    Llama 2: Open Foundation and Fine-Tuned Chat Models

    Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, et al. Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288, 2023

  37. [37]

    Adabin: Improving binary neural networks with adaptive binary sets

    Zhijun Tu, Xinghao Chen, Pengju Ren, and Yunhe Wang. Adabin: Improving binary neural networks with adaptive binary sets. In European conference on computer vision, pages 379–395. Springer, 2022

  38. [38]

    BitNet: Scaling 1-bit Transformers for Large Language Models

    Hongyu Wang, Shuming Ma, Li Dong, Shaohan Huang, Huaijie Wang, Lingxiao Ma, Fan Yang, Ruiping Wang, Yi Wu, and Furu Wei. Bitnet: Scaling 1-bit transformers for large language models. arXiv preprint arXiv:2310.11453, 2023

  39. [39]

    Redpajama: an open dataset for training large language models

    Maurice Weber, Daniel Fu, Quentin Anthony, Yonatan Oren, Shane Adams, Anton Alexandrov, Xiaozhong Lyu, Huu Nguyen, Xiaozhe Yao, Virginia Adams, et al. Redpajama: an open dataset for training large language models. arXiv preprint arXiv:2411.12372, 2024. 15

  40. [40]

    T-mac: Cpu renaissance via table lookup for low-bit llm deployment on edge, 2024

    Jianyu Wei, Shijie Cao, Ting Cao, Lingxiao Ma, Lei Wang, Yanyong Zhang, and Mao Yang. T-mac: Cpu renaissance via table lookup for low-bit llm deployment on edge, 2024

  41. [41]

    Smoothquant: Accurate and efficient post-training quantization for large language models

    Guangxuan Xiao, Ji Lin, Mickael Seznec, Hao Wu, Julien Demouth, and Song Han. Smoothquant: Accurate and efficient post-training quantization for large language models. In International Conference on Machine Learning, pages 38087–38099. PMLR, 2023

  42. [42]

    Onebit: Towards extremely low-bit large language models

    Yuzhuang Xu, Xu Han, Zonghan Yang, Shuo Wang, Qingfu Zhu, Zhiyuan Liu, Weidong Liu, and Wanxiang Che. Onebit: Towards extremely low-bit large language models. arXiv preprint arXiv:2402.11295, 2024

  43. [43]

    Qwen3 technical report, 2025

    An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, Chujie Zheng, Dayiheng Liu, Fan Zhou, Fei Huang, Feng Hu, Hao Ge, Haoran Wei, Huan Lin, Jialong Tang, Jian Yang, Jianhong Tu, Jianwei Zhang, Jianxin Yang, Jiaxi Yang, Jing Zhou, Jingren Zhou, Junyang Lin, Kai Dang, Keqin Bao, Kexin Yang, ...

  44. [44]

    Hellaswag: Can a machine really finish your sentence? In Annual Meeting of the Association for Computational Linguistics, 2019

    Rowan Zellers, Ari Holtzman, Yonatan Bisk, Ali Farhadi, and Yejin Choi. Hellaswag: Can a machine really finish your sentence? In Annual Meeting of the Association for Computational Linguistics, 2019

  45. [45]

    OPT: Open Pre-trained Transformer Language Models

    Susan Zhang, Stephen Roller, Naman Goyal, Mikel Artetxe, Moya Chen, Shuohui Chen, Christopher Dewan, Mona Diab, Xian Li, Xi Victoria Lin, et al. Opt: Open pre-trained transformer language models. arXiv preprint arXiv:2205.01068, 2022

  46. [46]

    A Survey of Large Language Models

    Wayne Xin Zhao, Kun Zhou, Junyi Li, Tianyi Tang, Xiaolei Wang, Yupeng Hou, Yingqian Min, Be- ichen Zhang, Junjie Zhang, Zican Dong, et al. A survey of large language models. arXiv preprint arXiv:2303.18223, 2023

  47. [47]

    An empirical study of qwen3 quantization

    Xingyu Zheng, Yuye Li, Haoran Chu, Yue Feng, Xudong Ma, Jie Luo, Jinyang Guo, Haotong Qin, Michele Magno, and Xianglong Liu. An empirical study of qwen3 quantization. arXiv preprint arXiv:2505.02214, 2025

  48. [48]

    Incremental Network Quantization: Towards Lossless CNNs with Low-Precision Weights

    Aojun Zhou, Anbang Yao, Yiwen Guo, Lin Xu, and Yurong Chen. Incremental network quantization: Towards lossless cnns with low-precision weights. arXiv preprint arXiv:1702.03044, 2017. 16