pith. machine review for the scientific record.

arxiv: 2605.04062 · v1 · submitted 2026-04-10 · 💻 cs.LG · cs.AI

Recognition: 2 theorem links · Lean Theorem

EdgeRazor: A Lightweight Framework for Large Language Models via Mixed-Precision Quantization-Aware Distillation

Authors on Pith: no claims yet

Pith reviewed 2026-05-10 18:00 UTC · model grok-4.3

classification 💻 cs.LG cs.AI
keywords quantization · large language models · knowledge distillation · mixed precision · model compression · edge deployment

The pith

EdgeRazor compresses large language models to 1.88 bits while beating all 3-bit methods and using 4-10 times less training compute.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents EdgeRazor, a framework that combines mixed-precision quantization-aware distillation, adaptive feature selection from a 16-bit teacher, and entropy-aware KL divergence to produce extremely low-bit LLMs. The entropy-aware loss balances forward and reverse signals using only the teacher's output distribution, removing the need for manual feature choices or extra tuning. This setup yields models that outperform higher-bit baselines on base, instruction-tuned, and multimodal LLMs. A sympathetic reader would care because the approach makes high-performance models feasible on resource-limited hardware with dramatically lower training costs and storage needs.
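
To make the mechanism concrete, here is a minimal sketch of what an entropy-aware forward/reverse KL mix could look like, written from the abstract's description rather than the paper's equations; the temperature, the linear entropy weighting, and which KL direction dominates when the teacher is uncertain are all assumptions of this sketch.

```python
import math

import torch
import torch.nn.functional as F


def entropy_aware_kl(student_logits: torch.Tensor,
                     teacher_logits: torch.Tensor,
                     temperature: float = 1.0) -> torch.Tensor:
    """Illustrative entropy-aware KL loss: the forward/reverse mix is set per
    token by the teacher's normalized output entropy, so no balance
    hyperparameter is tuned. Not the paper's exact formulation."""
    log_p_t = F.log_softmax(teacher_logits / temperature, dim=-1)
    log_p_s = F.log_softmax(student_logits / temperature, dim=-1)
    p_t, p_s = log_p_t.exp(), log_p_s.exp()

    # Normalized teacher entropy in [0, 1]; high values mean an uncertain teacher.
    vocab = teacher_logits.size(-1)
    h_t = -(p_t * log_p_t).sum(-1) / math.log(vocab)

    # Forward KL(teacher || student) is mass-covering; reverse KL(student || teacher)
    # is mode-seeking. Tying the mix to h_t is a design choice of this sketch.
    kl_fwd = (p_t * (log_p_t - log_p_s)).sum(-1)
    kl_rev = (p_s * (log_p_s - log_p_t)).sum(-1)
    return (h_t * kl_fwd + (1.0 - h_t) * kl_rev).mean()
```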

Core claim

EdgeRazor with 1.88-bit mixed-precision quantization surpasses all 3-bit contenders and outperforms leading 2-bit post-training quantization methods by 11.3 points, while requiring a 4-10 times smaller training budget than the leading quantization-aware training approach; the 1.58-bit Qwen3-0.6B model reduces storage from 1.41 GB to 0.28 GB and accelerates decoding by 15.1 times relative to the 16-bit baseline.
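
As a rough sanity check on the storage numbers, a weight-only floor of parameters × bits / 8 bytes can be compared against the reported checkpoint sizes; the gap would be explained by tensors kept at higher precision plus quantization metadata such as scales and zero points. The sketch below is our back-of-envelope estimate, not the paper's accounting.

```python
def weight_storage_gb(n_params: float, bits_per_weight: float) -> float:
    """Weight-only storage floor in GB: parameters * bits / 8 bytes. Real
    checkpoints also carry higher-precision tensors (e.g. embeddings) and
    quantization metadata, so reported sizes sit above this floor."""
    return n_params * bits_per_weight / 8 / 1e9

# Qwen3-0.6B, illustrative only.
print(weight_storage_gb(0.6e9, 16.0))   # ~1.20 GB floor vs. 1.41 GB reported at 16-bit
print(weight_storage_gb(0.6e9, 1.58))   # ~0.12 GB floor vs. 0.28 GB reported at 1.58-bit
```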

What carries the argument

Adaptive Feature Distillation that derives an n-bit student from its 16-bit teacher, combined with Entropy-Aware KL Divergence whose forward-reverse balance is determined solely by the teacher's output distribution.
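
The abstract does not spell out the adaptive selection rule or the mixed-precision assignment, so the sketch below shows only the generic starting point for quantization-aware distillation: derive an n-bit student by quantize-dequantizing teacher-initialized weights with a straight-through estimator, so distillation gradients still update the latent full-precision copy. Integer bit widths and symmetric per-tensor scaling are assumptions here; fractional widths such as 1.58-bit would need a ternary or codebook scheme this does not cover.

```python
import torch


def quantize_weight_ste(w: torch.Tensor, n_bits: int) -> torch.Tensor:
    """Symmetric per-tensor quantize-dequantize of a (teacher-initialized)
    weight to n bits, with a straight-through estimator: the forward pass
    uses the quantized values, the backward pass treats quantization as
    identity. Generic sketch, not EdgeRazor's assignment policy."""
    qmax = 2 ** (n_bits - 1) - 1
    scale = w.abs().max().clamp(min=1e-8) / qmax
    w_q = (w / scale).round().clamp(-qmax - 1, qmax) * scale
    return w + (w_q - w).detach()
```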

If this is right

  • EdgeRazor achieves higher compression ratios than existing PTQ and QAT methods at every tested bit width.
  • The 1.88-bit models maintain superior performance to 3-bit methods across base, instruction-tuned, and multimodal LLMs.
  • Training budgets drop by 4-10 times compared with leading quantization-aware training approaches.
  • Storage for a 0.6B model falls from 1.41 GB to 0.28 GB while decoding speeds increase by 15.1 times.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The entropy-balancing idea could extend to other distillation settings to cut hyperparameter search in non-quantization tasks.
  • If the method works without per-model tuning, it opens the door to automated low-bit pipelines for rapidly emerging LLM families.
  • Further tests on sub-1-bit regimes or on-device inference hardware would show where the compression-accuracy tradeoff breaks.

Load-bearing premise

The entropy-aware KL divergence, whose forward-reverse balance is set solely by the teacher's output distribution, provides stable and generalizable training signals across different model families and datasets without requiring per-model hyperparameter search or additional validation sets.
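
One way to state this premise formally, in our notation rather than the paper's: with teacher distribution $P_T$ and student distribution $P_S$ at a given token, the loss is assumed to take the shape

```latex
\mathcal{L}_{\mathrm{EAKLD}}
  = \alpha\bigl(H(P_T)\bigr)\,\mathrm{KL}(P_T \,\|\, P_S)
  + \Bigl(1 - \alpha\bigl(H(P_T)\bigr)\Bigr)\,\mathrm{KL}(P_S \,\|\, P_T),
\qquad
H(P_T) = -\sum_{v} P_T(v)\log P_T(v),
```

where the weighting $\alpha$ is a function of the teacher entropy $H(P_T)$ alone. The load-bearing part is that this single teacher-side quantity is claimed to suffice to stabilize training across families and datasets without tuning $\alpha$ per model or holding out a validation set.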

What would settle it

Apply EdgeRazor without any per-model adjustments to a new LLM family such as Llama-3 variants and measure whether the resulting 1.88-bit accuracy falls below that of standard 3-bit post-training quantization methods on the same benchmarks.

Figures

Figures reproduced from arXiv: 2605.04062 by Chen Wu, Le-Tong Huang, Nan Li, Shao-Qun Zhang, Shu-Hao Zhang, Xiang-Sheng Deng, Xin-Yi Zou.

Figure 1. Overview of the EdgeRazor framework: a 16-bit teacher guides an n-bit mixed-precision student through a joint objective of task-specific cross-entropy, AFD, and EAKLD.
Figure 2. Average performance of quantized Qwen3 under E…
Original abstract

Recent years have witnessed an increasing interest in deploying LLMs on resource-constrained devices, among which quantization has emerged as a promising lightweight technique that converts full-precision model weights and activations into lower-bit formats. Existing weight quantization approaches can be roughly divided into three categories: Post-Training Quantization (PTQ) that calibrates quantized parameters on a small dataset without retraining but suffers from severe performance degradation below 4-bit, Quantization-Aware Training (QAT) that searches low-bit parameters using surrogate gradients but demands substantial computational resources, and Quantization-Aware Distillation that integrates QAT with knowledge transfer from a full-precision teacher but manually selects features to distill and relies heavily on teacher-specific data. In this paper, we propose EdgeRazor, a lightweight framework for LLMs with mixed-precision and extremely low-bit weight quantization. The EdgeRazor framework contains three modules: Mixed-Precision Quantization-Aware Distillation for the fine-grained control of precision, Adaptive Feature Distillation that derives an $n$-bit student from its 16-bit teacher, and Entropy-Aware KL Divergence on both human-annotated and distilled datasets, whose forward-reverse balance is determined solely by the teacher's output distribution. Empirical investigations of EdgeRazor are conducted on base, instruction-tuned, and multimodal LLMs. Notably, EdgeRazor with 1.88-bit surpasses all contenders with the 3-bit precision, especially outperforms the leading 2-bit PTQ methods by 11.3 points, within a 4-10$\times$ lower training budget than the leading QAT approach. EdgeRazor delivers higher compression ratios at all bit width; the 1.58-bit Qwen3-0.6B reduces storage from 1.41 GB to 0.28 GB while accelerating decoding by 15.1$\times$ relative to the 16-bit baseline.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper proposes EdgeRazor, a lightweight framework for mixed-precision and extremely low-bit weight quantization of LLMs via quantization-aware distillation. It consists of three modules: Mixed-Precision Quantization-Aware Distillation for fine-grained precision control, Adaptive Feature Distillation to derive an n-bit student from a 16-bit teacher, and Entropy-Aware KL Divergence (applied to both human-annotated and distilled data) whose forward-reverse weighting is set solely by the teacher's output distribution. Evaluations on base, instruction-tuned, and multimodal LLMs claim that the 1.88-bit variant surpasses all 3-bit contenders and outperforms leading 2-bit PTQ methods by 11.3 points, while using 4-10× lower training budget than leading QAT; additional claims include higher compression ratios and up to 15.1× decoding speedup (e.g., 1.58-bit Qwen3-0.6B reducing storage from 1.41 GB to 0.28 GB).

Significance. If the performance and efficiency claims hold, the work could meaningfully advance practical deployment of LLMs on edge devices by combining the low overhead of PTQ with the accuracy recovery of QAT and distillation, while introducing an entropy-driven loss that avoids manual feature selection. The reported gains at sub-2-bit precision with reduced training cost would be notable if the entropy-aware mechanism proves robust without per-model tuning. The framework's applicability to multimodal models is a secondary strength.

major comments (2)
  1. [Experiments] Experiments section: The headline result that 1.88-bit EdgeRazor outperforms leading 2-bit PTQ methods by 11.3 points (and all 3-bit methods) must be accompanied by explicit baseline re-implementations, identical calibration data, and statistical significance (multiple seeds or error bars); without these, the cross-method comparison cannot be evaluated as load-bearing evidence for the framework's superiority.
  2. [Entropy-Aware KL Divergence] Entropy-Aware KL Divergence module: The central claim that the forward-reverse balance is determined solely by the teacher's softmax entropy (requiring no per-model hyperparameter search or extra validation sets) is load-bearing for the 'lightweight' and 'generalizable' framing; this requires explicit ablations across model families (e.g., Llama vs. Qwen) and task distributions showing stable performance without dataset-specific calibration.
minor comments (2)
  1. [Abstract] Abstract: Concrete numerical claims (11.3 points, 4-10× budget, 15.1× speedup) are presented without naming the exact models, datasets, or task metrics used; adding one sentence with these details would improve readability.
  2. [Adaptive Feature Distillation] Notation: The description of 'Adaptive Feature Distillation that derives an n-bit student' uses n without defining its range or selection criterion in the main text; a brief equation or table entry would clarify.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the thoughtful and constructive feedback. The two major comments highlight important aspects of experimental rigor and the robustness of our proposed Entropy-Aware KL Divergence. We address each point below and will incorporate the suggested improvements in the revised manuscript.

Point-by-point responses
  1. Referee: [Experiments] Experiments section: The headline result that 1.88-bit EdgeRazor outperforms leading 2-bit PTQ methods by 11.3 points (and all 3-bit methods) must be accompanied by explicit baseline re-implementations, identical calibration data, and statistical significance (multiple seeds or error bars); without these, the cross-method comparison cannot be evaluated as load-bearing evidence for the framework's superiority.

    Authors: We agree that direct re-implementations with matched conditions are necessary to make the 11.3-point claim fully load-bearing. The original manuscript relied on publicly reported numbers from the respective baseline papers together with standard calibration sets. In the revision we will re-implement the leading 2-bit PTQ baselines using exactly the same calibration data employed for EdgeRazor, run all methods with at least three random seeds, and report means with standard deviations or error bars in the Experiments section. This will be added as a new comparative table. revision: yes

  2. Referee: [Entropy-Aware KL Divergence] Entropy-Aware KL Divergence module: The central claim that the forward-reverse balance is determined solely by the teacher's softmax entropy (requiring no per-model hyperparameter search or extra validation sets) is load-bearing for the 'lightweight' and 'generalizable' framing; this requires explicit ablations across model families (e.g., Llama vs. Qwen) and task distributions showing stable performance without dataset-specific calibration.

    Authors: We concur that dedicated ablations are required to substantiate the claim of no per-model tuning. While the manuscript already evaluates EdgeRazor on Llama, Qwen, and multimodal families, we will add a focused ablation subsection in the revision. It will compare entropy-aware weighting against fixed-weight KL variants on Llama-2 and Qwen models across language-modeling, instruction-following, and multimodal tasks, confirming that performance remains stable without dataset-specific calibration or extra validation sets. Results will appear in a new table. revision: yes

Circularity Check

0 steps flagged

No significant circularity in derivation chain

Full rationale

The paper's core contributions—Mixed-Precision Quantization-Aware Distillation, Adaptive Feature Distillation, and Entropy-Aware KL Divergence with balance set by the teacher's output distribution—are presented as algorithmic modules that take external teacher outputs and standard distillation losses as inputs. No equations or performance claims are shown to reduce by construction to fitted parameters that directly encode the reported accuracy gains or compression ratios. The framework relies on independent teacher models and conventional objectives rather than self-referential definitions or self-citation load-bearing uniqueness theorems. This leaves the empirical results as falsifiable outcomes of the proposed training procedure rather than tautological restatements of the inputs.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The framework rests on standard quantization and distillation assumptions plus one domain-specific assumption about teacher uncertainty guiding loss balance; no new physical entities or heavily fitted constants are introduced in the abstract.

axioms (1)
  • Domain assumption: The teacher's output distribution alone suffices to determine the forward-reverse KL balance without additional validation data or model-specific tuning.
    Directly stated in the description of Entropy-Aware KL Divergence.

pith-pipeline@v0.9.0 · 5675 in / 1232 out tokens · 42743 ms · 2026-05-10T18:00:49.049966+00:00 · methodology


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
  • matches: The paper's claim is directly supported by a theorem in the formal canon.
  • supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
  • extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
  • uses: The paper appears to rely on the theorem as machinery.
  • contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
  • unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

60 extracted references · 17 canonical work pages · 11 internal anchors

  1. [1]

    GPT-4 Technical Report

    Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. GPT-4 technical report.arXiv preprint arXiv:2303.08774, 2023

  2. [2]

    QuaRot: Outlier-free 4-bit inference in rotated LLMs

    Saleh Ashkboos, Amirkeivan Mohtashami, Maximilian L Croci, Bo Li, Pashmina Cameron, Martin Jaggi, Dan Alistarh, Torsten Hoefler, and James Hensman. QuaRot: Outlier-free 4-bit inference in rotated LLMs. InAdvances in Neural Information Processing Systems 37, pages 100213–100240, 2024

  3. [3]

    Estimating or Propagating Gradients Through Stochastic Neurons for Conditional Computation

    Yoshua Bengio, Nicholas Léonard, and Aaron Courville. Estimating or propagating gradients through stochastic neurons for conditional computation.arXiv preprint arXiv:1308.3432, 2013

  4. [4]

    PIQA: Reasoning about physical commonsense in natural language

    Yonatan Bisk, Rowan Zellers, Ronan Le Bras, Jianfeng Gao, and Yejin Choi. PIQA: Reasoning about physical commonsense in natural language. In Proceedings of the 34th AAAI Conference on Artificial Intelligence, pages 7432–7439, 2020

  5. [5]

    Evaluating Large Language Models Trained on Code

    Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Ponde De Oliveira Pinto, Jared Kaplan, Harri Edwards, Yuri Burda, Nicholas Joseph, Greg Brockman, et al. Evaluating large language models trained on code.arXiv preprint arXiv:2107.03374, 2021

  6. [6]

    EfficientQAT: Efficient quantization-aware training for large language models

    Mengzhao Chen, Wenqi Shao, Peng Xu, Jiahao Wang, Peng Gao, Kaipeng Zhang, Yu Qiao, and Ping Luo. EfficientQAT: Efficient quantization-aware training for large language models. InProceedings of the 63rd Annual Meeting of the Association for Computational Linguistics, pages 10081–10100, 2025

  7. [7]

    Optimize weight rounding via signed gradient descent for the quantization of LLMs

    Wenhua Cheng, Weiwei Zhang, Haihao Shen, Yiyang Cai, Xin He, Lv Kaokao, and Yi Liu. Optimize weight rounding via signed gradient descent for the quantization of LLMs. InFindings of the Association for Computational Linguistics: EMNLP 2024, pages 11332–11350, 2024

  8. [8]

    BoolQ: Exploring the surprising difficulty of natural yes/no questions

    Christopher Clark, Kenton Lee, Ming-Wei Chang, Tom Kwiatkowski, Michael Collins, and Kristina Toutanova. BoolQ: Exploring the surprising difficulty of natural yes/no questions. InProceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics, pages 2924–2936, 2019

  9. [9]

    Think you have Solved Question Answering? Try ARC, the AI2 Reasoning Challenge

    Peter Clark, Isaac Cowhey, Oren Etzioni, Tushar Khot, Ashish Sabharwal, Carissa Schoenick, and Oyvind Tafjord. Think you have solved question answering? Try ARC, the AI2 reasoning challenge.arXiv preprint arXiv:1803.05457, 2018

  10. [10]

    Training Verifiers to Solve Math Word Problems

    Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser, Matthias Plappert, Jerry Tworek, Jacob Hilton, Reiichiro Nakano, et al. Training verifiers to solve math word problems.arXiv preprint arXiv:2110.14168, 2021

  11. [11]

    The case for 4-bit precision: K-bit inference scaling laws

    Tim Dettmers and Luke Zettlemoyer. The case for 4-bit precision: K-bit inference scaling laws. In Proceedings of the 40th International Conference on Machine Learning, pages 7750–7774, 2023

  12. [12]

    BitDistiller: Unleashing the potential of sub-4-bit LLMs via self-distillation

    Dayou Du, Yijia Zhang, Shijie Cao, Jiaqi Guo, Ting Cao, Xiaowen Chu, and Ningyi Xu. BitDistiller: Unleashing the potential of sub-4-bit LLMs via self-distillation. InProceedings of the 62nd Annual Meeting of the Association for Computational Linguistics, pages 102–116, 2024

  13. [13]

    Extreme compression of large language models via additive quantization

    Vage Egiazarian, Andrei Panferov, Denis Kuznedelev, Elias Frantar, Artem Babenko, and Dan Alistarh. Extreme compression of large language models via additive quantization. InProceedings of the 41st International Conference on Machine Learning, pages 12284–12303, 2024

  14. [14]

    How contextual are contextualized word representations? Comparing the geometry of BERT, ELMo, and GPT-2 embeddings

    Kawin Ethayarajh. How contextual are contextualized word representations? Comparing the geometry of BERT, ELMo, and GPT-2 embeddings. InProceedings of the 2019 Conference on Empirical Methods in Natural Language Processing, pages 55–65, 2019

  15. [15]

    GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers

    Elias Frantar, Saleh Ashkboos, Torsten Hoefler, and Dan Alistarh. GPTQ: Accurate post-training quantization for generative pre-trained transformers. arXiv preprint arXiv:2210.17323, 2022

  16. [16]

    Video-MME: The first-ever comprehensive evaluation benchmark of multi-modal LLMs in video analysis

    Chaoyou Fu, Yuhan Dai, Yongdong Luo, Lei Li, Shuhuai Ren, Renrui Zhang, Zihan Wang, Chenyu Zhou, Yunhang Shen, Mengdan Zhang, et al. Video-MME: The first-ever comprehensive evaluation benchmark of multi-modal LLMs in video analysis. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 24108–24118, 2025

  17. [17]

    APTQ: Attention-aware post- training mixed-precision quantization for large language models

    Ziyi Guan, Hantao Huang, Yupeng Su, Hong Huang, Ngai Wong, and Hao Yu. APTQ: Attention-aware post- training mixed-precision quantization for large language models. InProceedings of the 61st ACM/IEEE Design Automation Conference, pages 1–6, 2024

  18. [18]

    Aligning AI with Shared Human Values

    Dan Hendrycks, Collin Burns, Steven Basart, Andrew Critch, Jerry Li, Dawn Song, and Jacob Steinhardt. Aligning AI with shared human values. arXiv preprint arXiv:2008.02275, 2020

  19. [19]

    Measuring Massive Multitask Language Understanding

    Dan Hendrycks, Collin Burns, Steven Basart, Andy Zou, Mantas Mazeika, Dawn Song, and Jacob Steinhardt. Measuring massive multitask language understanding.arXiv preprint arXiv:2009.03300, 2020

  20. [20]

    Rethinking channel dimensions to isolate outliers for low-bit weight quantization of large language models

    Jung Hwan Heo, Jeonghoon Kim, Beomseok Kwon, Byeongwook Kim, Se Jung Kwon, and Dongsoo Lee. Rethinking channel dimensions to isolate outliers for low-bit weight quantization of large language models. InProceedings of the 12th International Conference on Learning Representations, pages 12744–12762, 2024

  21. [21]

    Distilling the Knowledge in a Neural Network

    Geoffrey Hinton, Oriol Vinyals, and Jeff Dean. Distilling the knowledge in a neural network.arXiv preprint arXiv:1503.02531, 2015

  22. [22]

    BiLLM: Pushing the limit of post-training quantization for LLMs

    Wei Huang, Yangdong Liu, Haotong Qin, Ying Li, Shiming Zhang, Xianglong Liu, Michele Magno, and Xiaojuan Qi. BiLLM: Pushing the limit of post-training quantization for LLMs. In Proceedings of the 41st International Conference on Machine Learning, pages 20023–20042, 2024

  23. [23]

    SliM-LLM: Salience-driven mixed-precision quantization for large language models

    Wei Huang, Haotong Qin, Yangdong Liu, Yawei Li, Qinshuo Liu, Xianglong Liu, Luca Benini, Michele Magno, Shiming Zhang, and Xiaojuan Qi. SliM-LLM: Salience-driven mixed-precision quantization for large language models. InProceedings of the 42nd International Conference on Machine Learning, pages 25672–25692, 2025

  24. [24]

    Q-Palette: Fractional-bit quantizers toward optimal bit allocation for efficient LLM deployment

    Deokjae Lee and Hyun Oh Song. Q-Palette: Fractional-bit quantizers toward optimal bit allocation for efficient LLM deployment. arXiv preprint arXiv:2509.20214, 2025

  25. [25]

    Infinity instruct: Scaling instruction selection and synthesis to enhance language models

    Jijie Li, Li Du, Hanyu Zhao, Bo wen Zhang, Liangdong Wang, Boyan Gao, Guang Liu, and Yonghua Lin. Infinity Instruct: Scaling instruction selection and synthesis to enhance language models.arXiv preprint arXiv:2506.11116, 2025

  26. [26]

    GPTAQ: Efficient finetuning-free quantization for asymmetric calibration

    Yuhang Li, Ruokai Yin, Donghyun Lee, Shiting Xiao, and Priyadarshini Panda. GPTAQ: Efficient finetuning-free quantization for asymmetric calibration. InProceedings of the 42nd International Confer- ence on Machine Learning, pages 36690–36706, 2025

  27. [27]

    TGIF: A new dataset and benchmark on animated gif description

    Yuncheng Li, Yale Song, Liangliang Cao, Joel Tetreault, Larry Goldberg, Alejandro Jaimes, and Jiebo Luo. TGIF: A new dataset and benchmark on animated gif description. InProceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4641–4650, 2016

  28. [28]

    ARB-LLM: Alternating refined binarizations for large language models

    Zhiteng Li, Xianglong Yan, Tianao Zhang, Haotong Qin, Dong Xie, Jiang Tian, Zhongchao Shi, Linghe Kong, Yulun Zhang, and Xiaokang Yang. ARB-LLM: Alternating refined binarizations for large language models. InProceedings of the 13th International Conference on Learning Representations, pages 93900– 93912, 2025

  29. [29]

    AWQ: Activation-aware weight quantization for on-device LLM compression and acceleration

    Ji Lin, Jiaming Tang, Haotian Tang, Shang Yang, Wei-Ming Chen, Wei-Chen Wang, Guangxuan Xiao, Xingyu Dang, Chuang Gan, and Song Han. AWQ: Activation-aware weight quantization for on-device LLM compression and acceleration. InProceedings of the 6th Conference on Machine Learning and Systems, volume 6, pages 87–100, 2024

  30. [30]

    TruthfulQA: Measuring how models mimic human falsehoods

    Stephanie Lin, Jacob Hilton, and Owain Evans. TruthfulQA: Measuring how models mimic human falsehoods. InProceedings of the 60th Annual Meeting of the Association for Computational Linguistics, pages 3214–3252, 2022

  31. [31]

    QServe: W4A8KV4 quantization and system co-design for efficient LLM serving

    Yujun Lin, Haotian Tang, Shang Yang, Zhekai Zhang, Guangxuan Xiao, Chuang Gan, and Song Han. QServe: W4A8KV4 quantization and system co-design for efficient LLM serving. InProceedings of the 7th Conference on Machine Learning and Systems, 2025

  32. [32]

    VPTQ: Extreme low-bit vector post-training quantization for large language models

    Yifei Liu, Jicheng Wen, Yang Wang, Shengyu Ye, Li Lyna Zhang, Ting Cao, Cheng Li, and Mao Yang. VPTQ: Extreme low-bit vector post-training quantization for large language models. InProceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pages 8181–8196, 2024

  33. [33]

    LLM-QAT: Data-free quantization aware training for large language models

    Zechun Liu, Barlas Oguz, Changsheng Zhao, Ernie Chang, Pierre Stock, Yashar Mehdad, Yangyang Shi, Raghuraman Krishnamoorthi, and Vikas Chandra. LLM-QAT: Data-free quantization aware training for large language models. arXiv preprint arXiv:2305.17888, 2023

  34. [34]

    ParetoQ: Scaling Laws in Extremely Low-bit LLM Quantization

    Zechun Liu, Changsheng Zhao, Igor Fedorov, Bilge Soran, Dhruv Choudhary, Raghuraman Krishnamoorthi, Vikas Chandra, Yuandong Tian, and Tijmen Blankevoort. ParetoQ: Scaling laws in extremely low-bit LLM quantization. arXiv preprint arXiv:2502.02631, 2025

  35. [35]

    SpinQuant: LLM quantization with learned rotations

    Zechun Liu, Changsheng Zhao, Igor Fedorov, Bilge Soran, Dhruv Choudhary, Raghuraman Krishnamoorthi, Vikas Chandra, Yuandong Tian, and Tijmen Blankevoort. SpinQuant: LLM quantization with learned rotations. InProceedings of the 13th International Conference on Learning Representations, pages 92009–92032, 2025

  36. [36]

    MobileLLM: Optimizing sub-billion parameter language models for on-device use cases

    Zechun Liu, Changsheng Zhao, Forrest Iandola, Chen Lai, Yuandong Tian, Igor Fedorov, Yunyang Xiong, Ernie Chang, Yangyang Shi, Raghuraman Krishnamoorthi, et al. MobileLLM: Optimizing sub-billion parameter language models for on-device use cases. InProceedings of the 41st International Conference on Machine Learning, pages 31267–31289, 2024

  37. [37]

    Can a suit of armor conduct electricity? A new dataset for open book question answering

    Todor Mihaylov, Peter Clark, Tushar Khot, and Ashish Sabharwal. Can a suit of armor conduct electricity? A new dataset for open book question answering. InProceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 2381–2391, 2018

  38. [38]

    WinoGrande: An adversarial Winograd schema challenge at scale

    Keisuke Sakaguchi, Ronan Le Bras, Chandra Bhagavatula, and Yejin Choi. WinoGrande: An adversarial Winograd schema challenge at scale. Communications of the ACM, 64(9):99–106, 2021

  39. [39]

    Social IQa: Commonsense reasoning about social interactions

    Maarten Sap, Hannah Rashkin, Derek Chen, Ronan Le Bras, and Yejin Choi. Social IQa: Commonsense reasoning about social interactions. InProceedings of the 2019 Conference on Empirical Methods in Natural Language Processing, pages 4463–4473, 2019

  40. [40]

    OmniQuant: Omnidirectionally calibrated quantization for large language models

    Wenqi Shao, Mengzhao Chen, Zhaoyang Zhang, Peng Xu, Lirui Zhao, Zhiqian Li, Kaipeng Zhang, Peng Gao, Yu Qiao, and Ping Luo. OmniQuant: Omnidirectionally calibrated quantization for large language models. InProceedings of the 12th International Conference on Learning Representations, pages 45472–45496, 2024

  41. [41]

    FlatQuant: Flatness matters for LLM quantization

    Yuxuan Sun, Ruikang Liu, Haoli Bai, Han Bao, Kang Zhao, Yuening Li, Jiaxin Hu, Xianzhi Yu, Lu Hou, Chun Yuan, Xin Jiang, Wulong Liu, and Jun Yao. FlatQuant: Flatness matters for LLM quantization. In Proceedings of the 42nd International Conference on Machine Learning, pages 57587–57613, 2025

  42. [42]

    MobileQuant: Mobile-friendly quantization for on-device language models

    Fuwen Tan, Royson Lee, Łukasz Dudziak, Shell Xu Hu, Sourav Bhattacharya, Timothy Hospedales, Georgios Tzimiropoulos, and Brais Martinez. MobileQuant: Mobile-friendly quantization for on-device language models. InFindings of the Association for Computational Linguistics: EMNLP 2024, pages 9761–9771, 2024

  43. [43]

    BERT rediscovers the classical NLP pipeline

    Ian Tenney, Dipanjan Das, and Ellie Pavlick. BERT rediscovers the classical NLP pipeline. InProceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 4593–4601, 2019

  44. [44]

    QuIP#: Even better LLM quantization with hadamard incoherence and lattice codebooks

    Albert Tseng, Jerry Chee, Qingyao Sun, Volodymyr Kuleshov, and Christopher De Sa. QuIP#: Even better LLM quantization with hadamard incoherence and lattice codebooks. In Proceedings of the 41st International Conference on Machine Learning, pages 48630–48656, 2024

  45. [45]

    QTIP: Quantization with trellises and incoherence processing

    Albert Tseng, Qingyao Sun, David Hou, and Christopher M De Sa. QTIP: Quantization with trellises and incoherence processing. InAdvances in Neural Information Processing Systems 37, pages 59597–59620, 2024

  46. [46]

    BitNet: 1-bit pre-training for large language models

    Hongyu Wang, Shuming Ma, Lingxiao Ma, Lei Wang, Wenhui Wang, Li Dong, Shaohan Huang, Huaijie Wang, Jilong Xue, Ruiping Wang, et al. BitNet: 1-bit pre-training for large language models. Journal of Machine Learning Research, 26(125):1–29, 2025

  47. [47]

    MiniLM: Deep self-attention distillation for task-agnostic compression of pre-trained transformers

    Wenhui Wang, Furu Wei, Li Dong, Hangbo Bao, Nan Yang, and Ming Zhou. MiniLM: Deep self-attention distillation for task-agnostic compression of pre-trained transformers. InAdvances in Neural Information Processing Systems 33, pages 5776–5788, 2020

  48. [48]

    Rethinking kullback-leibler divergence in knowledge distillation for large language models

    Taiqiang Wu, Chaofan Tao, Jiahao Wang, Runming Yang, Zhe Zhao, and Ngai Wong. Rethinking kullback-leibler divergence in knowledge distillation for large language models. InProceedings of the 31st International Conference on Computational Linguistics, pages 5737–5755, 2025

  49. [49]

    SmoothQuant: Accurate and efficient post-training quantization for large language models

    Guangxuan Xiao, Ji Lin, Mickael Seznec, Hao Wu, Julien Demouth, and Song Han. SmoothQuant: Accurate and efficient post-training quantization for large language models. InProceedings of the 40th International Conference on Machine Learning, pages 38087–38099, 2023

  50. [50]

    Qwen2.5-Omni Technical Report

    Jin Xu, Zhifang Guo, Jinzheng He, Hangrui Hu, Ting He, Shuai Bai, Keqin Chen, Jialin Wang, Yang Fan, Kai Dang, et al. Qwen2.5-omni technical report.arXiv preprint arXiv:2503.20215, 2025

  51. [51]

    OneBit: Towards extremely low-bit large language models

    Yuzhuang Xu, Xu Han, Zonghan Yang, Shuo Wang, Qingfu Zhu, Zhiyuan Liu, Weidong Liu, and Wanxiang Che. OneBit: Towards extremely low-bit large language models. InAdvances in Neural Information Processing Systems 37, pages 66357–66382, 2024

  52. [52]

    Qwen3 Technical Report

    An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, et al. Qwen3 technical report.arXiv preprint arXiv:2505.09388, 2025

  53. [53]

    HellaSwag: Can a machine really finish your sentence?

    Rowan Zellers, Ari Holtzman, Yonatan Bisk, Ali Farhadi, and Yejin Choi. HellaSwag: Can a machine really finish your sentence? In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 4791–4800, 2019

  54. [54]

    ABQ-LLM: Arbitrary-bit quantized inference acceleration for large language models

    Chao Zeng, Songwei Liu, Yusheng Xie, Hong Liu, Xiaojian Wang, Miao Wei, Shu Yang, Fangmin Chen, and Xing Mei. ABQ-LLM: Arbitrary-bit quantized inference acceleration for large language models. In Proceedings of the 39th AAAI Conference on Artificial Intelligence, pages 22299–22307, 2025

  55. [55]

    LQER: Low-rank quantization error reconstruction for LLMs

    Cheng Zhang, Jianyi Cheng, George A Constantinides, and Yiren Zhao. LQER: Low-rank quantization error reconstruction for LLMs. In Proceedings of the 41st International Conference on Machine Learning, pages 58763–58779, 2024

  56. [56]

    1.4 Million Open-Source Distilled Reasoning Dataset to Empower Large Language Model Training

    Han Zhao, Haotian Wang, Yiping Peng, Sitong Zhao, Xiaoyu Tian, Shuaiting Chen, Yunjie Ji, and Xiangang Li. 1.4 million open-source distilled reasoning dataset to empower large language model training. arXiv preprint arXiv:2503.19633, 2025

  57. [57]

    A review on edge large language models: Design, execution, and applications

    Yue Zheng, Yuhao Chen, Bin Qian, Xiufang Shi, Yuanchao Shu, and Jiming Chen. A review on edge large language models: Design, execution, and applications. ACM Computing Surveys, 57(8):1–35, 2025

  58. [58]

    Instruction-Following Evaluation for Large Language Models

    Jeffrey Zhou, Tianjian Lu, Swaroop Mishra, Siddhartha Brahma, Sujoy Basu, Yi Luan, Denny Zhou, and Le Hou. Instruction-following evaluation for large language models.arXiv preprint arXiv:2311.07911, 2023

  59. [59]

    MLVU: Benchmarking multi-task long video understanding

    Junjie Zhou, Yan Shu, Bo Zhao, Boya Wu, Zhengyang Liang, Shitao Xiao, Minghao Qin, Xi Yang, Yongping Xiong, Bo Zhang, et al. MLVU: Benchmarking multi-task long video understanding. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 13691– 13701, 2025

  60. [60]

    A survey on model compression for large language models

    Xunyu Zhu, Jian Li, Yong Liu, Can Ma, and Weiping Wang. A survey on model compression for large language models. Transactions of the Association for Computational Linguistics, 12:1556–1577, 2024