EdgeRazor: A Lightweight Framework for Large Language Models via Mixed-Precision Quantization-Aware Distillation
Recognition: 2 Lean theorem links
Pith reviewed 2026-05-10 18:00 UTC · model grok-4.3
The pith
EdgeRazor compresses large language models to 1.88 bits while beating all 3-bit methods and using 4-10 times less training compute.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
EdgeRazor's 1.88-bit mixed-precision quantization surpasses all 3-bit contenders and outperforms leading 2-bit post-training quantization methods by 11.3 points, while requiring a 4-10 times lower training budget than leading quantization-aware training. The 1.58-bit Qwen3-0.6B model reduces storage from 1.41 GB to 0.28 GB and accelerates decoding by 15.1 times relative to the 16-bit baseline.
What carries the argument
Adaptive Feature Distillation that derives an n-bit student from its 16-bit teacher, combined with Entropy-Aware KL Divergence whose forward-reverse balance is determined solely by the teacher's output distribution.
If this is right
- EdgeRazor achieves higher compression ratios than existing PTQ and QAT methods at every tested bit width.
- The 1.88-bit models maintain superior performance to 3-bit methods across base, instruction-tuned, and multimodal LLMs.
- Training budgets drop by 4-10 times compared with leading quantization-aware training approaches.
- Storage for a 0.6B model falls from 1.41 GB to 0.28 GB while decoding speeds increase by 15.1 times.
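The storage and speedup bullets above invite a quick sanity check. A back-of-envelope calculation (the 0.6e9 parameter count and the attribution of the remaining footprint to higher-precision components are my assumptions, not figures from the paper):

```python
# Sanity-check the reported Qwen3-0.6B numbers. The parameter count (0.6e9)
# and the assumption that embeddings, scales, and metadata stay at higher
# precision are mine, not the paper's.
params = 0.6e9
fp16_gb = params * 16 / 8 / 1e9          # ~1.2 GB of raw 16-bit weights
compression = 1.41 / 0.28                # reported storage ratio, ~5.0x
avg_bits = 0.28e9 * 8 / params           # ~3.7 effective bits per parameter

# The ~3.7 average bits (vs the nominal 1.58) suggests higher-precision
# components account for a sizable share of the 0.28 GB footprint.
```

The gap between nominal and effective bits per parameter is the usual reason extreme-low-bit claims and on-disk sizes do not match one-for-one.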
Where Pith is reading between the lines
- The entropy-balancing idea could extend to other distillation settings to cut hyperparameter search in non-quantization tasks.
- If the method works without per-model tuning, it opens the door to automated low-bit pipelines for rapidly emerging LLM families.
- Further tests on sub-1-bit regimes or on-device inference hardware would show where the compression-accuracy tradeoff breaks.
Load-bearing premise
The entropy-aware KL divergence, whose forward-reverse balance is set solely by the teacher's output distribution, provides stable and generalizable training signals across different model families and datasets without requiring per-model hyperparameter search or additional validation sets.
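This premise can be made concrete. A minimal sketch of a teacher-entropy-driven forward/reverse KL blend, assuming a normalized-entropy weighting (one plausible reading; the paper's exact rule is not given here):

```python
import math

def entropy_weighted_kl(teacher_probs, student_probs, eps=1e-12):
    """Blend forward and reverse KL using only the teacher's entropy.

    Hypothetical weighting: normalized teacher entropy H / log(V) drives
    the mix -- high-entropy (uncertain) teacher tokens lean on forward KL
    (mode-covering), low-entropy tokens lean on reverse KL (mode-seeking).
    EdgeRazor's actual rule is not specified here; this is one reading.
    """
    V = len(teacher_probs)
    H = -sum(p * math.log(p + eps) for p in teacher_probs)
    w = H / math.log(V)  # normalized entropy in [0, 1]

    forward = sum(p * math.log((p + eps) / (q + eps))
                  for p, q in zip(teacher_probs, student_probs))
    reverse = sum(q * math.log((q + eps) / (p + eps))
                  for p, q in zip(teacher_probs, student_probs))
    return w * forward + (1.0 - w) * reverse

# A peaked teacher distribution weights the reverse (mode-seeking) term more.
loss = entropy_weighted_kl([0.90, 0.05, 0.05], [0.70, 0.20, 0.10])
```

Because the weight depends only on the teacher's distribution, it introduces no tunable hyperparameter, which is exactly what the premise requires.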
What would settle it
Apply EdgeRazor without any per-model adjustments to a new LLM family such as Llama-3 variants and measure whether the resulting 1.88-bit accuracy falls below that of standard 3-bit post-training quantization methods on the same benchmarks.
Original abstract
Recent years have witnessed increasing interest in deploying LLMs on resource-constrained devices, among which quantization has emerged as a promising lightweight technique that converts full-precision model weights and activations into lower-bit formats. Existing weight quantization approaches can be roughly divided into three categories: Post-Training Quantization (PTQ), which calibrates quantized parameters on a small dataset without retraining but suffers severe performance degradation below 4-bit; Quantization-Aware Training (QAT), which searches low-bit parameters using surrogate gradients but demands substantial computational resources; and Quantization-Aware Distillation, which integrates QAT with knowledge transfer from a full-precision teacher but manually selects features to distill and relies heavily on teacher-specific data. In this paper, we propose EdgeRazor, a lightweight framework for LLMs with mixed-precision and extremely low-bit weight quantization. The EdgeRazor framework contains three modules: Mixed-Precision Quantization-Aware Distillation for fine-grained control of precision, Adaptive Feature Distillation that derives an $n$-bit student from its 16-bit teacher, and Entropy-Aware KL Divergence on both human-annotated and distilled datasets, whose forward-reverse balance is determined solely by the teacher's output distribution. Empirical investigations of EdgeRazor are conducted on base, instruction-tuned, and multimodal LLMs. Notably, EdgeRazor at 1.88-bit surpasses all contenders at 3-bit precision and outperforms the leading 2-bit PTQ methods by 11.3 points, within a 4-10$\times$ lower training budget than the leading QAT approach. EdgeRazor delivers higher compression ratios at all bit widths; the 1.58-bit Qwen3-0.6B reduces storage from 1.41 GB to 0.28 GB while accelerating decoding by 15.1$\times$ relative to the 16-bit baseline.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes EdgeRazor, a lightweight framework for mixed-precision and extremely low-bit weight quantization of LLMs via quantization-aware distillation. It consists of three modules: Mixed-Precision Quantization-Aware Distillation for fine-grained precision control, Adaptive Feature Distillation to derive an n-bit student from a 16-bit teacher, and Entropy-Aware KL Divergence (applied to both human-annotated and distilled data) whose forward-reverse weighting is set solely by the teacher's output distribution. Evaluations on base, instruction-tuned, and multimodal LLMs claim that the 1.88-bit variant surpasses all 3-bit contenders and outperforms leading 2-bit PTQ methods by 11.3 points, while using 4-10× lower training budget than leading QAT; additional claims include higher compression ratios and up to 15.1× decoding speedup (e.g., 1.58-bit Qwen3-0.6B reducing storage from 1.41 GB to 0.28 GB).
Significance. If the performance and efficiency claims hold, the work could meaningfully advance practical deployment of LLMs on edge devices by combining the low overhead of PTQ with the accuracy recovery of QAT and distillation, while introducing an entropy-driven loss that avoids manual feature selection. The reported gains at sub-2-bit precision with reduced training cost would be notable if the entropy-aware mechanism proves robust without per-model tuning. The framework's applicability to multimodal models is a secondary strength.
major comments (2)
- [Experiments] Experiments section: The headline result that 1.88-bit EdgeRazor outperforms leading 2-bit PTQ methods by 11.3 points (and all 3-bit methods) must be accompanied by explicit baseline re-implementations, identical calibration data, and statistical significance (multiple seeds or error bars); without these, the cross-method comparison cannot be evaluated as load-bearing evidence for the framework's superiority.
- [Entropy-Aware KL Divergence] Entropy-Aware KL Divergence module: The central claim that the forward-reverse balance is determined solely by the teacher's softmax entropy (requiring no per-model hyperparameter search or extra validation sets) is load-bearing for the 'lightweight' and 'generalizable' framing; this requires explicit ablations across model families (e.g., Llama vs. Qwen) and task distributions showing stable performance without dataset-specific calibration.
minor comments (2)
- [Abstract] Abstract: Concrete numerical claims (11.3 points, 4-10× budget, 15.1× speedup) are presented without naming the exact models, datasets, or task metrics used; adding one sentence with these details would improve readability.
- [Adaptive Feature Distillation] Notation: The description of 'Adaptive Feature Distillation that derives an n-bit student' uses n without defining its range or selection criterion in the main text; a brief equation or table entry would clarify.
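For readers unsure what "an n-bit student" entails, a generic symmetric weight quantizer illustrates the kind of rounding the student's weights pass through during training (this is standard machinery, not EdgeRazor's specific scheme):

```python
def quantize_weights(weights, n_bits):
    """Symmetric per-tensor n-bit quantization (generic illustration,
    not EdgeRazor's actual quantizer). Maps floats onto a grid of
    2**n_bits signed integer levels and back to float."""
    qmax = 2 ** (n_bits - 1) - 1                 # e.g. 3 for 3-bit
    scale = max(abs(w) for w in weights) / qmax or 1.0
    q = [max(-qmax - 1, min(qmax, round(w / scale))) for w in weights]
    return [v * scale for v in q], scale

dequant, scale = quantize_weights([0.31, -0.12, 0.07, -0.29], n_bits=3)
```

Fractional rates such as 1.58-bit correspond to ternary {-1, 0, +1} coding (log2(3) ≈ 1.58), which this integer-grid sketch does not capture; QAT methods additionally pass gradients through the non-differentiable rounding via a straight-through estimator.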
Simulated Author's Rebuttal
We thank the referee for the thoughtful and constructive feedback. The two major comments highlight important aspects of experimental rigor and the robustness of our proposed Entropy-Aware KL Divergence. We address each point below and will incorporate the suggested improvements in the revised manuscript.
Point-by-point responses
-
Referee: [Experiments] Experiments section: The headline result that 1.88-bit EdgeRazor outperforms leading 2-bit PTQ methods by 11.3 points (and all 3-bit methods) must be accompanied by explicit baseline re-implementations, identical calibration data, and statistical significance (multiple seeds or error bars); without these, the cross-method comparison cannot be evaluated as load-bearing evidence for the framework's superiority.
Authors: We agree that direct re-implementations with matched conditions are necessary to make the 11.3-point claim fully load-bearing. The original manuscript relied on publicly reported numbers from the respective baseline papers together with standard calibration sets. In the revision we will re-implement the leading 2-bit PTQ baselines using exactly the same calibration data employed for EdgeRazor, run all methods with at least three random seeds, and report means with standard deviations or error bars in the Experiments section. This will be added as a new comparative table. revision: yes
-
Referee: [Entropy-Aware KL Divergence] Entropy-Aware KL Divergence module: The central claim that the forward-reverse balance is determined solely by the teacher's softmax entropy (requiring no per-model hyperparameter search or extra validation sets) is load-bearing for the 'lightweight' and 'generalizable' framing; this requires explicit ablations across model families (e.g., Llama vs. Qwen) and task distributions showing stable performance without dataset-specific calibration.
Authors: We concur that dedicated ablations are required to substantiate the claim of no per-model tuning. While the manuscript already evaluates EdgeRazor on Llama, Qwen, and multimodal families, we will add a focused ablation subsection in the revision. It will compare entropy-aware weighting against fixed-weight KL variants on Llama-2 and Qwen models across language-modeling, instruction-following, and multimodal tasks, confirming that performance remains stable without dataset-specific calibration or extra validation sets. Results will appear in a new table. revision: yes
Circularity Check
No significant circularity in derivation chain
full rationale
The paper's core contributions—Mixed-Precision Quantization-Aware Distillation, Adaptive Feature Distillation, and Entropy-Aware KL Divergence with balance set by the teacher's output distribution—are presented as algorithmic modules that take external teacher outputs and standard distillation losses as inputs. No equations or performance claims are shown to reduce by construction to fitted parameters that directly encode the reported accuracy gains or compression ratios. The framework relies on independent teacher models and conventional objectives rather than self-referential definitions or self-citation load-bearing uniqueness theorems. This leaves the empirical results as falsifiable outcomes of the proposed training procedure rather than tautological restatements of the inputs.
Axiom & Free-Parameter Ledger
axioms (1)
- [domain assumption] The teacher's output distribution alone suffices to determine the forward-reverse KL balance without additional validation data or model-specific tuning.
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · tagged unclear
  The relation between the paper passage and the cited Recognition theorem is unclear.
  Paper passage: "Entropy-Aware KL Divergence (EAKLD) ... whose forward-reverse balance is determined solely by the entropy of the teacher's output distribution"
- IndisputableMonolith/Foundation/AlphaCoordinateFixation.lean · alpha_pin_under_high_calibration · tagged unclear
  The relation between the paper passage and the cited Recognition theorem is unclear.
  Paper passage: "Mixed-Precision Quantization-Aware Distillation ... super-group assignment ... ρ proportion of rows assigned to 4-bit"
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.