
arxiv: 2606.05429 · v1 · pith:RF6OS6SWnew · submitted 2026-06-03 · 💻 cs.AI

Minimizing the Hidden Cost of Scales: Graph-Guided Ultra-Low-Bit Quantization for Large Language Models

Pith reviewed 2026-06-28 06:07 UTC · model grok-4.3

classification 💻 cs.AI
keywords post-training quantization · ultra-low-bit quantization · large language models · weight saliency · graph-guided grouping · scaling overhead · LLaMA models

The pith

SAGE-PTQ cuts average LLM weight bits to 1.03 and scaling bits to 0.004 by separating salient weights via distributional statistics and modeling unsalient weights as a sparse graph to set group counts.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces SAGE-PTQ as a post-training quantization method that first uses distributional statistics to split weights into salient and unsalient subsets. It then represents the unsalient weights as a sparse graph to determine the best number of groups per layer, applies multi-bit precision only to salient weights and binarization to the rest, and uses one scale per channel for salient weights plus one scalar per unsalient group. Adaptive saliency thresholding chooses the split ratio per matrix. This combination is shown to reach the stated bit rates while delivering lower perplexity on LLaMA models than prior ultra-low-bit methods and lower memory use during inference.
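The dual-mode step described above can be pictured with a minimal sketch. It assumes a precomputed saliency mask, a symmetric uniform quantizer for the salient weights, and a single unsalient group per row; none of these details are confirmed by the summary, and the function names (`binarize_group`, `quantize_uniform`, `dual_mode_quantize`) are hypothetical.

```python
from statistics import mean


def binarize_group(weights):
    """Binarize a group to {-alpha, +alpha} using one shared scalar alpha.

    alpha = mean(|w|) minimizes squared error for sign-based binarization.
    """
    alpha = mean(abs(w) for w in weights)
    return [alpha if w >= 0 else -alpha for w in weights], alpha


def quantize_uniform(weights, bits):
    """Symmetric uniform quantization with a single scale (assumed scheme)."""
    levels = 2 ** (bits - 1) - 1  # e.g. 3 bits -> integer range [-3, 3]
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / levels if max_abs > 0 else 1.0
    return [round(w / scale) * scale for w in weights], scale


def dual_mode_quantize(row, salient_mask, salient_bits=3):
    """Quantize one row: multi-bit for salient weights, 1-bit for the rest."""
    salient = [w for w, s in zip(row, salient_mask) if s]
    unsalient = [w for w, s in zip(row, salient_mask) if not s]
    q_sal = quantize_uniform(salient, salient_bits)[0] if salient else []
    q_uns = binarize_group(unsalient)[0] if unsalient else []
    it_sal, it_uns = iter(q_sal), iter(q_uns)
    # Re-interleave so the output keeps the original weight order.
    return [next(it_sal) if s else next(it_uns) for s in salient_mask]


row = [0.02, -0.01, 0.90, 0.03, -0.04, -1.20]
mask = [False, False, True, False, False, True]
print(dual_mode_quantize(row, mask))
```

In this toy row the two large weights keep a 3-bit grid while the four small ones collapse to a single magnitude with their signs preserved, which is the trade the method relies on.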

Core claim

SAGE-PTQ separates salient and unsalient weights using distributional statistics and models subsampled unsalient weights as a sparse graph to estimate the optimal number of groups per layer. It applies dual-mode quantization (multi-bit precision on salient weights, binarization on unsalient weights), stores one per-channel scale for salient weights and one scalar per unsalient group, and chooses the saliency ratio by adaptive thresholding. The reported result is an average of 1.03 weight bits and 0.004 scaling bits per matrix. On LLaMA-3-8B, WikiText2 perplexity is 6.74 versus 55.8 for BiLLM, at under 50 percent of BiLLM's GPU memory; on LLaMA-2-70B, decoding is 1.5x faster.
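To make the two bit-rate figures concrete, the following sketch shows how average weight bits and scaling bits per weight could be tallied for one matrix. Every number in it is a hypothetical setting (a 4096 x 4096 matrix, 1.5% salient weights at 3 bits, 16-bit scales, 8 unsalient groups); the paper's actual ratios, bit widths, and scale precision are not given in this summary, so the fact that the output lands near the reported averages is by construction of the example only.

```python
def average_bits(n_weights, salient_fraction, salient_bits, n_scales, scale_bits=16):
    """Return (weight bits, scaling bits) averaged over all weights in a matrix.

    Ignores any storage needed to mark which weights are salient.
    """
    # Salient weights use salient_bits each; unsalient weights use 1 bit each.
    weight_bits = salient_fraction * salient_bits + (1 - salient_fraction) * 1
    # Every stored scale costs scale_bits, amortized over the whole matrix.
    scaling_bits = n_scales * scale_bits / n_weights
    return weight_bits, scaling_bits


rows = cols = 4096                # hypothetical layer shape
n_scales = rows + 8               # one scale per channel plus 8 group scalars
weight_bits, scaling_bits = average_bits(rows * cols, 0.015, 3, n_scales)
print(f"weight bits: {weight_bits:.3f}, scaling bits: {scaling_bits:.4f}")
```

The point of the arithmetic is that scaling overhead shrinks with the number of stored scales, so keeping the unsalient group count small is what keeps the second figure low.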

What carries the argument

Saliency-aware graph-guided group estimation that models unsalient weights as a sparse graph to determine per-layer group counts while keeping one scalar scale per group.

If this is right

  • Models quantized with SAGE-PTQ fit in less than half the GPU memory of BiLLM while maintaining usable perplexity.
  • Decoding speed on a single NVIDIA L40 GPU increases by 1.5x for LLaMA-2-70B compared with prior ultra-low-bit methods.
  • The per-matrix adaptive threshold removes the need for manual saliency ratio tuning across layers.
  • Only one scale factor per unsalient group replaces the multiple scales required by earlier group-wise methods.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The graph-modeling step could be replaced by a cheaper heuristic if the optimal group count correlates strongly with simple statistics such as weight variance.
  • Extending the same separation logic to activation quantization might further reduce total inference cost without retraining.
  • The reported memory savings suggest the method could enable running 70B models on consumer GPUs that previously handled only 7B models.

Load-bearing premise

That distributional statistics can reliably separate salient from unsalient weights, and that a sparse graph built on the unsalient subset yields group counts that preserve final model quality.
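As an illustration of what a statistics-based split might look like, the sketch below flags weights whose deviation from the mean exceeds k standard deviations. The summary does not say which statistic the paper uses, so the z-score rule, the threshold `k`, and the name `split_by_zscore` are stand-in assumptions rather than the authors' criterion.

```python
from statistics import mean, pstdev


def split_by_zscore(weights, k=2.0):
    """Return a mask that is True where |w - mean| > k * std (assumed rule)."""
    mu = mean(weights)
    sigma = pstdev(weights)
    if sigma == 0:
        # All weights identical: nothing stands out as salient.
        return [False] * len(weights)
    return [abs(w - mu) > k * sigma for w in weights]


weights = [0.01, -0.02, 0.03, -0.01, 0.02, 0.00, -0.03, 0.01, 1.50, -0.02]
mask = split_by_zscore(weights)
print(mask)
```

Only the single large weight is flagged here. The premise is at risk when real weight distributions are heavy-tailed or multi-modal, since a rule like this would then mislabel a meaningful share of weights.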

What would settle it

Running SAGE-PTQ on LLaMA-3-8B and measuring WikiText2 perplexity above 10 or average scaling bits above 0.01 would show the claimed bit rates and quality cannot be reached simultaneously.
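The test above can be written as a small check. Perplexity is the exponential of the mean negative log-likelihood per token; the two thresholds (10 and 0.01) come from the statement above, while the function names and the toy log-probabilities are assumptions for illustration.

```python
import math


def perplexity(token_log_probs):
    """exp of the mean negative log-likelihood (natural log) per token."""
    return math.exp(-sum(token_log_probs) / len(token_log_probs))


def claim_refuted(ppl, scaling_bits, ppl_limit=10.0, scaling_limit=0.01):
    """True if either measured value crosses its threshold."""
    return ppl > ppl_limit or scaling_bits > scaling_limit


# Toy sequence: every token assigned probability 0.25.
toy_log_probs = [math.log(0.25)] * 4
print(perplexity(toy_log_probs))
print(claim_refuted(6.74, 0.004))   # the values reported for SAGE-PTQ
```

A real evaluation would obtain the per-token log-probabilities from the quantized model on the WikiText2 test split.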

Figures

Figures reproduced from arXiv: 2606.05429 by Amir Hussein, Dinesh Manocha, Min Wu, Rayyan Abdalla.

Figure 1: Illustration of the SAGE-PTQ quantization pipeline. The framework input is the weight …
Figure 2: Memory footprint of SAGE-PTQ versus BiLLM
Figure 3: Ablation results of SAGE-PTQ on OPT 6.7B, LLaMA 7B, and DeepSeek 7B. Full pipeline achieves best perplexity.
[Plot: perplexity on WikiText2 and on C4 versus OPT model size (1.3B to 66B), for NLUT = 3, 4, 5, 6]
Figure 5: Distribution of self-attention projection layer weights in OPT-6.7B across selected layers
Figure 6: Distribution of self-attention projection layer weights in LLaMA-1 7B across Self-Attention …
Figure 7: Self-Attention Weight Distributions in DeepSeek-7B of selected layers (0,15,29): Non …
Figure 8: Extended evaluation of SAGE-PTQ versus BiLLM under a lookup table constraint of NLUT = 4 bits across multiple model families (OPT, LLaMA-1, LLaMA-2, and instruction-tuned Vicuna) ranging from 13B to 70B parameters. The bar plots report perplexity on WikiText2, PTB, and C4 datasets. SAGE-PTQ consistently outperforms BiLLM across all models, achieving an average precision of 1.03–1.07 bits per weight.
Figure 9: SAGE-PTQ Ablation study on saliency metrics across three model families (OPT-6.7B, …
Figure 10: Perplexity impact of varying lookup table (LUT) sizes in …

Abstract

Post-training quantization (PTQ) is critical for the efficient deployment of large language models (LLMs). Recent ultra-low-bit PTQ methods rely on rigid weight-saliency assumptions or position heuristics, introducing substantial hidden scaling overhead. We propose SAGE-PTQ (Saliency-Aware Graph-guided Efficient PTQ), a novel ultra-low-bit quantization framework for LLMs that minimizes hidden scaling cost. SAGE-PTQ separates salient and unsalient weights using distributional statistics, then models subsampled unsalient weights as a sparse graph to estimate the optimal number of groups per layer. SAGE-PTQ applies dual-mode quantization, assigning multi-bit precision to salient weights and binarizing unsalient weights. To reduce scaling overhead, SAGE-PTQ uses one per-channel scale for salient weights and one scalar per unsalient group. Finally, SAGE-PTQ implements adaptive saliency thresholding to select the optimal saliency ratio per matrix. SAGE-PTQ achieves 1.03 weight bits and only 0.004 scaling bits per matrix on average, outperforming state-of-the-art methods such as BiLLM and PB-LLM. On LLaMA-3-8B, SAGE-PTQ achieves 6.74 WikiText2 perplexity, compared to 55.8 for BiLLM, while using less than 50% of BiLLM's GPU memory. On LLaMA-2-70B, SAGE-PTQ provides 1.5x faster decoding on one NVIDIA L40 GPU, demonstrating practical inference efficiency.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper introduces SAGE-PTQ, a post-training quantization framework for LLMs that separates salient and unsalient weights using distributional statistics, models the unsalient subset as a sparse graph to estimate per-layer group counts, applies dual-mode quantization (multi-bit for salient weights, binary for unsalient), and reduces scaling overhead via one per-channel scale for salient weights and one scalar per unsalient group, plus adaptive saliency thresholding. It reports average 1.03 weight bits and 0.004 scaling bits per matrix, with LLaMA-3-8B achieving 6.74 WikiText2 perplexity (vs. 55.8 for BiLLM) and efficiency gains on LLaMA-2-70B.

Significance. If the empirical results and underlying assumptions hold under rigorous validation, the work would be significant for ultra-low-bit LLM quantization by directly targeting hidden scaling costs, a practical bottleneck in prior methods like BiLLM. The graph-guided group estimation offers a potentially parameter-light way to control overhead while preserving quality, which could influence deployment strategies if the separation and estimation steps prove robust across models.

major comments (2)
  1. [Experimental Results] Experimental Results section: The headline claims (1.03 weight bits, 0.004 scaling bits, 6.74 WikiText2 PPL on LLaMA-3-8B vs. 55.8 for BiLLM) are presented without error bars, run counts, or explicit baseline re-implementation details and ablation tables; this directly affects verifiability of the dual-mode scheme's quality preservation at the reported bit rates.
  2. [Method] Method section on sparse-graph group estimation: The central claim that distributional statistics cleanly partition weights and that the sparse-graph model on subsampled unsalient weights yields an accurate per-layer group count (enabling the 0.004 scaling-bit overhead) lacks any comparison to exhaustive search or ablation showing that the derived group numbers match those that would minimize quantization error without quality degradation.
minor comments (1)
  1. [Method] The description of adaptive saliency thresholding would benefit from an explicit equation or pseudocode for how the optimal saliency ratio per matrix is selected, as the current prose leaves the decision criterion implicit.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on improving the verifiability of our empirical claims and the validation of the sparse-graph group estimation. We provide point-by-point responses to the major comments below, indicating where revisions will be made.

  1. Referee: [Experimental Results] Experimental Results section: The headline claims (1.03 weight bits, 0.004 scaling bits, 6.74 WikiText2 PPL on LLaMA-3-8B vs. 55.8 for BiLLM) are presented without error bars, run counts, or explicit baseline re-implementation details and ablation tables; this directly affects verifiability of the dual-mode scheme's quality preservation at the reported bit rates.

    Authors: Post-training quantization is a deterministic process once calibration data and hyperparameters are fixed, which is standard in the field and explains the absence of error bars or multiple run counts. We agree, however, that explicit details on baseline re-implementations and additional ablation tables would strengthen verifiability. In the revised manuscript we will add a dedicated subsection describing how BiLLM and PB-LLM were re-implemented (including calibration settings and code availability) and include new ablation tables on the dual-mode quantization components. The reported numbers remain unchanged as they were obtained under the documented protocol. revision: partial

  2. Referee: [Method] Method section on sparse-graph group estimation: The central claim that distributional statistics cleanly partition weights and that the sparse-graph model on subsampled unsalient weights yields an accurate per-layer group count (enabling the 0.004 scaling-bit overhead) lacks any comparison to exhaustive search or ablation showing that the derived group numbers match those that would minimize quantization error without quality degradation.

    Authors: Exhaustive search over per-layer group counts is computationally prohibitive for models at the scale of LLaMA-70B. The sparse-graph model is presented as an efficient, structure-aware approximation rather than an exact optimizer. To address the concern we will add an ablation study on smaller models (LLaMA-7B and LLaMA-13B) that compares the group counts produced by the graph estimator against those obtained via grid search minimizing per-layer quantization error. The results of this comparison, together with the corresponding perplexity impact, will be reported in the revised Method and Experiments sections. revision: yes
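A grid-search baseline of the kind promised above could look roughly like the following sketch. It is not the graph estimator: it assumes groups are contiguous chunks of the magnitude-sorted unsalient weights, each binarized with its own scalar, and it picks the smallest group count whose error is within a tolerance of the best one. The names `binarization_error` and `grid_search_groups` and the tolerance rule are hypothetical.

```python
from statistics import mean


def binarization_error(weights, n_groups):
    """Squared error when weights are binarized in at most n_groups chunks.

    Weights are sorted by magnitude and cut into contiguous chunks; each
    chunk uses alpha = mean(|w|), so the per-weight error is (|w| - alpha)^2.
    Assumes a non-empty weight list.
    """
    mags = sorted(abs(w) for w in weights)
    size = -(-len(mags) // n_groups)  # ceiling division
    error = 0.0
    for start in range(0, len(mags), size):
        chunk = mags[start:start + size]
        alpha = mean(chunk)
        error += sum((m - alpha) ** 2 for m in chunk)
    return error


def grid_search_groups(weights, candidates, tol=0.05):
    """Smallest k whose error is within tol * error(smallest k) of the best."""
    errors = {k: binarization_error(weights, k) for k in candidates}
    best = min(errors.values())
    baseline = errors[min(candidates)]
    acceptable = [k for k in candidates if errors[k] - best <= tol * baseline]
    return min(acceptable), errors


unsalient = [0.1, -0.1, 0.1, -0.5, 0.5, 0.5]
k, errors = grid_search_groups(unsalient, range(1, 4))
print(k, errors)
```

Comparing the `k` returned here with the graph estimator's count on the same layer, and then checking perplexity at both settings, is the comparison the referee asks for.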

Circularity Check

0 steps flagged

No circularity; method is an independent algorithmic construction

full rationale

The provided abstract and description contain no equations, derivations, or self-citations that reduce any claimed result to its own inputs by construction. Separation of salient/unsalient weights via distributional statistics, sparse-graph group estimation, dual-mode quantization, and adaptive thresholding are presented as procedural choices whose outputs are validated empirically on perplexity and bit-rate metrics. No fitted parameter is relabeled as a prediction, no uniqueness theorem is imported from prior self-work, and no ansatz is smuggled via citation. The derivation chain is therefore self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Only the abstract is available; no explicit free parameters, axioms, or invented entities can be extracted beyond the high-level algorithmic choices described.

pith-pipeline@v0.9.1-grok · 5824 in / 1256 out tokens · 26148 ms · 2026-06-28T06:07:32.916571+00:00 · methodology

