
arxiv: 2606.23419 · v1 · pith:DCUXCCTVnew · submitted 2026-06-22 · 💻 cs.LG · cs.AI

GRINQH: Graded Input-based Quantization Hierarchy for Efficient LLM Generation

Pith reviewed 2026-06-26 08:36 UTC · model grok-4.3

classification 💻 cs.LG cs.AI
keywords LLM quantization · post-training quantization · mixed-precision inference · efficient autoregressive decoding · weight-only quantization · GPU kernel optimization

The pith

GRINQH assigns LLM weight channels to precision levels using activation magnitudes as an importance proxy, allowing variable average bit widths that outperform fixed- and mixed-precision baselines at 3-4 bits and support effective 2-bit generation.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces GRINQH as a weight-only post-training quantization method that addresses the memory bandwidth bottleneck in LLM autoregressive decoding. It treats inference asymmetrically by focusing on the memory-bound decode phase and dynamically grades weight channels into a hierarchy of precisions based on input activation magnitudes. This unifies quantization with sparsification to achieve flexible average bit widths. Evaluated on Llama3 and Qwen3, the approach claims better quality than existing methods at comparable 3- and 4-bit settings while enabling usable 2-bit operation. A custom GPU kernel with a hierarchical nested memory layout is used to verify the theoretical speedups experimentally.
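To make the bandwidth argument concrete, here is a back-of-the-envelope sketch. It assumes decode time for one matrix-vector product is proportional to the bytes of weights read from DRAM, which is an idealization and not the paper's measured kernel behavior. The function names and the 16-bit baseline are illustrative choices, not taken from the paper.

```python
def weight_bytes(n_rows: int, n_cols: int, avg_bits: float) -> float:
    """Bytes of weight data read for one matrix-vector product."""
    if avg_bits <= 0:
        raise ValueError("avg_bits must be positive")
    return n_rows * n_cols * avg_bits / 8.0


def ideal_speedup(baseline_bits: float, avg_bits: float) -> float:
    """Upper bound on speedup if runtime were purely bandwidth-bound.

    Assumption: time scales linearly with bytes loaded; real kernels
    also pay for dequantization, thresholds, and scale metadata.
    """
    if avg_bits <= 0:
        raise ValueError("avg_bits must be positive")
    return baseline_bits / avg_bits
```

Under this idealization, a 4096 x 4096 layer at an average of 4 bits reads a quarter of the bytes a 16-bit layer does, so the bound on speedup is 4x. Measured gains will be lower once overheads are included.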

Core claim

GRINQH dynamically assigns weight channels to different precision levels in a graded hierarchy by using activation magnitudes to estimate computational importance, producing a single framework that improves generation quality over fixed- and mixed-precision baselines at 3- and 4-bit averages and permits effective 2-bit decoding on Llama3 and Qwen3 models.

What carries the argument

The GRaded INput-based Quantization Hierarchy, which uses activation magnitudes to route weight channels into multiple precision tiers within a unified quantization-sparsification scheme.
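A minimal sketch of the routing step, assuming the tiers {0, 2, 4, 6} named in the Figure 2 caption and a set of ascending magnitude thresholds. The threshold values and the function names `assign_bits` and `effective_bits` are hypothetical; the paper derives its thresholds from a calibration set.

```python
from bisect import bisect_right

BIT_LEVELS = (0, 2, 4, 6)  # tiers named in the Figure 2 caption


def assign_bits(activations, thresholds, levels=BIT_LEVELS):
    """Map each input channel's |activation| to a bit width.

    `thresholds` holds len(levels) - 1 ascending cut points (hypothetical
    values here). A magnitude below the first cut point gets levels[0].
    """
    if len(thresholds) != len(levels) - 1:
        raise ValueError("need one threshold fewer than levels")
    if list(thresholds) != sorted(thresholds):
        raise ValueError("thresholds must be ascending")
    return [levels[bisect_right(thresholds, abs(x))] for x in activations]


def effective_bits(bits):
    """Average bit width over all channels."""
    return sum(bits) / len(bits)
```

For example, `assign_bits([0.01, -0.3, 1.2, -5.0], [0.1, 1.0, 3.0])` places one channel in each tier, which averages to 3 bits. Because the assignment depends on the current input, the weight columns that are loaded at high precision can change from token to token.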

If this is right

  • The method produces a new Pareto-optimal trade-off curve between output quality and decode speed.
  • Effective 2-bit generation becomes feasible without the quality collapse seen in prior fixed-precision approaches.
  • A hierarchical nested memory layout in a custom kernel translates the variable bit widths into measured wall-clock speedups.
  • The same graded assignment principle can be applied at inference time without retraining.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • If the activation-magnitude proxy generalizes, the hierarchy could be recomputed on the fly for different prompts or tasks rather than fixed once per model.
  • The approach may combine with KV-cache compression to further reduce memory traffic in long-context settings.
  • Similar input-driven grading might apply to non-transformer architectures where activation statistics also track parameter importance.

Load-bearing premise

Activation magnitudes reliably indicate which weight channels matter most for generation quality when deciding their precision level.
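One way to see why the premise is at least plausible, sketched for a single linear layer. This is an editorial illustration, not a derivation taken from the paper; the symbols W, x, and Q_b (quantization to b bits, with Q_0 returning zero) are introduced here for the purpose.

```latex
y_j = \sum_i W_{ji}\, x_i, \qquad
\hat{y}_j = \sum_i Q_{b_i}(W_{ji})\, x_i
\quad\Longrightarrow\quad
\lvert y_j - \hat{y}_j \rvert
  \le \sum_i \lvert x_i \rvert \,\bigl\lvert W_{ji} - Q_{b_i}(W_{ji}) \bigr\rvert .
```

Each channel's contribution to the error bound is its weight quantization error scaled by the magnitude of the activation it multiplies, so spending bits on channels with large |x_i| shrinks the largest terms. The bound says nothing about how errors propagate through later layers, which is where the premise could fail.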

What would settle it

An experiment in which random or uniform channel-to-precision assignments at the same average bit width match or exceed GRINQH quality on Llama3 or Qwen3 would falsify the central claim.
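The shape of such a test can be sketched on a toy problem. Everything below is an assumption made for illustration: synthetic uniform weights, heavy-tailed synthetic activations, a simple symmetric round-to-nearest quantizer, and an equal split across the tiers {0, 2, 4, 6}. It shows how the comparison would be set up, not what the paper's models would show.

```python
import random


def quantize(w, bits, w_max=1.0):
    """Symmetric round-to-nearest; 0 bits drops the weight entirely.

    Only meant for bits in {0, 2, 4, 6}.
    """
    if bits == 0:
        return 0.0
    levels = 2 ** (bits - 1) - 1          # 1 for 2 bits, 7 for 4 bits
    scale = w_max / levels
    q = max(-levels, min(levels, round(w / scale)))
    return q * scale


def output_error(weights, x, bits):
    """|y - y_hat| for one output row with per-input-channel bit widths."""
    exact = sum(w * xi for w, xi in zip(weights, x))
    approx = sum(quantize(w, b) * xi for w, xi, b in zip(weights, x, bits))
    return abs(exact - approx)


def magnitude_assignment(x, budget):
    """Give the widest tiers in `budget` to the largest |x_i|."""
    order = sorted(range(len(x)), key=lambda i: abs(x[i]))
    bits = [0] * len(x)
    for i, b in zip(order, sorted(budget)):
        bits[i] = b
    return bits


def random_assignment(budget, rng):
    """Same multiset of bit widths, shuffled across channels."""
    bits = list(budget)
    rng.shuffle(bits)
    return bits


def compare(n=256, trials=200, seed=0):
    """Mean output error of magnitude-based vs. random assignment."""
    rng = random.Random(seed)
    quarter = n // 4
    budget = [0] * quarter + [2] * quarter + [4] * quarter + [6] * quarter
    err_mag = err_rand = 0.0
    for _ in range(trials):
        w = [rng.uniform(-1, 1) for _ in range(n)]
        x = [rng.gauss(0, 1) ** 3 for _ in range(n)]  # heavy-tailed toy input
        err_mag += output_error(w, x, magnitude_assignment(x, budget))
        err_rand += output_error(w, x, random_assignment(budget, rng))
    return err_mag / trials, err_rand / trials
```

Both assignments use exactly the same bit budget, so any difference in error comes from where the bits are placed. A real version of the test would replace the toy row with full model layers and the error with downstream metrics such as perplexity or GSM8K accuracy.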

Figures

Figures reproduced from arXiv: 2606.23419 by Catherine M. Schöfmann, Emre Neftci, Jan Finkbeiner, Jette Oberländer.

Figure 1: Pareto frontier of GSM8K accuracy vs. VMM runtime for Qwen3-8B on an RTX 4090. GPTQ, AWQ, RTN assume MARLIN execution. The primary operational focus for Large Language Models (LLMs) has shifted from training to efficient inference at scale. Despite advances in hardware throughput, inference remains constrained by the memory wall: a fundamental performance bottleneck where off-chip DRAM transfer speeds ca…
Figure 2: Overview of GRINQH. During decoding, GRINQH mitigates the memory bandwidth bottleneck through a dynamic, channel-wise precision loading scheme. 1) Precision Assignment: Input activations x_i are mapped to bit widths b_i ∈ {0, 2, 4, 6} based on their magnitude via precomputed thresholds from a calibration set. 2) Bit-Stacked Storage: Weights are stored in DRAM using a bit-planar format. Each b_max = 6-bit wei…
Figure 3: GRINQH redefines the quantization Pareto frontier across model families and scales. Dashed lines indicate BF16 baselines. GRINQH precision distribution sweep is compared against iso-bit symmetric RTN, GPTQ, and AWQ baselines. (A) WikiText-2 perplexity vs. effective bit width for the Llama3 Instruct family using b_max ∈ {6, 8}. (B) GSM8K CoT accuracy vs. effective bit width for the Qwen3 family. (Left panel)…
Figure 4: Normalized isolated kernel runtimes over a range of target effective bit widths. Times are normalized w.r.t. the 4-bit Marlin kernel on the same device. Representational Fidelity in Decoupled Inference. We evaluate GRINQH in a realistic inference setting where prefill and decoding are treated separately. Since the prefill stage is primarily compute-bound, loading weights at b_max introduces negligible latency o…
Figure 5: End-to-End performance scaling and task evaluation.
Figure 6: (A) Sensitivity analysis of the 0-bit (sparsity) fraction (p_0) for specific bit-width windows (Top: 2.0–3.0 bits; Bottom: 3.7–4.3 bits). Data points indicate the relative GSM8K performance change across varying sparsity levels with respect to the average performance in that window. Results show that high sparsity benefits low-bit regimes but penalizes higher bit regimes, suggesting that optimal p_0 configur…
Figure 7: Correlation between calibration and downstream perplexity. Each data point represents a unique precision distribution configuration P sampled during our hyperparameter sweep. Calibration PPL is computed on The Pile (Uncopyrighted), while downstream PPL is evaluated on Wikitext-2. The strong linear correlation validates calibration PPL as a reliable and computationally efficient proxy metric for performance…
Figure 8: Sensitivity of calibration perplexity to precision fractions p_i. Each panel illustrates the impact of a specific bit-width allocation on Llama3-1B performance, constrained to a target effective bit width b* = 4.0 ± 0.1 (b_max = 8). We observe divergent scaling behaviors across the precision levels: boundary bit widths (b_0, b_1, and b_4) show a positive correlation with perplexity, suggesting that excessive …
Figure 9: GRINQH redefines the quantization Pareto frontier across model families and scales. Dashed lines indicate BF16 baselines. GRINQH data points represent a sweep of precision distributions compared against iso-bit symmetric RTN, GPTQ, and AWQ baselines. (A) WikiText-2 perplexity vs. effective bit width for the Qwen3 family using b_max ∈ {6, 8}. GRINQH outperforms SOTA counterparts, maintaining representationa…
Figure 10: Comparison between GRINQH’s fine-grained dynamic input-channel-wise bit allocation
Figure 11: FP32 throughput over DRAM bytes loaded for different effective bit widths on consumer
Figure 12: Robustness of calibrated thresholds to sample size and subset variation for Llama-3.1-8B
Figure 13: Quantization method comparison across key metrics on Llama3 8B. Performance met
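The bit-planar storage mentioned in the Figure 2 caption can be illustrated with a small sketch. The caption is truncated before it gives the exact layout, so the scheme below is an assumption: an unsigned 6-bit code is split into three 2-bit planes, most significant first, and reading only the first planes yields a coarser version of the same weight. The names `split_planes` and `load_prefix` are hypothetical.

```python
B_MAX = 6        # widest stored precision, as in the Figure 2 caption
PLANE_BITS = 2   # assumed plane width, matching the tiers {0, 2, 4, 6}


def split_planes(code):
    """Split an unsigned 6-bit code into 2-bit planes, MSB plane first."""
    if not 0 <= code < 2 ** B_MAX:
        raise ValueError("code must fit in B_MAX bits")
    n = B_MAX // PLANE_BITS
    mask = (1 << PLANE_BITS) - 1
    return [(code >> (PLANE_BITS * (n - 1 - k))) & mask for k in range(n)]


def load_prefix(planes, bits):
    """Rebuild a code from the first bits // PLANE_BITS planes.

    Planes that are not read contribute zero bits, so the result is the
    original code truncated to `bits` bits of precision.
    """
    k = bits // PLANE_BITS
    code = 0
    for plane in planes[:k]:
        code = (code << PLANE_BITS) | plane
    return code << (B_MAX - k * PLANE_BITS)
```

With this layout a channel assigned 2 bits touches one plane, a channel assigned 6 bits touches all three, and a channel assigned 0 bits is skipped, so the bytes read track the assigned precision without storing separate copies of the weights.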
Original abstract

Autoregressive decoding with LLMs is primarily bottlenecked by GPU memory bandwidth, especially in edge-computing settings. While quantization is essential for mitigating this bottleneck, most existing methods treat inference as a uniform process and fail to account for the asymmetry between the compute-bound prefill stage and the memory-bound decoding stage. We propose GRINQH (GRaded INput-based Quantization Hierarchy), a weight-only post-training quantization framework that accelerates decoding by unifying quantization and sparsification. GRINQH leverages activation magnitudes as a proxy for computational importance to dynamically assign weight channels to different precision levels, enabling flexible average bit widths during decoding. Evaluated on Llama3 and Qwen3 models, GRINQH outperforms state-of-the-art fixed- and mixed-precision baselines at comparable 3- and 4-bit settings, even enabling effective 2-bit generation. We experimentally verify theoretical speedups by leveraging a hierarchical nested memory layout for multi-precision storage in a custom GPU kernel. Ultimately, GRINQH establishes a new state-of-the-art Pareto frontier for LLM generation, enabling a dynamic trade-off between generation quality and inference speed.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper proposes GRINQH, a weight-only post-training quantization framework for LLMs that unifies quantization and sparsification. It dynamically assigns weight channels to precision levels in a hierarchy using activation magnitudes from calibration data as a proxy for computational importance. This is intended to accelerate the memory-bound decoding stage while handling the asymmetry with the compute-bound prefill stage. On Llama3 and Qwen3 models, it claims to outperform fixed- and mixed-precision SOTA baselines at comparable 3- and 4-bit average widths, enable effective 2-bit generation, and experimentally verify theoretical speedups via a custom GPU kernel with hierarchical nested memory layout for multi-precision storage.

Significance. If the central claims hold after addressing validation gaps, GRINQH could meaningfully advance efficient autoregressive LLM inference, particularly for edge devices where memory bandwidth dominates. The dynamic, input-based hierarchy offers a potential improvement over static quantization by allowing flexible average bit widths, and the explicit GPU kernel implementation with speed-up verification is a concrete strength that supports practical impact.

major comments (2)
  1. [Method (activation magnitude proxy)] The central claim of outperformance at 3-/4-bit (and effective 2-bit) settings depends on activation magnitudes serving as a reliable proxy for per-channel importance during autoregressive decoding. The manuscript provides no explicit ablation or comparison demonstrating that this proxy outperforms alternatives such as gradient-based or Hessian-based importance metrics; without this, the reported gains risk being artifacts of calibration choice rather than the hierarchy itself.
  2. [Experiments] No details are given on experimental protocols, including calibration dataset size and selection criteria, number of generation steps evaluated, or how the hierarchy adapts across early vs. late tokens in decoding. This prevents verification that the outperformance supports the claim, as the proxy may be input-dependent.
minor comments (1)
  1. [Abstract] The abstract mentions 'theoretical speedups', but the full text should state the exact theoretical model and explain how the custom kernel achieves them.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We address the major comments point by point below.

  1. Referee: [Method (activation magnitude proxy)] The central claim of outperformance at 3-/4-bit (and effective 2-bit) settings depends on activation magnitudes serving as a reliable proxy for per-channel importance during autoregressive decoding. The manuscript provides no explicit ablation or comparison demonstrating that this proxy outperforms alternatives such as gradient-based or Hessian-based importance metrics; without this, the reported gains risk being artifacts of calibration choice rather than the hierarchy itself.

    Authors: We agree that an explicit ablation comparing activation magnitudes against gradient- and Hessian-based alternatives would strengthen the justification for the chosen proxy. While activation magnitude is a standard, low-overhead proxy in post-training quantization, we will add such an ablation study to the revised manuscript, evaluating all three metrics on the same Llama3 and Qwen3 models and bit-width settings. revision: yes

  2. Referee: [Experiments] No details are given on experimental protocols, including calibration dataset size and selection criteria, number of generation steps evaluated, or how the hierarchy adapts across early vs. late tokens in decoding. This prevents verification that the outperformance supports the claim, as the proxy may be input-dependent.

    Authors: We acknowledge that the original manuscript omitted these protocol details. In the revision we will expand the Experiments section to specify the calibration dataset size and selection, the number of generation steps and prompts used, and provide analysis of hierarchy behavior across early versus late decoding tokens. revision: yes

Circularity Check

0 steps flagged

No circularity detected in GRINQH derivation


The provided abstract and manuscript excerpt describe GRINQH as an empirical post-training quantization method that assigns weight-channel precisions using activation magnitudes from calibration data. No equations, parameter-fitting steps, or self-citations appear that would reduce any claimed prediction or uniqueness result to the inputs by construction. The central claims rest on experimental comparisons to baselines rather than a closed definitional or self-referential loop, rendering the derivation self-contained.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

Review performed on abstract only; ledger populated from explicit statements in the abstract. No free parameters, invented entities, or additional axioms are identifiable without the full manuscript.

axioms (1)
  • domain assumption: Activation magnitudes are a valid proxy for computational importance when assigning precision levels.
    Stated as the basis for dynamic channel assignment in the quantization hierarchy.

pith-pipeline@v0.9.1-grok · 5741 in / 1243 out tokens · 22292 ms · 2026-06-26T08:36:34.514538+00:00 · methodology


A Appendix: Hyperparameter Selection and Tuning

This section details the formalization of our precision distribution (Section A.1), describes the random sweep used to validate t...

  • Sample: Use a constrained random sampler (e.g., based on a Dirichlet distribution) to generate ∼15 candidate vectors P that satisfy the simplex constraints and align with the sparsity regimes identified above.
  • Calibrate: Perform a single calibration forward pass for each P to determine the layer-wise thresholds and the resulting calibration PPL.
  • Select: Deploy the configuration that yields the lowest calibration PPL.

This empirical approach effectively identifies high-performance distributions without the need for an exhaustive search. Future work may further a...
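The Sample, Calibrate, and Select steps quoted above can be sketched roughly as follows. This is a sketch under stated assumptions, not the authors' code: the tiers are taken as {0, 2, 4, 6}, the sampler is a symmetric Dirichlet with rejection on the target effective bit width, thresholds are set as quantiles of calibration magnitudes, and `score` is a hypothetical stand-in for calibration perplexity, which requires a real model to compute.

```python
import random

BIT_LEVELS = (0, 2, 4, 6)  # assumed tiers; the paper also uses b_max = 8


def sample_distribution(rng, alpha=1.0, k=len(BIT_LEVELS)):
    """Draw P from a symmetric Dirichlet via normalized Gamma variates."""
    g = [rng.gammavariate(alpha, 1.0) for _ in range(k)]
    total = sum(g)
    return [v / total for v in g]


def effective_bits(p, levels=BIT_LEVELS):
    """Expected bit width under the precision distribution P."""
    return sum(pi * b for pi, b in zip(p, levels))


def sample_candidates(target, tol=0.1, n=15, seed=0, max_tries=100_000):
    """Rejection-sample n distributions with effective bits in target +/- tol."""
    rng = random.Random(seed)
    out = []
    for _ in range(max_tries):
        p = sample_distribution(rng)
        if abs(effective_bits(p) - target) <= tol:
            out.append(p)
            if len(out) == n:
                break
    return out


def thresholds_from_quantiles(magnitudes, p):
    """Cut points such that roughly a fraction p[i] of values falls in tier i."""
    xs = sorted(magnitudes)
    cuts, cum = [], 0.0
    for pi in p[:-1]:
        cum += pi
        idx = min(len(xs) - 1, int(cum * len(xs)))
        cuts.append(xs[idx])
    return cuts


def select(candidates, score):
    """Return the candidate with the lowest score (e.g. calibration PPL)."""
    return min(candidates, key=score)
```

In use, `sample_candidates(4.0)` would produce the candidate vectors, `thresholds_from_quantiles` would be applied per layer to magnitudes collected in the calibration pass, and `select` would pick the vector whose calibration score is lowest. Rejection sampling is used here only because it is simple; it does not encode the sparsity regimes the appendix refers to.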