GRINQH: Graded Input-based Quantization Hierarchy for Efficient LLM Generation

Catherine M. Schöfmann; Emre Neftci; Jan Finkbeiner; Jette Oberländer

arxiv: 2606.23419 · v1 · pith:DCUXCCTVnew · submitted 2026-06-22 · 💻 cs.LG · cs.AI

Pith reviewed 2026-06-26 08:36 UTC · model grok-4.3

keywords LLM quantization · post-training quantization · mixed-precision inference · efficient autoregressive decoding · weight-only quantization · GPU kernel optimization

The pith

GRINQH assigns LLM weight channels to precision levels using activation magnitudes as an importance proxy, allowing variable bit widths that outperform fixed- and mixed-precision baselines at 3 and 4 bits and support effective 2-bit generation.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces GRINQH as a weight-only post-training quantization method that addresses the memory bandwidth bottleneck in LLM autoregressive decoding. It treats inference asymmetrically by focusing on the memory-bound decode phase and dynamically grades weight channels into a hierarchy of precisions based on input activation magnitudes. This unifies quantization with sparsification to achieve flexible average bit widths. Evaluated on Llama3 and Qwen3, the approach claims better quality than existing methods at comparable 3- and 4-bit settings while enabling usable 2-bit operation. A custom GPU kernel with nested memory layout confirms the theoretical speed gains.
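To make the memory-bandwidth argument concrete, here is a rough back-of-the-envelope sketch. It assumes a decode step must stream every weight from DRAM once and ignores the KV cache, quantization scales, and compute time. The function names, the 8e9 parameter count, and the 1000 GB/s bandwidth are illustrative assumptions, not figures from the paper.

```python
def weight_bytes(num_params, effective_bits):
    """Bytes needed to stream all weights once at a given average bit width."""
    return num_params * effective_bits / 8


def min_decode_time_ms(num_params, effective_bits, bandwidth_gb_s):
    """Bandwidth-only lower bound for one decode step, in milliseconds.

    Assumption: ignores KV cache traffic, scales/metadata, and compute.
    """
    seconds = weight_bytes(num_params, effective_bits) / (bandwidth_gb_s * 1e9)
    return seconds * 1e3


if __name__ == "__main__":
    # Hypothetical model size and bandwidth, chosen only for round numbers.
    for bits in (16, 4, 3, 2):
        print(bits, "bits ->", round(min_decode_time_ms(8e9, bits, 1000), 2), "ms")
```

Under these assumptions the bound scales linearly with the effective bit width, which is why a lower average precision during decoding can translate into a proportional speedup when the kernel is truly bandwidth-bound.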

Core claim

GRINQH dynamically assigns weight channels to different precision levels in a graded hierarchy by using activation magnitudes to estimate computational importance, producing a single framework that improves generation quality over fixed- and mixed-precision baselines at 3- and 4-bit averages and permits effective 2-bit decoding on Llama3 and Qwen3 models.

What carries the argument

The GRaded INput-based Quantization Hierarchy, which uses activation magnitudes to route weight channels into multiple precision tiers within a unified quantization-sparsification scheme.
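The routing step described above can be sketched as follows. The bit levels {0, 2, 4, 6} are taken from the Figure 2 caption; the quantile rule used to derive thresholds from calibration magnitudes, and all function names, are assumptions made for illustration rather than the authors' implementation.

```python
from bisect import bisect_right

BIT_LEVELS = (0, 2, 4, 6)  # levels named in the Figure 2 caption


def calibrate_thresholds(calib_magnitudes, fractions):
    """Pick magnitude cut points so that roughly fractions[i] of the
    calibration activations land in level i (hypothetical quantile rule)."""
    ordered = sorted(calib_magnitudes)
    thresholds, cumulative = [], 0.0
    for frac in fractions[:-1]:
        cumulative += frac
        idx = min(int(round(cumulative * len(ordered))), len(ordered) - 1)
        thresholds.append(ordered[idx])
    return thresholds


def assign_bits(activations, thresholds):
    """Map each input channel's |x_i| to a bit width b_i."""
    return [BIT_LEVELS[bisect_right(thresholds, abs(x))] for x in activations]


def effective_bits(bits):
    """Average bit width over all channels."""
    return sum(bits) / len(bits)


if __name__ == "__main__":
    calib = [i / 100 for i in range(100)]
    th = calibrate_thresholds(calib, (0.25, 0.25, 0.25, 0.25))
    bits = assign_bits([0.1, -0.3, 0.6, 0.9], th)
    print(th, bits, effective_bits(bits))
```

Channels whose activation falls below the first threshold receive 0 bits, which is how sparsification and quantization share one mechanism in this sketch.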

If this is right

The method produces a new Pareto-optimal trade-off curve between output quality and decode speed.
Effective 2-bit generation becomes feasible without the quality collapse seen in prior fixed-precision approaches.
A hierarchical nested memory layout in a custom kernel translates the variable bit widths into measured wall-clock speedups.
The same graded assignment principle can be applied at inference time without retraining.
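One way to picture a nested layout is as bit planes: a 6-bit weight code is split into planes, and a lower precision is obtained by loading only the most significant planes. The sketch below assumes 2-bit planes and plain truncation of the missing low-order bits; the paper's actual packing and dequantization are not described in this excerpt, so treat every name and choice here as hypothetical.

```python
B_MAX = 6       # maximum stored precision, per the Figure 2 caption
PLANE_BITS = 2  # assumed plane granularity
PLANE_MASK = (1 << PLANE_BITS) - 1


def split_planes(code):
    """Split an unsigned B_MAX-bit code into planes, most significant first."""
    if not 0 <= code < (1 << B_MAX):
        raise ValueError("code out of range")
    n = B_MAX // PLANE_BITS
    return [(code >> (PLANE_BITS * (n - 1 - k))) & PLANE_MASK for k in range(n)]


def load_at_precision(planes, bits):
    """Rebuild an approximation from the top bits // PLANE_BITS planes.

    Missing low-order planes are treated as zero (truncation, an assumption).
    """
    used = bits // PLANE_BITS
    code = 0
    for plane in planes[:used]:
        code = (code << PLANE_BITS) | plane
    return code << (B_MAX - used * PLANE_BITS)


if __name__ == "__main__":
    planes = split_planes(45)  # 0b101101
    print(planes, [load_at_precision(planes, b) for b in (0, 2, 4, 6)])
```

The point of such a layout is that a 2-bit or 4-bit read touches only a prefix of the stored planes, so the bytes fetched from DRAM shrink with the assigned precision.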

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

If the activation-magnitude proxy generalizes, the hierarchy could be recomputed on the fly for different prompts or tasks rather than fixed once per model.
The approach may combine with KV-cache compression to further reduce memory traffic in long-context settings.
Similar input-driven grading might apply to non-transformer architectures where activation statistics also track parameter importance.

Load-bearing premise

Activation magnitudes reliably indicate which weight channels matter most for generation quality when deciding their precision level.

What would settle it

An experiment in which random or uniform channel-to-precision assignments at the same average bit width match or exceed GRINQH quality on Llama3 or Qwen3 would falsify the central claim.
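A control of this kind could be built by shuffling an existing channel-to-precision assignment, which keeps the average bit width identical while destroying any link to activation magnitude. The helper below is a minimal sketch; the function name and seed handling are assumptions, and the quality evaluation itself is left to the caller.

```python
import random


def shuffled_assignment(bits, seed=0):
    """Random control: same multiset of bit widths, random channel order."""
    rng = random.Random(seed)
    control = list(bits)
    rng.shuffle(control)
    return control


if __name__ == "__main__":
    graded = [0, 2, 2, 4, 4, 4, 6, 6]
    control = shuffled_assignment(graded, seed=1)
    # Both assignments have the same average bit width by construction.
    print(control, sum(control) / len(control), sum(graded) / len(graded))
```

Comparing perplexity or GSM8K accuracy under `graded` and `control` at several seeds would isolate the contribution of the activation-magnitude proxy from the contribution of the precision mix itself.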

Figures

Figures reproduced from arXiv: 2606.23419 by Catherine M. Schöfmann, Emre Neftci, Jan Finkbeiner, Jette Oberländer.

**Figure 1.** Pareto frontier of GSM8K accuracy vs. VMM runtime for Qwen3-8B on an RTX 4090. GPTQ, AWQ, and RTN assume MARLIN execution. The primary operational focus for Large Language Models (LLMs) has shifted from training to efficient inference at scale. Despite advances in hardware throughput, inference remains constrained by the memory wall: a fundamental performance bottleneck where off-chip DRAM transfer speeds ca…

**Figure 2.** Overview of GRINQH. During decoding, GRINQH mitigates the memory bandwidth bottleneck through a dynamic, channel-wise precision loading scheme. 1) Precision Assignment: Input activations x_i are mapped to bit widths b_i ∈ {0, 2, 4, 6} based on their magnitude via precomputed thresholds from a calibration set. 2) Bit-Stacked Storage: Weights are stored in DRAM using a bit-planar format. Each b_max = 6-bit wei…

**Figure 3.** GRINQH redefines the quantization Pareto frontier across model families and scales. Dashed lines indicate BF16 baselines. GRINQH precision distribution sweep is compared against iso-bit symmetric RTN, GPTQ, and AWQ baselines. (A) WikiText-2 perplexity vs. effective bit width for the Llama3 Instruct family using b_max ∈ {6, 8}. (B) GSM8K CoT accuracy vs. effective bit width for the Qwen3 family. (Left panel)…

**Figure 4.** Normalized isolated kernel runtimes over a range of target effective bit widths. Times are normalized w.r.t. the 4-bit Marlin kernel on the same device. Representational Fidelity in Decoupled Inference. We evaluate GRINQH in a realistic inference setting where prefill and decoding are treated separately. Since the prefill stage is primarily compute-bound, loading weights at b_max introduces negligible latency o…

**Figure 5.** End-to-End performance scaling and task evaluation.

**Figure 6.** (A) Sensitivity analysis of the 0-bit (sparsity) fraction (p_0) for specific bit-width windows (Top: 2.0–3.0 bits; Bottom: 3.7–4.3 bits). Data points indicate the relative GSM8K performance change across varying sparsity levels with respect to the average performance in that window. Results show that high sparsity benefits low-bit regimes but penalizes higher bit regimes, suggesting that optimal p_0 configur…

**Figure 7.** Correlation between calibration and downstream perplexity. Each data point represents a unique precision distribution configuration P sampled during our hyperparameter sweep. Calibration PPL is computed on The Pile (Uncopyrighted), while downstream PPL is evaluated on WikiText-2. The strong linear correlation validates calibration PPL as a reliable and computationally efficient proxy metric for performance…

**Figure 8.** Sensitivity of calibration perplexity to precision fractions p_i. Each panel illustrates the impact of a specific bit-width allocation on Llama3-1B performance, constrained to a target effective bit width b* = 4.0 ± 0.1 (b_max = 8). We observe divergent scaling behaviors across the precision levels: boundary bit widths (b_0, b_1, and b_4) show a positive correlation with perplexity, suggesting that excessive…

**Figure 9.** GRINQH redefines the quantization Pareto frontier across model families and scales. Dashed lines indicate BF16 baselines. GRINQH data points represent a sweep of precision distributions compared against iso-bit symmetric RTN, GPTQ, and AWQ baselines. (A) WikiText-2 perplexity vs. effective bit width for the Qwen3 family using b_max ∈ {6, 8}. GRINQH outperforms SOTA counterparts, maintaining representationa…

**Figure 10.** Comparison between GRINQH’s fine-grained dynamic input-channel-wise bit allocation…

**Figure 11.** FP32 throughput over DRAM bytes loaded for different effective bit widths on consumer…

**Figure 12.** Robustness of calibrated thresholds to sample size and subset variation for Llama-3.1-8B…

**Figure 13.** Quantization method comparison across key metrics on Llama3 8B. Performance met…
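Several captions above refer to precision fractions p_i and a target effective bit width b*. A plausible reading, stated here as an assumption because the formal definition in Section A.1 is not reproduced on this page, is that b* is the fraction-weighted mean of the available bit widths, ignoring any overhead for scales or metadata:

```latex
\[
  b^{*} \;=\; \sum_{i} p_i \, b_i ,
  \qquad \sum_{i} p_i = 1 , \quad p_i \ge 0 .
\]
% Illustrative example with the levels b = (0, 2, 4, 6) from the Figure 2 caption
% and a hypothetical distribution p = (0.1, 0.3, 0.4, 0.2):
\[
  b^{*} \;=\; 0.1 \cdot 0 + 0.3 \cdot 2 + 0.4 \cdot 4 + 0.2 \cdot 6 \;=\; 3.4 .
\]
```

Under this reading, raising the 0-bit fraction p_0 frees budget for more high-precision channels at the same b*, which is the trade-off the Figure 6 and Figure 8 captions describe.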

Original abstract

Autoregressive decoding with LLMs is primarily bottlenecked by GPU memory bandwidth, especially in edge-computing settings. While quantization is essential for mitigating this bottleneck, most existing methods treat inference as a uniform process and fail to account for the asymmetry between the compute-bound prefill stage and the memory-bound decoding stage. We propose GRINQH (GRaded INput-based Quantization Hierarchy), a weight-only post-training quantization framework that accelerates decoding by unifying quantization and sparsification. GRINQH leverages activation magnitudes as a proxy for computational importance to dynamically assign weight channels to different precision levels, enabling flexible average bit widths during decoding. Evaluated on Llama3 and Qwen3 models, GRINQH outperforms state-of-the-art fixed- and mixed-precision baselines at comparable 3- and 4-bit settings, even enabling effective 2-bit generation. We experimentally verify theoretical speedups by leveraging a hierarchical nested memory layout for multi-precision storage in a custom GPU kernel. Ultimately, GRINQH establishes a new state-of-the-art Pareto frontier for LLM generation, enabling a dynamic trade-off between generation quality and inference speed.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

GRINQH's graded assignment of weight channels by activation magnitude is the actual new piece, but the abstract gives no evidence that this proxy beats simpler alternatives or holds across inputs.

The paper introduces a post-training weight-only quantization scheme that uses activation magnitudes to sort channels into a precision hierarchy, then unifies that with sparsification and stores the result in a nested memory layout so a custom kernel can read at the right average bit width. The target is the memory-bandwidth limit during autoregressive decoding rather than prefill.

What the work does cleanly is name the prefill/decode asymmetry and show a concrete way to get variable precision without retraining. The claim that this beats fixed- and mixed-precision baselines at 3- and 4-bit settings, and even supports usable 2-bit generation on Llama3 and Qwen3, is the result that would matter if it holds.

The soft spot is exactly the one the stress-test flags: activation magnitude is treated as a reliable importance signal, yet the abstract supplies no ablation against gradient or Hessian proxies, no sensitivity check on calibration data, and no breakdown by token position or prompt type. If that correlation is weak or input-dependent, the reported Pareto gains could shrink or disappear. The custom kernel and speed-up measurements are mentioned but not quantified here, so the practical side also needs the full numbers.

The citation list looks standard for the quantization literature and does not appear to hide circular definitions. No equations are visible, so there is nothing to check for fitting artifacts.

This is for people who build or deploy low-bit LLM inference stacks. A reader already working on mixed-precision or activation-aware methods would get a usable idea to test. The paper is coherent enough on its own terms to deserve referee time; the central claim is falsifiable once the experiments are shown.

Referee Report

2 major / 1 minor

Summary. The paper proposes GRINQH, a weight-only post-training quantization framework for LLMs that unifies quantization and sparsification. It dynamically assigns weight channels to precision levels in a hierarchy using activation magnitudes from calibration data as a proxy for computational importance. This is intended to accelerate the memory-bound decoding stage while handling the asymmetry with the compute-bound prefill stage. On Llama3 and Qwen3 models, it claims to outperform fixed- and mixed-precision SOTA baselines at comparable 3- and 4-bit average widths, enable effective 2-bit generation, and experimentally verify theoretical speedups via a custom GPU kernel with hierarchical nested memory layout for multi-precision storage.

Significance. If the central claims hold after addressing validation gaps, GRINQH could meaningfully advance efficient autoregressive LLM inference, particularly for edge devices where memory bandwidth dominates. The dynamic, input-based hierarchy offers a potential improvement over static quantization by allowing flexible average bit widths, and the explicit GPU kernel implementation with speed-up verification is a concrete strength that supports practical impact.

major comments (2)

[Method (activation magnitude proxy)] The central claim of outperformance at 3-/4-bit (and effective 2-bit) settings depends on activation magnitudes serving as a reliable proxy for per-channel importance during autoregressive decoding. The manuscript provides no explicit ablation or comparison demonstrating that this proxy outperforms alternatives such as gradient-based or Hessian-based importance metrics; without this, the reported gains risk being artifacts of calibration choice rather than the hierarchy itself.
[Experiments] No details are given on experimental protocols, including calibration dataset size and selection criteria, number of generation steps evaluated, or how the hierarchy adapts across early vs. late tokens in decoding. This prevents verification that the outperformance supports the claim, as the proxy may be input-dependent.

minor comments (1)

[Abstract] The abstract mentions 'theoretical speedups', but the full text should clarify the exact theoretical model and how the custom kernel achieves them.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We address the major comments point by point below.

Referee: [Method (activation magnitude proxy)] The central claim of outperformance at 3-/4-bit (and effective 2-bit) settings depends on activation magnitudes serving as a reliable proxy for per-channel importance during autoregressive decoding. The manuscript provides no explicit ablation or comparison demonstrating that this proxy outperforms alternatives such as gradient-based or Hessian-based importance metrics; without this, the reported gains risk being artifacts of calibration choice rather than the hierarchy itself.

Authors: We agree that an explicit ablation comparing activation magnitudes against gradient- and Hessian-based alternatives would strengthen the justification for the chosen proxy. While activation magnitude is a standard, low-overhead proxy in post-training quantization, we will add such an ablation study to the revised manuscript, evaluating all three metrics on the same Llama3 and Qwen3 models and bit-width settings. revision: yes
Referee: [Experiments] No details are given on experimental protocols, including calibration dataset size and selection criteria, number of generation steps evaluated, or how the hierarchy adapts across early vs. late tokens in decoding. This prevents verification that the outperformance supports the claim, as the proxy may be input-dependent.

Authors: We acknowledge that the original manuscript omitted these protocol details. In the revision we will expand the Experiments section to specify the calibration dataset size and selection, the number of generation steps and prompts used, and provide analysis of hierarchy behavior across early versus late decoding tokens. revision: yes

Circularity Check

0 steps flagged

No circularity detected in GRINQH derivation

The provided abstract and manuscript excerpt describe GRINQH as an empirical post-training quantization method that assigns weight-channel precisions using activation magnitudes from calibration data. No equations, parameter-fitting steps, or self-citations appear that would reduce any claimed prediction or uniqueness result to the inputs by construction. The central claims rest on experimental comparisons to baselines rather than a closed definitional or self-referential loop, rendering the derivation self-contained.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

Review performed on abstract only; ledger populated from explicit statements in the abstract. No free parameters, invented entities, or additional axioms are identifiable without the full manuscript.

axioms (1)

Domain assumption: Activation magnitudes are a valid proxy for computational importance when assigning precision levels.
Stated as the basis for dynamic channel assignment in the quantization hierarchy.

pith-pipeline@v0.9.1-grok · 5741 in / 1243 out tokens · 22292 ms · 2026-06-26T08:36:34.514538+00:00 · methodology

Appendix A: Hyperparameter Selection and Tuning

This section details the formalization of our precision distribution (Section A.1), describes the random sweep used to validate t...

Sample: Use a constrained random sampler (e.g., based on a Dirichlet distribution) to generate ∼15 candidate vectors P that satisfy the simplex constraints and align with the sparsity regimes identified above.
Calibrate: Perform a single calibration forward pass for each P to determine the layer-wise thresholds and the resulting calibration PPL.
4. Select: Deploy the configuration that yields the lowest calibration PPL.

This empirical approach effectively identifies high-performance distributions without the need for an exhaustive search. Future work may further a...
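The sample, calibrate, and select steps quoted from the appendix above can be sketched as a small loop. This is a minimal illustration under stated assumptions: the bit levels {0, 2, 4, 6}, the flat Dirichlet prior, the rejection step on the effective bit width, and the `calibration_ppl` callback are all hypothetical stand-ins, since the paper's exact sampler and constraints are not shown on this page.

```python
import random

BIT_LEVELS = (0, 2, 4, 6)  # assumed levels, from the Figure 2 caption


def sample_simplex(alpha, rng):
    """Draw one point from Dirichlet(alpha) via normalized gamma variates."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]


def effective_bits(p):
    """Fraction-weighted mean bit width of a precision distribution P."""
    return sum(pi * b for pi, b in zip(p, BIT_LEVELS))


def sample_candidates(target, tol=0.1, count=15, seed=0, max_tries=100_000):
    """Rejection-sample distributions P whose effective bits are near target."""
    rng = random.Random(seed)
    found = []
    for _ in range(max_tries):
        p = sample_simplex([1.0] * len(BIT_LEVELS), rng)
        if abs(effective_bits(p) - target) <= tol:
            found.append(p)
            if len(found) == count:
                break
    return found


def select_best(candidates, calibration_ppl):
    """Pick the P with the lowest calibration perplexity.

    calibration_ppl is a caller-supplied function P -> float (hypothetical).
    """
    return min(candidates, key=calibration_ppl)


if __name__ == "__main__":
    cands = sample_candidates(target=3.0)
    # Placeholder score; a real run would do one calibration forward pass per P.
    best = select_best(cands, calibration_ppl=lambda p: p[0])
    print(len(cands), [round(x, 3) for x in best])
```

In a real setting the placeholder lambda would be replaced by a calibration forward pass that returns perplexity, so the cost of the sweep is roughly one pass per candidate.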

[1] [1]

Abecassis, A

F. Abecassis, A. Agrusa, D. Ahn, J. Alben, S. Alborghetti, M. Andersch, S. Arayandi, A. Bjorlin, A. Blakeman, E. Briones, et al. Pretraining large language models with nvfp4.arXiv preprint arXiv:2509.25149, 2025

arXiv 2025

[2] [2]

D. Alvarez. Juwels cluster and booster: exascale pathfinder with modular supercomputing architecture at juelich supercomputing centre.Journal of large-scale research facilities JLSRF, 7:A183–A183, 2021

2021

[3] [3]

Ashkboos, I

S. Ashkboos, I. Markov, E. Frantar, T. Zhong, X. Wang, J. Ren, T. Hoefler, and D. Alistarh. Quik: Towards end-to-end 4-bit inference on generative large language models, 2023. URL https://arxiv.org/abs/2310.09259. 10

arXiv 2023

[4] [4]

Ashkboos, A

S. Ashkboos, A. Mohtashami, M. L. Croci, B. Li, P. Cameron, M. Jaggi, D. Alistarh, T. Hoefler, and J. Hensman. Quarot: Outlier-free 4-bit inference in rotated llms, 2024. URL https: //arxiv.org/abs/2404.00456

arXiv 2024

[5] [5]

Y . Bisk, R. Zellers, J. Gao, Y . Choi, et al. Piqa: Reasoning about physical commonsense in natural language. InProceedings of the AAAI conference on artificial intelligence, volume 34, pages 7432–7439, 2020

2020

[6] [6]

H. M. Chen, F. Tan, A. Kouris, R. Lee, H. Fan, and S. I. Venieris. Progressive mixed-precision decoding for efficient llm inference, 2024. URLhttps://arxiv.org/abs/2410.13461

arXiv 2024

[7] [7]

Cheng, W

W. Cheng, W. Zhang, H. Shen, Y . Cai, X. He, K. Lv, and Y . Liu. Optimize weight rounding via signed gradient descent for the quantization of llms, 2023. URL https://arxiv.org/abs/ 2309.05516

arXiv 2023

[8] [8]

Clark, K

C. Clark, K. Lee, M.-W. Chang, T. Kwiatkowski, M. Collins, and K. Toutanova. Boolq: Exploring the surprising difficulty of natural yes/no questions, 2019. URL https://arxiv. org/abs/1905.10044

Pith/arXiv arXiv 2019

[9] [9]

Clark, I

P. Clark, I. Cowhey, O. Etzioni, T. Khot, A. Sabharwal, C. Schoenick, and O. Tafjord. Think you have solved question answering? try arc, the ai2 reasoning challenge, 2018. URL https: //arxiv.org/abs/1803.05457

Pith/arXiv arXiv 2018

[10] [10]

Cobbe, V

K. Cobbe, V . Kosaraju, M. Bavarian, M. Chen, H. Jun, L. Kaiser, M. Plappert, J. Tworek, J. Hilton, R. Nakano, C. Hesse, and J. Schulman. Training verifiers to solve math word problems, 2021. URLhttps://arxiv.org/abs/2110.14168

Pith/arXiv arXiv 2021

[11] [11]

Dettmers, M

T. Dettmers, M. Lewis, Y . Belkada, and L. Zettlemoyer. Llm.int8(): 8-bit matrix multiplication for transformers at scale, 2022. URLhttps://arxiv.org/abs/2208.07339

Pith/arXiv arXiv 2022

[12] [12]

Kudugunta, A

Devvrit, S. Kudugunta, A. Kusupati, T. Dettmers, K. Chen, I. Dhillon, Y . Tsvetkov, H. Hajishirzi, S. Kakade, A. Farhadi, and P. Jain. Matformer: Nested transformer for elastic inference, 2023. URLhttps://arxiv.org/abs/2310.07707

arXiv 2023

[13] [13]

Eccleston

D. Eccleston. sharegpt, 2022. URL https://github.com/domeccleston/sharegpt. Ac- cessed 2026-03-12

2022

[14] E. Frantar, S. Ashkboos, T. Hoefler, and D. Alistarh. Gptq: Accurate post-training quantization for generative pre-trained transformers, 2022. URL https://arxiv.org/abs/2210.17323

[15] E. Frantar, R. L. Castro, J. Chen, T. Hoefler, and D. Alistarh. Marlin: Mixed-precision auto-regressive parallel inference on large language models, 2024. URL https://arxiv.org/abs/2408.11743

[16] L. Gao, S. Biderman, S. Black, L. Golding, T. Hoppe, C. Foster, J. Phang, H. He, A. Thite, N. Nabeshima, S. Presser, and C. Leahy. The pile: An 800gb dataset of diverse text for language modeling, 2021. URL https://arxiv.org/abs/2101.00027

[17] A. Grattafiori, A. Dubey, A. Jauhri, A. Pandey, A. Kadian, et al. The llama 3 herd of models, 2024. URL https://arxiv.org/abs/2407.21783

[19] D. Hendrycks, C. Burns, S. Basart, A. Zou, M. Mazeika, D. Song, and J. Steinhardt. Measuring massive multitask language understanding, 2020. URL https://arxiv.org/abs/2009.03300

[20] W. Huang, H. Qin, Y. Liu, Y. Li, Q. Liu, X. Liu, L. Benini, M. Magno, S. Zhang, and X. Qi. Slim-llm: Salience-driven mixed-precision quantization for large language models, 2024. URL https://arxiv.org/abs/2405.14917

[21] Jülich Supercomputing Centre. JURECA: Data Centric and Booster Modules implementing the Modular Supercomputing Architecture at Jülich Supercomputing Centre. Journal of large-scale research facilities, 7(A182), 2021. doi: 10.17815/jlsrf-7-182. URL http://dx.doi.org/10.17815/jlsrf-7-182

[22] M. Kleinegger, E. Crnčević, and D. Alistarh. Matgptq: Accurate and efficient post-training matryoshka quantization, 2026. URL https://arxiv.org/abs/2602.03537

[23] W. Kwon, Z. Li, S. Zhuang, Y. Sheng, L. Zheng, C. H. Yu, J. E. Gonzalez, H. Zhang, and I. Stoica. Efficient memory management for large language model serving with pagedattention. In Proceedings of the ACM SIGOPS 29th Symposium on Operating Systems Principles, 2023.

[24] J. Lin, J. Tang, H. Tang, S. Yang, W.-M. Chen, W.-C. Wang, G. Xiao, X. Dang, C. Gan, and S. Han. Awq: Activation-aware weight quantization for llm compression and acceleration, 2023. URL https://arxiv.org/abs/2306.00978

[25] J. Liu, P. Ponnusamy, T. Cai, H. Guo, Y. Kim, and B. Athiwaratkun. Training-free activation sparsity in large language models, 2024. URL https://arxiv.org/abs/2408.14690

[26] Z. Liu, J. Wang, T. Dao, T. Zhou, B. Yuan, Z. Song, A. Shrivastava, C. Zhang, Y. Tian, C. Re, and B. Chen. Deja vu: Contextual sparsity for efficient llms at inference time. 2023. doi: 10.48550/ARXIV.2310.17157. URL https://arxiv.org/abs/2310.17157

[27] S. Merity, C. Xiong, J. Bradbury, and R. Socher. Pointer sentinel mixture models, 2016. URL https://arxiv.org/abs/1609.07843

[28] I. Mirzadeh, K. Alizadeh, S. Mehta, C. C. Del Mundo, O. Tuzel, G. Samei, M. Rastegari, and M. Farajtabar. Relu strikes back: Exploiting activation sparsity in large language models, 2023. URL https://arxiv.org/abs/2310.04564

[29] P. Nair, P. Datta, J. Dean, P. Jain, and A. Kusupati. Matryoshka quantization, 2025. URL https://arxiv.org/abs/2502.06786

[30] NVIDIA Corporation. NVIDIA Nsight Compute CLI. NVIDIA Corporation, 2026. URL https://docs.nvidia.com/nsight-compute/NsightComputeCli/index.html. Version 2025.4.1

[31] D. Paperno, G. Kruszewski, A. Lazaridou, Q. N. Pham, R. Bernardi, S. Pezzelle, M. Baroni, G. Boleda, and R. Fernández. The lambada dataset: Word prediction requiring a broad discourse context, 2016. URL https://arxiv.org/abs/1606.06031

[32] Y. Park, J. Hyun, S. Cho, B. Sim, and J. W. Lee. Any-precision llm: Low-cost deployment of multiple, different-sized llms, 2024. URL https://arxiv.org/abs/2402.10517

[33] R. Pope, S. Douglas, A. Chowdhery, J. Devlin, J. Bradbury, A. Levskaya, J. Heek, K. Xiao, S. Agrawal, and J. Dean. Efficiently scaling transformer inference, 2022. URL https://arxiv.org/abs/2211.05102

[34] Red Hat AI and vLLM Project. LLM Compressor. https://github.com/vllm-project/llm-compressor, Aug. 2024

[35] K. Sakaguchi, R. L. Bras, C. Bhagavatula, and Y. Choi. Winogrande: An adversarial winograd schema challenge at scale. Communications of the ACM, 64(9):99–106, 2021

[36] C. Song, X. Han, Z. Zhang, S. Hu, X. Shi, K. Li, C. Chen, Z. Liu, G. Li, T. Yang, and M. Sun. Prosparse: Introducing and enhancing intrinsic activation sparsity within large language models, 2024. URL https://arxiv.org/abs/2402.13516

[38] M. Sun, Z. Liu, A. Bair, and J. Z. Kolter. A simple and effective pruning approach for large language models, 2023. URL https://arxiv.org/abs/2306.11695

[39] L. Sutawika, H. Schoelkopf, L. Gao, B. Abbasi, S. Biderman, J. Tow, B. Fattori, C. Lovering, J. Phang, A. Thite, T. Wang, sdtblck, gakada, nopperl, researcher2, tttyuntian, E. Julen, Chris, J. A. Michaelov, H. A. Lee, Janna, L. Sinev, Z. Kasner, K. Stokes, Khalid, and KonradSzafer. Eleutherai/lm-evaluation-harness: lm-eval v0.4.9.2 release notes, 2025. UR... (doi: 10.5281/zenodo.17728786)

[40] P. Tillet, H. Kung, and D. Cox. Triton: An intermediate language and compiler for tiled neural network computations. 2019. URL https://www.eecs.harvard.edu/~htk/publication/2019-mapl-tillet-kung-cox.pdf

[41] torchao. Torchao: Pytorch-native training-to-serving model optimization, Oct. 2024. URL https://github.com/pytorch/ao

[42] A. Tseng, J. Chee, Q. Sun, V. Kuleshov, and C. De Sa. Quip#: Even better llm quantization with hadamard incoherence and lattice codebooks, 2024. URL https://arxiv.org/abs/2402.04396

[43] J. Wei, X. Wang, D. Schuurmans, M. Bosma, B. Ichter, F. Xia, E. Chi, Q. Le, and D. Zhou. Chain-of-thought prompting elicits reasoning in large language models, 2022. URL https://arxiv.org/abs/2201.11903.

[44] T. Wolf, L. Debut, V. Sanh, J. Chaumond, C. Delangue, A. Moi, P. Cistac, T. Rault, R. Louf, M. Funtowicz, J. Davison, S. Shleifer, P. von Platen, C. Ma, Y. Jernite, J. Plu, C. Xu, T. L. Scao, S. Gugger, M. Drame, Q. Lhoest, and A. M. Rush. Huggingface’s transformers: State-of-the-art natural language processing, 2019. URL https://arxiv.org/abs/1910.03771

[45] G. Xiao, J. Lin, M. Seznec, H. Wu, J. Demouth, and S. Han. Smoothquant: Accurate and efficient post-training quantization for large language models, 2022. URL https://arxiv.org/abs/2211.10438

[46] A. Yang, A. Li, B. Yang, B. Zhang, B. Hui, B. Zheng, B. Yu, C. Gao, C. Huang, C. Lv, C. Zheng, D. Liu, F. Zhou, F. Huang, F. Hu, H. Ge, H. Wei, H. Lin, J. Tang, J. Yang, J. Tu, J. Zhang, J. Yang, J. Yang, J. Zhou, J. Zhou, J. Lin, K. Dang, K. Bao, K. Yang, L. Yu, L. Deng, M. Li, M. Xue, M. Li, P. Zhang, P. Wang, Q. Zhu, R. Men, R. Gao, S. Liu, S. Luo, T. ... arXiv, 2025

[47] R. Zellers, A. Holtzman, Y. Bisk, A. Farhadi, and Y. Choi. Hellaswag: Can a machine really finish your sentence?, 2019. URL https://arxiv.org/abs/1905.07830

[48] J. Zhou, T. Lu, S. Mishra, S. Brahma, S. Basu, Y. Luan, D. Zhou, and L. Hou. Instruction-following evaluation for large language models. arXiv preprint arXiv:2311.07911, 2023.

A Appendix: Hyperparameter Selection and Tuning

This section details the formalization of our precision distribution (Section A.1), describes the random sweep used to validate t...

2. Sample: Use a constrained random sampler (e.g., based on a Dirichlet distribution) to generate ∼15 candidate vectors P that satisfy the simplex constraints and align with the sparsity regimes identified above.

3. Calibrate: Perform a single calibration forward pass for each P to determine the layer-wise thresholds and the resulting calibration PPL.

4. Select: Deploy the configuration that yields the lowest calibration PPL.

This empirical approach effectively identifies high-performance distributions without the need for an exhaustive search. Future work may further a...
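The sample, calibrate, and select steps described above can be illustrated with a minimal Python sketch. It is not the authors' implementation. The bit levels `[0, 2, 4, 8]`, the tolerance on the average bit width, and the helper names `sample_candidates`, `select_best`, and `toy_ppl` are all assumptions introduced for illustration. In particular, `toy_ppl` is a stand-in for the real calibration forward pass, and the average-bit-width filter is a simple substitute for the sparsity-regime constraints, which are not specified here.

```python
import random


def sample_dirichlet(rng, k, alpha=1.0):
    # A symmetric Dirichlet draw is a vector of normalized Gamma(alpha, 1) draws.
    draws = [rng.gammavariate(alpha, 1.0) for _ in range(k)]
    total = sum(draws)
    return [d / total for d in draws]


def average_bits(p, bit_levels):
    # Expected bit width when a fraction p[i] of channels uses bit_levels[i].
    return sum(pi * b for pi, b in zip(p, bit_levels))


def sample_candidates(bit_levels, target_bits, n_candidates=15,
                      tol=0.05, alpha=1.0, max_tries=100_000, seed=0):
    # Rejection sampling: keep simplex points whose average bit width
    # lies within tol of the target (an assumed constraint).
    rng = random.Random(seed)
    candidates = []
    for _ in range(max_tries):
        p = sample_dirichlet(rng, len(bit_levels), alpha)
        if abs(average_bits(p, bit_levels) - target_bits) <= tol:
            candidates.append(p)
            if len(candidates) == n_candidates:
                break
    return candidates


def select_best(candidates, calibration_ppl):
    # One calibration evaluation per candidate, then keep the lowest PPL.
    scored = [(calibration_ppl(p), p) for p in candidates]
    best_ppl, best_p = min(scored, key=lambda s: s[0])
    return best_p, best_ppl


def toy_ppl(p):
    # Hypothetical stand-in: penalizes the fraction of 0-bit (pruned) channels.
    return 10.0 + 5.0 * p[0]


candidates = sample_candidates(bit_levels=[0, 2, 4, 8], target_bits=3.0)
best_p, best_ppl = select_best(candidates, toy_ppl)
```

In practice `toy_ppl` would be replaced by a function that runs the calibration pass for a given P and returns the measured perplexity. Rejection sampling is only one way to impose the constraints and becomes slow when the tolerance is tight.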