GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers

Elias Frantar; Saleh Ashkboos; Torsten Hoefler; Dan Alistarh

arXiv: 2210.17323 · v2 · submitted 2022-10-31 · cs.LG

Pith reviewed 2026-05-10 17:12 UTC · model grok-4.3

keywords: post-training quantization, generative pre-trained transformers, GPT models, model compression, large language models, one-shot quantization, Hessian approximation, inference speedup

The pith

GPTQ quantizes 175 billion parameter GPT models to 3 or 4 bits per weight in about four GPU hours with negligible accuracy loss.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper presents GPTQ, a one-shot weight quantization method for large generative pre-trained transformer models. It relies on approximate second-order information to achieve high accuracy at low bit widths. The method is efficient enough to quantize models as large as 175 billion parameters in roughly four GPU hours. The resulting compression lets such a model run generative inference on a single GPU for the first time and yields end-to-end speedups over FP16 of around 3.25x on NVIDIA A100 GPUs and 4.5x on NVIDIA A6000 GPUs.

Core claim

The authors show that their GPTQ method, based on approximate second-order information, can quantize GPT models with up to 175 billion parameters down to 3 or 4 bits per weight. This process takes approximately four GPU hours and results in negligible accuracy degradation relative to the uncompressed model. The approach more than doubles the compression gains of previous one-shot methods and enables single-GPU inference for these massive models, with observed speedups of 3.25x on A100 GPUs and 4.5x on A6000 GPUs.

What carries the argument

Approximate second-order information, specifically Hessian-based approximations, used to make layer-wise quantization decisions in a one-shot post-training process.
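
For orientation, the layer-wise problem that this kind of second-order quantization works from can be written as below. This is a sketch in notation assumed here rather than taken from this page: W is one layer's weight matrix, X holds that layer's inputs collected from calibration data (one column per sample), and \hat{W} is the quantized weight matrix.

```latex
\[
\hat{W}^{*} \;=\; \arg\min_{\hat{W}} \,\lVert W X - \hat{W} X \rVert_2^2,
\qquad
H \;=\; 2\, X X^{\top}
\]
```

Because the objective is quadratic in each row of \hat{W}, the same Hessian H applies to every row, which is what makes it practical to compute once per layer from the calibration inputs.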

If this is right

175 billion parameter models become runnable for generative inference inside a single GPU.
Accuracy is preserved at 3-4 bit quantization, more than doubling prior compression gains for one-shot methods.
End-to-end inference achieves speedups of approximately 3.25x on high-end GPUs like the NVIDIA A100 and 4.5x on cost-effective ones like the NVIDIA A6000.
Reasonable accuracy holds in extreme cases of 2-bit or ternary quantization.
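
The single-GPU consequence is easiest to see as back-of-the-envelope arithmetic. The sketch below counts weight storage only; it assumes 1 GB = 1e9 bytes and ignores quantization metadata (scales, zero points), activations, and the KV cache, so real footprints are somewhat larger. The function name is illustrative.

```python
def weight_storage_gb(n_params: float, bits_per_weight: float) -> float:
    """Weight-only storage in gigabytes (1 GB = 1e9 bytes).

    Ignores scales/zero points, activations and KV cache (assumption).
    """
    return n_params * bits_per_weight / 8 / 1e9


N_PARAMS = 175e9  # parameter count of the largest model discussed above

for bits in (16, 4, 3, 2):
    gb = weight_storage_gb(N_PARAMS, bits)
    print(f"{bits:>2}-bit weights: {gb:7.1f} GB")
```

Under these assumptions the FP16 weights need 350 GB, while the 3-bit weights need about 65.6 GB, which is below the 80 GB of the larger A100 variant.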

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

GPTQ's efficiency could make quantization standard practice for deploying very large language models on limited hardware.
The one-shot nature suggests the approach may extend to other large transformer families with similar scale challenges.
Observed speedups point to new possibilities for interactive or real-time use of generative models that previously required multiple GPUs.

Load-bearing premise

The Hessian-based approximate second-order information stays sufficiently accurate for all layers in 175B-scale models without accumulating errors that would necessitate retraining.

What would settle it

A test showing large accuracy degradation or much longer quantization time when applying the method to a 175B parameter GPT model would disprove the central performance claims.

Original abstract

Generative Pre-trained Transformer models, known as GPT or OPT, set themselves apart through breakthrough performance across complex language modelling tasks, but also by their extremely high computational and storage costs. Specifically, due to their massive size, even inference for large, highly-accurate GPT models may require multiple performant GPUs, which limits the usability of such models. While there is emerging work on relieving this pressure via model compression, the applicability and performance of existing compression techniques is limited by the scale and complexity of GPT models. In this paper, we address this challenge, and propose GPTQ, a new one-shot weight quantization method based on approximate second-order information, that is both highly-accurate and highly-efficient. Specifically, GPTQ can quantize GPT models with 175 billion parameters in approximately four GPU hours, reducing the bitwidth down to 3 or 4 bits per weight, with negligible accuracy degradation relative to the uncompressed baseline. Our method more than doubles the compression gains relative to previously-proposed one-shot quantization methods, preserving accuracy, allowing us for the first time to execute an 175 billion-parameter model inside a single GPU for generative inference. Moreover, we also show that our method can still provide reasonable accuracy in the extreme quantization regime, in which weights are quantized to 2-bit or even ternary quantization levels. We show experimentally that these improvements can be leveraged for end-to-end inference speedups over FP16, of around 3.25x when using high-end GPUs (NVIDIA A100) and 4.5x when using more cost-effective ones (NVIDIA A6000). The implementation is available at https://github.com/IST-DASLab/gptq.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note

private letter to a colleague

GPTQ shows a practical Hessian-based one-shot method that quantizes 175B models to 3-4 bits in hours with small perplexity loss and real speedups, but the approximation's behavior at scale rests mostly on empirics.

GPTQ gives a workable route to drop 175B-scale GPT models to 3 or 4 bits per weight in roughly four GPU hours while keeping perplexity close to the full-precision baseline. The result is single-GPU inference and measured speedups of 3.25x on A100s and 4.5x on A6000s, which is the part that matters for deployment right now. They also push into 2-bit and ternary regimes with usable accuracy, though those are secondary.

The core technical step is adapting an approximate inverse-Hessian update, computed layer-wise on a small calibration set with Cholesky factorization and error compensation, to decide quantization choices. This is more than a simple extension of older Optimal Brain Surgeon ideas; the combination is tuned for transformer weight matrices at this size and delivers more than double the compression of prior one-shot baselines on the models they test. Releasing the code helps anyone who wants to reproduce or extend it. The experiments cover OPT models up to 175B and report concrete numbers, which is the main strength.

The soft spot is exactly the one the stress test flags: there is no derivation of error bounds on the Hessian approximation, no scaling study for how the Cholesky updates hold when hidden dimensions hit 12k, and limited ablation on calibration-set size versus model depth. The paper shows the method works empirically on the models tried, but if the second-order proxy drifts systematically at larger scales or with different data, the “negligible degradation” claim would need re-checking. That gap is real but not fatal given the results they do provide.

This paper is for people who need to run or serve large generative models under tight memory or latency constraints. Practitioners in efficient inference and quantization researchers will get immediate value from the numbers and the implementation. It shows clear engagement with the existing literature and produces falsifiable claims backed by experiments, so it deserves a serious referee.
I would send it for peer review.
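
To make the "inverse-Hessian update with Cholesky factorization and error compensation" concrete, here is a minimal NumPy sketch of a GPTQ-style pass over one linear layer. It is an illustration under stated assumptions, not the authors' implementation: the function names, the per-row min-max grid, and the 1% diagonal damping are choices made here, and the sketch processes one column at a time without the blocking a real implementation would use for speed.

```python
import numpy as np


def make_grid(W, bits):
    """Per-row asymmetric min-max grid (a simple choice for this sketch)."""
    maxq = 2 ** bits - 1
    wmin = np.minimum(W.min(axis=1), 0.0)
    wmax = np.maximum(W.max(axis=1), 0.0)
    scale = np.maximum(wmax - wmin, 1e-8) / maxq
    zero = np.round(-wmin / scale)
    return scale, zero, maxq


def quantize(w, scale, zero, maxq):
    """Round to the nearest grid point and return the dequantized value."""
    q = np.clip(np.round(w / scale) + zero, 0, maxq)
    return scale * (q - zero)


def rtn_layer(W, bits):
    """Baseline: plain round-to-nearest, no error compensation."""
    scale, zero, maxq = make_grid(W, bits)
    return quantize(W, scale[:, None], zero[:, None], maxq)


def gptq_layer(W, X, bits=4, damp=0.01):
    """W: (d_out, d_in) weights; X: (d_in, n_samples) calibration inputs."""
    W = np.array(W, dtype=np.float64)  # working copy, updated in place
    d_in = W.shape[1]
    scale, zero, maxq = make_grid(W, bits)

    # Layer-wise Hessian of the squared output error, with damping (assumed 1%).
    H = 2.0 * X @ X.T
    H += damp * np.mean(np.diag(H)) * np.eye(d_in)

    # Upper Cholesky factor U of the inverse Hessian: Hinv = U.T @ U.
    Hinv = np.linalg.inv(H)
    Hinv = (Hinv + Hinv.T) / 2.0  # remove tiny asymmetry from inversion
    U = np.linalg.cholesky(Hinv).T

    Q = np.zeros_like(W)
    for i in range(d_in):
        Q[:, i] = quantize(W[:, i], scale, zero, maxq)
        # Scaled quantization error of column i ...
        err = (W[:, i] - Q[:, i]) / U[i, i]
        # ... pushed onto the not-yet-quantized columns (error compensation).
        W[:, i + 1:] -= np.outer(err, U[i, i + 1:])
    return Q


# Synthetic comparison on correlated inputs (random data, illustration only).
rng = np.random.default_rng(0)
W = rng.normal(size=(8, 16))
X = rng.normal(size=(16, 16)) @ rng.normal(size=(16, 256))
for name, Q in (("round-to-nearest", rtn_layer(W, 3)),
                ("GPTQ-style", gptq_layer(W, X, 3))):
    print(f"{name:>16}: output error {np.linalg.norm(W @ X - Q @ X):.3f}")
```

The quantity to compare is the layer output error, not the weight error: compensation deliberately moves the remaining weights away from their original values so that the product with the calibration inputs stays close.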

Referee Report

3 major / 2 minor

Summary. The paper presents GPTQ, a one-shot post-training quantization method for large GPT and OPT models that uses approximate second-order (Hessian) information to quantize weights to 3-4 bits (and even 2-bit/ternary) while claiming negligible accuracy loss relative to FP16 baselines. It reports that a 175B-parameter model can be quantized in ~4 GPU hours, more than doubling prior one-shot compression ratios, enabling single-GPU generative inference, and delivering 3.25-4.5x end-to-end speedups on A100/A6000 hardware.

Significance. If the empirical claims hold, the work would be significant for efficient deployment of large language models: it offers a practical, retraining-free route to 3-4 bit inference on models previously requiring multiple high-end GPUs, with open-source code that could accelerate follow-on research in post-training compression.

major comments (3)

[Method section (quantization procedure and Hessian approximation)] The central claim that the layer-wise approximate inverse-Hessian (computed on a small calibration set via Cholesky updates) remains sufficiently accurate for 175B-scale models without compounding errors lacks any supporting analysis or bounds. No scaling study or error-propagation argument is given for layer dimensions of ~12k or for the depth of a 175B-parameter network.
[Experiments section (OPT-175B results and ablations)] Experimental results on OPT-175B report low perplexity degradation at 3-4 bits, but the manuscript supplies no ablation on calibration-set size versus model scale, no variance across multiple calibration draws or random seeds, and no comparison of Hessian approximation quality at different model widths.
[Introduction and Experiments] The claim of 'negligible accuracy degradation' and 'more than doubles the compression gains' relative to prior one-shot methods rests on the unverified assumption that second-order information stays reliable at this scale; without the missing analysis, the four-GPU-hour practicality claim cannot be generalized beyond the specific empirical runs shown.

minor comments (2)

[Abstract] The abstract states that the method 'more than doubles the compression gains' but does not name the exact prior one-shot baselines or report the precise ratio in the summary paragraph.
[Method] Notation for the per-layer Hessian update and error compensation step could be clarified with an explicit algorithm box or pseudocode to aid reproducibility.

Simulated Author's Rebuttal

3 responses · 1 unresolved

We thank the referee for the constructive feedback and positive assessment of the work's potential significance. We address each major comment point by point below, with proposed revisions to the manuscript where appropriate.

Referee: [Method section (quantization procedure and Hessian approximation)] The central claim that the layer-wise approximate inverse-Hessian (computed on a small calibration set via Cholesky updates) remains sufficiently accurate for 175B-scale models without compounding errors lacks any supporting analysis or bounds. No scaling study or error-propagation argument is given for layer dimensions of ~12k or for the depth of a 175B-parameter network.

Authors: We agree that the manuscript provides no formal theoretical bounds or error-propagation analysis for the layer-wise Hessian approximation. The method is presented as an efficient, practical approximation whose reliability is demonstrated empirically on models up to 175B parameters. Deriving rigorous bounds at this scale is a substantial theoretical undertaking that lies outside the paper's empirical focus. In the revision we will add a short discussion of the approximation's observed stability (based on per-layer quantization error and end-to-end perplexity) together with a scaling plot that compares performance from 1B to 175B models.
revision: partial

Referee: [Experiments section (OPT-175B results and ablations)] Experimental results on OPT-175B report low perplexity degradation at 3-4 bits, but the manuscript supplies no ablation on calibration-set size versus model scale, no variance across multiple calibration draws or random seeds, and no comparison of Hessian approximation quality at different model widths.

Authors: We accept that these ablations are missing and would strengthen the experimental section. The revised manuscript will include: (i) perplexity versus calibration-set size for models of varying scale, (ii) mean and standard deviation of perplexity over at least three independent calibration draws, and (iii) a comparison of per-layer quantization error for layers of different widths. These additions will be placed in the experiments section and the associated appendix.
revision: yes

Referee: [Introduction and Experiments] The claim of 'negligible accuracy degradation' and 'more than doubles the compression gains' relative to prior one-shot methods rests on the unverified assumption that second-order information stays reliable at this scale; without the missing analysis, the four-GPU-hour practicality claim cannot be generalized beyond the specific empirical runs shown.

Authors: The statements in the introduction and experiments are tied directly to the concrete empirical results reported for the tested models (including the 175B OPT model quantized in roughly four GPU hours). We do not assert that the second-order approximation is theoretically guaranteed to remain reliable at arbitrary scales. In the revision we will rephrase the relevant sentences to make the empirical basis explicit and to note that broader generalization remains an open question for future study.
revision: partial

standing simulated objections

not resolved

Deriving formal theoretical bounds or an error-propagation argument for the approximate inverse-Hessian at 175B scale and layer widths of ~12k.

Circularity Check

0 steps flagged

No significant circularity; empirical method with independent validation

The paper presents GPTQ as a one-shot post-training quantization procedure that applies established approximate second-order (Hessian) information to weight quantization. All performance claims (roughly four GPU hours of runtime, 3-4 bit accuracy on 175B models) rest on direct experimental measurements rather than on any derivation that reduces by construction to the method's own fitted quantities or self-citations. No load-bearing step equates a claimed prediction to an input by definition, and the central accuracy result is externally falsifiable via perplexity and downstream metrics on held-out data.
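
Since the falsifiability argument leans on perplexity, a small sketch of how that check could be computed may help. It assumes per-token negative log-likelihoods (natural log) are already available from an evaluation run; the function names and the numbers in the usage lines are hypothetical and are not results from the paper.

```python
import math


def perplexity(token_nlls):
    """Perplexity = exp(mean per-token negative log-likelihood, natural log)."""
    return math.exp(sum(token_nlls) / len(token_nlls))


def relative_degradation(ppl_baseline, ppl_quantized):
    """Fractional perplexity increase of the quantized model over the baseline."""
    return (ppl_quantized - ppl_baseline) / ppl_baseline


# Hypothetical per-token losses for a baseline and a quantized model.
baseline_nlls = [2.10, 2.30, 2.25, 2.15]
quantized_nlls = [2.14, 2.33, 2.29, 2.20]

ppl_fp16 = perplexity(baseline_nlls)
ppl_quant = perplexity(quantized_nlls)
print(f"baseline perplexity:  {ppl_fp16:.3f}")
print(f"quantized perplexity: {ppl_quant:.3f}")
print(f"relative degradation: {relative_degradation(ppl_fp16, ppl_quant):.2%}")
```

Both models must be scored on the same held-out tokens for the comparison to mean anything; what counts as "negligible" degradation is a judgment the sketch does not encode.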

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the effectiveness of approximate second-order information for one-shot quantization at extreme scale; no free parameters or invented entities are explicitly introduced in the abstract.

axioms (1)

domain assumption: Approximate second-order information suffices to select quantization values that preserve model accuracy at 3-4 bits without retraining.
This assumption underpins the entire one-shot procedure described.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

Cost.FunctionalEquation washburn_uniqueness_aczel (relation: unclear)
Paper passage: GPTQ... based on approximate second-order information... quantize GPT models with 175 billion parameters in approximately four GPU hours, reducing the bitwidth down to 3 or 4 bits per weight, with negligible accuracy degradation

Foundation.AlphaCoordinateFixation costAlphaLog_fourth_deriv_at_zero (relation: unclear)
Paper passage: The approximate second-order information (Hessian-based) remains sufficiently accurate for guiding quantization decisions across all layers of 175B-scale models

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 60 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

HodgeCover: Higher-Order Topological Coverage Drives Compression of Sparse Mixture-of-Experts
cs.LG 2026-05 unverdicted novelty 8.0

HodgeCover isolates the harmonic kernel of a simplicial Laplacian on an expert 2-complex to identify irreducible merge cycles and selects experts for aggressive compression, matching or exceeding baselines on open-wei...
AIGaitor: Privacy-preserving and cloud-free motion analysis for everyone, using edge computing
cs.CV 2026-05 unverdicted novelty 7.0

The paper presents AIGaitor, a privacy-preserving on-device monocular motion analysis system that performs end-to-end pose estimation and deep learning gait analysis on consumer smartphones.
Llamas on the Web: Memory-Efficient, Performance-Portable, and Multi-Precision LLM Inference with WebGPU
cs.DC 2026-05 conditional novelty 7.0

LlamaWeb is a WebGPU backend for llama.cpp that uses static memory planning, tunable kernels, and templated multi-precision support to cut memory use by 29-33% and raise decode throughput by 45-69% versus prior browse...
LongLive-2.0: An NVFP4 Parallel Infrastructure for Long Video Generation
cs.CV 2026-05 unverdicted novelty 7.0

LongLive-2.0 delivers an NVFP4 parallel infrastructure that enables direct training of long multi-shot autoregressive diffusion video models and achieves up to 2.15x training and 1.84x inference speedups on Blackwell ...
When Bits Break Recourse: Counterfactual-Faithful Quantization
cs.LG 2026-05 unverdicted novelty 7.0

CFQ trains quantizer parameters and mixed-precision allocation to preserve counterfactual recourse validity, cost, and direction on Adult, German Credit, and COMPAS while matching accuracy of standard quantizers.
Not All Tasks Quantize Equally: Fisher-Guided Quantization for Visual Geometry Transformer
cs.CV 2026-05 unverdicted novelty 7.0

Fisher-Guided Quantization uses the diagonal Fisher information matrix to measure and protect task-, block-, and channel-specific sensitivities during post-training quantization of multi-task 3D transformers, yielding...
From Text to Voice: A Reproducible and Verifiable Framework for Evaluating Tool Calling LLM Agents
cs.CL 2026-05 unverdicted novelty 7.0

A dataset-agnostic framework converts text tool-calling benchmarks to paired audio versions via TTS and noise, showing model-dependent performance with small text-to-voice gaps of 1.8-4.8 points on Confetti and When2Call.
Knowledge Beyond Language: Bridging the Gap in Multilingual Machine Unlearning Evaluation
cs.CL 2026-05 unverdicted novelty 7.0

New metrics KSS and KPS are introduced to evaluate multilingual machine unlearning quality and cross-language consistency in LLMs, addressing limitations of single-language evaluation protocols.
Multi-Scale Dequant: Eliminating Dequantization Bottleneck via Activation Decomposition for Efficient LLM Inference
stat.ML 2026-05 unverdicted novelty 7.0

MSD eliminates dequantization from the GEMM path by decomposing BF16 activations into multiple low-precision parts that multiply directly with INT8 or MXFP4 weights, achieving near-16 effective bits for INT8 and 6.6 f...
LoopQ: Quantization for Recursive Transformers
cs.LG 2026-05 unverdicted novelty 7.0

LoopQ provides a loop-aware PTQ framework for recursive Transformers that mitigates distribution shift, state reuse, and recursive error accumulation, yielding 68.8% higher average accuracy and 87.7% lower perplexity ...
Layer Collapse in Diffusion Language Models
cs.LG 2026-05 conditional novelty 7.0

Early layers in diffusion language models like LLaDA-8B collapse into redundant representations around a critical super-outlier activation due to overtraining, making them more robust to quantization and sparsity than...
Layer Collapse in Diffusion Language Models
cs.LG 2026-05 unverdicted novelty 7.0

Diffusion language models develop early-layer collapse around an indispensable super-outlier due to overtraining, resulting in higher compressibility and reversed optimal sparsity patterns versus autoregressive models.
When Quantization Is Free: An int4 KV Cache That Outruns fp16 on Apple Silicon
cs.PF 2026-05 unverdicted novelty 7.0

A single fused int4 KV cache kernel on Apple Silicon outperforms fp16 in latency with 3x memory compression and near-zero quality loss on tested models.
Quantamination: Dynamic Quantization Leaks Your Data Across the Batch
cs.CR 2026-04 conditional novelty 7.0

Dynamic quantization creates side channels allowing partial or full recovery of other users' batched data in at least four popular ML frameworks.
Variance Is Not Importance: Structural Analysis of Transformer Compressibility Across Model Scales
cs.LG 2026-04 unverdicted novelty 7.0

High-variance activation directions are uncorrelated with predictions, transformer blocks grow more linear with depth, and single-block linear replacement yields 34x compression on Mistral's final block at a 1.71 perp...
When W4A4 Breaks Camouflaged Object Detection: Token-Group Dual-Constraint Activation Quantization
cs.CV 2026-04 unverdicted novelty 7.0

COD-TDQ uses token-group scaling and dual-constraint projection to fix 4-bit activation quantization for camouflaged object detection, delivering more than 0.12 higher Sα scores than prior methods on four benchmarks w...
When Flat Minima Fail: Characterizing INT4 Quantization Collapse After FP32 Convergence
cs.LG 2026-04 conditional novelty 7.0

FP32-converged language models enter a post-convergence phase where INT4 quantization error explodes while FP32 perplexity remains stable, with onset tied to fine convergence rather than learning rate decay.
Hidden Reliability Risks in Large Language Models: Systematic Identification of Precision-Induced Output Disagreements
cs.AI 2026-04 unverdicted novelty 7.0

PrecisionDiff is a differential testing framework that uncovers widespread precision-induced behavioral disagreements in aligned LLMs, including safety-critical jailbreak divergences across precision formats.
ENEC: A Lossless AI Model Compression Method Enabling Fast Inference on Ascend NPUs
cs.AR 2026-03 unverdicted novelty 7.0

ENEC delivers 3.43X higher throughput than DietGPU and 1.12X better compression ratio than nvCOMP for lossless model weight compression on Ascend NPUs, yielding up to 6.3X end-to-end inference speedup.
Why Smaller Is Slower? Dimensional Misalignment in Compressed LLMs
cs.DC 2026-03 unverdicted novelty 7.0

Dimensional misalignment slows compressed LLMs on GPUs; GAC uses knapsack optimization to achieve full alignment and up to 1.5x speedup on Llama-3-8B while preserving quality.
QuantVLA: Scale-Calibrated Post-Training Quantization for Vision-Language-Action Models
cs.LG 2026-02 unverdicted novelty 7.0

QuantVLA is the first post-training quantization framework for VLA models that quantizes the diffusion transformer action head and reports higher task success rates than full-precision baselines with roughly 70% memor...
CORP: Closed-Form One-shot Representation-Preserving Structured Pruning for Transformers
cs.LG 2026-02 unverdicted novelty 7.0

CORP performs one-shot structured pruning of Transformers by modeling removed components as affine functions of retained ones and solving closed-form ridge regressions on calibration data to fold compensation into wei...
Vec-LUT: Vector Table Lookup for Parallel Ultra-Low-Bit LLM Inference on Edge Devices
cs.DC 2025-12 conditional novelty 7.0

Vec-LUT delivers up to 4.2x speedup over prior LUT methods for parallel ultra-low-bit LLM inference on edge devices by unifying lookups across tokens and adding cache-aware tensor layouts.
Four Over Six: More Accurate NVFP4 Quantization with Adaptive Block Scaling
cs.CL 2025-12 conditional novelty 7.0

Four Over Six adaptively scales blocks in NVFP4 quantization to smaller FP4 values, making representable value distributions more uniform and reducing quantization error especially for near-maximal values.
Decomposed Trust: Privacy, Adversarial Robustness, Ethics, and Fairness in Low-Rank LLMs
cs.LG 2025-11 unverdicted novelty 7.0

Low-rank compression preserves training-data privacy and improves adversarial robustness but weakens personal-information protection, reduces ethical behavior in zero-shot use, and harms fairness.
SpinQuant: LLM quantization with learned rotations
cs.LG 2024-05 conditional novelty 7.0

SpinQuant learns optimal rotations to enable accurate 4-bit quantization of LLM weights, activations, and KV cache, reducing the zero-shot gap to full precision to 2.9 points on LLaMA-2 7B.
Medusa: Simple LLM Inference Acceleration Framework with Multiple Decoding Heads
cs.LG 2024-01 conditional novelty 7.0

Medusa augments LLMs with multiple decoding heads and tree-based attention to predict and verify several tokens in parallel, yielding 2.2-3.6x inference speedup via two fine-tuning regimes.
QLoRA: Efficient Finetuning of Quantized LLMs
cs.LG 2023-05 conditional novelty 7.0

QLoRA finetunes 4-bit quantized LLMs via LoRA adapters to match full-precision performance while using far less memory, enabling 65B-scale training on single GPUs and producing Guanaco models near ChatGPT level.
LLMs as Noisy Channels: A Shannon Perspective on Model Capacity and Scaling Laws
cs.LG 2026-05 unverdicted novelty 6.0

The Shannon Scaling Law treats LLM training as noisy-channel transmission and predicts U-shaped performance degradation when signal-to-noise ratio falls below a threshold, outperforming monotonic scaling laws on Pythi...
ThriftAttention: Selective Mixed Precision for Long-Context FP4 Attention
cs.LG 2026-05 unverdicted novelty 6.0

ThriftAttention recovers 89.1% of the FP16 quality gap versus pure FP4 attention by running only 5% of query-key blocks in FP16 on long-context benchmarks.
GEMQ: Global Expert-Level Mixed-Precision Quantization for MoE LLMs
cs.LG 2026-05 unverdicted novelty 6.0

GEMQ applies global LP-based expert importance estimation and router fine-tuning within progressive quantization to cut memory and speed inference in MoE LLMs with little accuracy loss.
Breaking Modality Heterogeneity in Low-Bit Quantization for Large Vision-Language Models
cs.CV 2026-05 unverdicted novelty 6.0

SplitQ improves low-bit PTQ for VLMs by isolating modality-specific outlier channels via MOCD and applying dual-branch adaptive calibration via ACC, outperforming prior methods on six datasets across W4A8 to W3A2 settings.
OScaR: The Occam's Razor for Extreme KV Cache Quantization in LLMs and Beyond
cs.LG 2026-05 unverdicted novelty 6.0

OScaR mitigates token norm imbalance via canalized rotation and omni-token scaling to enable near-lossless INT2 KV cache quantization with up to 3x decoding speedup and 5.3x memory reduction.
A Geometric Analysis of Sign-Magnitude Asymmetry in a ReLU + RMSNorm Block under Ternary Quantization
cs.LG 2026-05 unverdicted novelty 6.0

Sign-flip perturbations produce π/(π-2) ≈ 2.75 times more transverse output energy than equal-norm sign-preserving perturbations in a ReLU + RMSNorm block because ReLU creates directional asymmetry that RMSNorm's tran...
Dynamic Model Merging Made Slim
cs.LG 2026-05 unverdicted novelty 6.0

DiDi-Merging achieves dynamic model merging performance matching or exceeding prior methods while using only 1.24x to 1.4x the parameters of a single fine-tuned model.
OpenJarvis: Personal AI, On Personal Devices
cs.LG 2026-05 unverdicted novelty 6.0

OpenJarvis decomposes personal AI into Intelligence, Engine, Agents, Tools & Memory, and Learning primitives and applies LLM-guided spec search to produce on-device configurations that reach within 3.2 pp of cloud bas...
Not All Tasks Quantize Equally: Fisher-Guided Quantization for Visual Geometry Transformer
cs.CV 2026-05 unverdicted novelty 6.0

FGQ applies diagonal Fisher information to guide learnable affine transformations in PTQ for multi-task VGGT, yielding up to 39% relative gains over baselines at 4-bit quantization.
From Text to Voice: A Reproducible and Verifiable Framework for Evaluating Tool Calling LLM Agents
cs.CL 2026-05 unverdicted novelty 6.0

A dataset-agnostic framework converts text tool-calling benchmarks to paired audio evaluations via TTS, speaker variation and noise, then evaluates seven omni-modal models showing model- and task-dependent performance...
Nonlinear Bipolar Compensation: Handling Outliers in Post-Training Quantization
cs.CV 2026-05 unverdicted novelty 6.0

Nonlinear Bipolar Compensation with Bipolar Logarithmic Transformation reduces outlier effects in post-training quantization by performing compensation in a compressed transformed space.
Search Your Block Floating Point Scales!
cs.LG 2026-05 unverdicted novelty 6.0

ScaleSearch optimizes block floating point scales via fine-grained search to cut quantization error by 27% for NVFP4, improving PTQ by up to 15 points on MATH500 for Qwen3-8B and attention PPL by 0.77 on Llama 3.1 70B.
ADMM-Q: An Improved Hessian-based Weight Quantizer for Post-Training Quantization of Large Language Models
cs.LG 2026-05 unverdicted novelty 6.0

ADMM-Q is a new post-training quantization method using ADMM operator splitting that reduces WikiText-2 perplexity compared to GPTQ on Qwen3-8B across W3A16, W4A8, and W2A4KV4 settings.
Theory-optimal Quantization Based on Flatness
cs.LG 2026-05 unverdicted novelty 6.0

The paper introduces the Flatness metric, derives a theory-optimal quantization solution, and presents BDQ that uses bidirectional diagonal transformations to reduce outlier impact, achieving under 1% drop at W4A4 on ...
Fitting Is Not Enough: Smoothness in Extremely Quantized LLMs
cs.CL 2026-05 unverdicted novelty 6.0

Extremely quantized LLMs degrade in smoothness, sparsifying the decoding tree and hurting generation quality; a smoothness-preserving principle delivers gains beyond numerical fitting.
Different Prompts, Different Ranks: Prompt-aware Dynamic Rank Selection for SVD-based LLM Compression
cs.LG 2026-05 unverdicted novelty 6.0

PARSE trains a prompt-aware linear router on dense-model outputs to select dynamic SVD ranks, improving accuracy up to 10% at 0.6 compression ratio on LLaMA-7B while delivering 2.5x prefill and 2.4x decode speedups.
XtraMAC: An Efficient MAC Architecture for Mixed-Precision LLM Inference on FPGA
cs.AR 2026-05 conditional novelty 6.0

XtraMAC unifies mixed-precision MAC on FPGA via shared integer mantissa products, delivering 1.4-2.0x higher compute density and up to 1.9x better energy efficiency.
OSAQ: Outlier Self-Absorption for Accurate Low-bit LLM Quantization
cs.LG 2026-05 unverdicted novelty 6.0

OSAQ suppresses weight outliers in LLMs via a closed-form additive transformation from the Hessian's stable null space, improving 2-bit quantization perplexity by over 40% versus vanilla GPTQ with no inference overhead.
OSAQ: Outlier Self-Absorption for Accurate Low-bit LLM Quantization
cs.LG 2026-05 unverdicted novelty 6.0

OSAQ uses the low-rank structure of the Hessian to construct a closed-form additive weight transformation that suppresses outliers without changing task loss, enabling better low-bit LLM quantization.
Compress Then Adapt? No, Do It Together via Task-aware Union of Subspaces
cs.AI 2026-05 unverdicted novelty 6.0

JACTUS unifies low-rank compression and task adaptation via a task-aware union of subspaces and global rank allocation by marginal gain, outperforming 100% PEFT methods like DoRA on ViT-Base (89.2% avg) and Llama2-7B ...
WindowQuant: Mixed-Precision KV Cache Quantization based on Window-Level Similarity for VLMs Inference Optimization
cs.CV 2026-05 unverdicted novelty 6.0

WindowQuant performs window-adaptive mixed-precision KV cache quantization guided by similarity to the text prompt, with reordering to enable efficient inference in VLMs.
Sharpness-Aware Pretraining Mitigates Catastrophic Forgetting
cs.LG 2026-05 unverdicted novelty 6.0

Sharpness-aware pretraining and related flat-minima interventions reduce catastrophic forgetting by up to 80% after post-training across 20M-150M models and by 31-40% at 1B scale.
Technical Report: Activation Residual Hessian Quantization (ARHQ) for Low-Bit LLM Quantization
cs.LG 2026-04 unverdicted novelty 6.0

ARHQ isolates error-sensitive weight directions in LLMs via truncated SVD on the scaled matrix W G_x^{1/2} from activation residuals, improving SNR and preserving performance under aggressive low-bit quantization.
CoQuant: Joint Weight-Activation Subspace Projection for Mixed-Precision LLMs
cs.LG 2026-04 unverdicted novelty 6.0

CoQuant selects optimal high-precision subspaces for mixed-precision LLM quantization via a closed-form weighted PCA that balances weight and activation covariances derived from expected output error.
BitRL: Reinforcement Learning with 1-bit Quantized Language Models for Resource-Constrained Edge Deployment
cs.LG 2026-04 unverdicted novelty 6.0

BitRL enables on-device RL agents via 1-bit quantized language models, delivering 10-16x memory reduction and 3-5x energy efficiency gains with 85-98% retained performance.
FASQ: Flexible Accelerated Subspace Quantization for Calibration-Free LLM Compression
cs.LG 2026-04 unverdicted novelty 6.0

FASQ delivers calibration-free LLM compression with continuous size trade-offs via product quantization and custom CUDA kernels that accelerate decode beyond FP16 speeds on consumer hardware.
MCAP: Deployment-Time Layer Profiling for Memory-Constrained LLM Inference
cs.LG 2026-04 unverdicted novelty 6.0

MCAP uses load-time Monte Carlo profiling to estimate layer importance, enabling dynamic quantization (W4A8 vs W4A16) and memory tiering (GPU/RAM/SSD) that delivers 1.5-1.8x higher decode throughput than llama-cpp Q4_...
From Signal Degradation to Computation Collapse: Uncovering the Two Failure Modes of LLM Quantization
cs.CL 2026-04 unverdicted novelty 6.0

LLM 2-bit quantization fails via either cumulative signal degradation or early computation collapse in key components.
GRASPrune: Global Gating for Budgeted Structured Pruning of Large Language Models
cs.AI 2026-04 unverdicted novelty 6.0

GRASPrune removes 50% of parameters from LLaMA-2-7B via global gating and projected straight-through estimation, reaching 12.18 WikiText-2 perplexity and competitive zero-shot accuracy after four epochs on 512 calibra...
Are Large Language Models Economically Viable for Industry Deployment?
cs.CL 2026-04 unverdicted novelty 6.0

Small LLMs under 2B parameters achieve better economic break-even, energy efficiency, and hardware density than larger models on legacy GPUs for industrial tasks.
LBLLM: Lightweight Binarization of Large Language Models via Three-Stage Distillation
cs.LG 2026-04 unverdicted novelty 6.0

LBLLM achieves better accuracy than prior binarization methods for LLMs by decoupling weight and activation quantization through initialization, layer-wise distillation, and learnable activation scaling.
SAW-INT4: System-Aware 4-Bit KV-Cache Quantization for Real-World LLM Serving
cs.LG 2026-04 unverdicted novelty 6.0

Token-wise INT4 KV-cache quantization plus block-diagonal Hadamard rotation recovers nearly all accuracy lost by naive INT4 while adding zero end-to-end overhead under paged serving constraints.

Reference graph

Works this paper leans on

20 extracted references · 20 canonical work pages · cited by 148 Pith papers · 5 internal anchors

[1]

On the dangers of stochastic parrots: Can language models be too big?

Emily M Bender, Timnit Gebru, Angelina McMillan-Major, and Shmargaret Shmitchell. On the dangers of stochastic parrots: Can language models be too big? In 2021 ACM Conference on Fairness, Accountability, and Transparency,

work page 2021
[2]

A Systematic Classification of Knowledge, Reasoning, and Context within the ARC Dataset

Michael Boratko, Harshit Padigela, Divyendra Mikkilineni, Pritish Yuvraj, Rajarshi Das, Andrew McCallum, Maria Chang, Achille Fokoue-Nkoutche, Pavan Kapanipathi, Nicholas Mattei, et al. A systematic classification of knowledge, reasoning, and context within the ARC dataset. arXiv preprint arXiv:1806.00358,

work page Pith review arXiv
[3]

FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness

Tri Dao, Daniel Y Fu, Stefano Ermon, Atri Rudra, and Christopher Ré. FlashAttention: Fast and memory-efficient exact attention with IO-awareness. arXiv preprint arXiv:2205.14135,

work page internal anchor Pith review arXiv
[4]

LLM.int8(): 8-bit Matrix Multiplication for Transformers at Scale

10 Published as a conference paper at ICLR 2023 Tim Dettmers, Mike Lewis, Younes Belkada, and Luke Zettlemoyer. LLM.int8(): 8-bit matrix multiplication for transformers at scale. arXiv preprint arXiv:2208.07339,

work page internal anchor Pith review arXiv 2023
[5]

Optimal Brain Compression: A framework for accurate post-training quantization and pruning

Elias Frantar, Sidak Pal Singh, and Dan Alistarh. Optimal Brain Compression: A framework for accurate post-training quantization and pruning. arXiv preprint arXiv:2208.11580, Accepted to NeurIPS 2022, to appear.

work page arXiv
[6]

A survey of quantization methods for efficient neural network inference

Amir Gholami, Sehoon Kim, Zhen Dong, Zhewei Yao, Michael W Mahoney, and Kurt Keutzer. A survey of quantization methods for efficient neural network inference. arXiv preprint arXiv:2103.13630,

work page arXiv 2022
[7]

Sparsity in deep learning: Pruning and growth for efficient inference and training in neural networks

Torsten Hoefler, Dan Alistarh, Tal Ben-Nun, Nikoli Dryden, and Alexandra Peste. Sparsity in deep learning: Pruning and growth for efficient inference and training in neural networks. arXiv preprint arXiv:2102.00554,

work page arXiv
[8]

Improving post training neural quantization: Layer-wise calibration and integer programming

Itay Hubara, Yury Nahshan, Yair Hanani, Ron Banner, and Daniel Soudry. Improving post training neural quantization: Layer-wise calibration and integer programming. arXiv preprint arXiv:2006.10518,

work page arXiv 2006
[9]

The Penn Treebank: Annotating predicate argument structure

Mitch Marcus, Grace Kim, Mary Ann Marcinkiewicz, Robert MacIntyre, Ann Bies, Mark Ferguson, Karen Katz, and Britta Schasberger. The Penn Treebank: Annotating predicate argument structure. In Human Language Technology: Proceedings of a Workshop held at Plainsboro, New Jersey, March 8-11, 1994,

work page 1994
[10]

Pointer Sentinel Mixture Models

Stephen Merity, Caiming Xiong, James Bradbury, and Richard Socher. Pointer sentinel mixture models. arXiv preprint arXiv:1609.07843,

work page internal anchor Pith review arXiv
[11]

A White Paper on Neural Network Quantization

Markus Nagel, Marios Fournarakis, Rana Ali Amjad, Yelysei Bondarenko, Mart van Baalen, and Tijmen Blankevoort. A white paper on neural network quantization. arXiv preprint arXiv:2106.08295,

work page internal anchor Pith review arXiv
[12]

The LAMBADA dataset: Word prediction requiring a broad discourse context

Denis Paperno, Germán Kruszewski, Angeliki Lazaridou, Quan Ngoc Pham, Raffaella Bernardi, Sandro Pezzelle, Marco Baroni, Gemma Boleda, and Raquel Fernández. The LAMBADA dataset: Word prediction requiring a broad discourse context. arXiv preprint arXiv:1606.06031,

work page Pith review arXiv
[13]

nuQmm: Quantized matmul for efficient inference of large-scale generative language models

Gunho Park, Baeseong Park, Se Jung Kwon, Byeongwook Kim, Youngjoo Lee, and Dongsoo Lee. nuQmm: Quantized matmul for efficient inference of large-scale generative language models. arXiv preprint arXiv:2206.09557,

work page arXiv 2023
[14]

Extreme compression for pre-trained transformers made simple and efficient

Xiaoxia Wu, Zhewei Yao, Minjia Zhang, Conglong Li, and Yuxiong He. Extreme compression for pre-trained transformers made simple and efficient. arXiv preprint arXiv:2206.01859,

work page arXiv
[15]

ZeroQuant: Efficient and affordable post-training quantization for large-scale transformers

Zhewei Yao, Reza Yazdani Aminabadi, Minjia Zhang, Xiaoxia Wu, Conglong Li, and Yuxiong He. ZeroQuant: Efficient and affordable post-training quantization for large-scale transformers. arXiv preprint arXiv:2206.01861,

work page arXiv
[16]

OPT: Open Pre-trained Transformer Language Models

Susan Zhang, Stephen Roller, Naman Goyal, Mikel Artetxe, Moya Chen, Shuohui Chen, Christopher Dewan, Mona Diab, Xian Li, Xi Victoria Lin, et al. OPT: Open pre-trained transformer language models. arXiv preprint arXiv:2205.01068,

work page internal anchor Pith review arXiv
[17]

Alpa: Automating inter- and intra-operator parallelism for distributed deep learning

Lianmin Zheng, Zhuohan Li, Hao Zhang, Yonghao Zhuang, Zhifeng Chen, Yanping Huang, Yida Wang, Yuanzhong Xu, Danyang Zhuo, Joseph E Gonzalez, et al. Alpa: Automating inter- and intra-operator parallelism for distributed deep learning. arXiv preprint arXiv:2201.12023,

work page arXiv
[18]

(2016) and OPT-125M/WikiText2, which is one of the largest models to which OBQ can be reasonably applied

A APPENDIX A.1 ADDITIONAL COMPARISON WITH OBQ We now provide an additional comparison between GPTQ and OBQ on BERT-base/SQuAD Rajpurkar et al. (2016) and OPT-125M/WikiText2, which is one of the largest models to which OBQ can be reasonably applied. Method BERT-base OPT-125M 88.53 F1↑ 27.66 PPL↓ 4bit 3bit...

work page 2023
[19]

For non-generative and large-batch applications, operations may be compute- rather than memory-bound and our kernels thus not directly applicable

where the underlying (close to) matrix-vector products are memory-bound. For non-generative and large-batch applications, operations may be compute- rather than memory-bound and our kernels thus not directly applicable. Instead, one could simply decompress the matrix before performing the corresponding matrix-matrix calculations: this takes < 1.5ms on an...

work page 2022
[20]

We note that the calibration data used by GPTQ is sampled from the C4 training set; this task is thus not fully zero-shot

GPTQ 3 35.78 28.83 25.34 21.25 17.67 12.27 Table 12: BLOOM perplexity results for C4. We note that the calibration data used by GPTQ is sampled from the C4 training set; this task is thus not fully zero-shot. A.4 ADDITIONAL ZEROSHOT RESULTS This section contains additional results for zero-shot tasks. OPT B...

work page 2023
