GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers
Pith reviewed 2026-05-10 17:12 UTC · model grok-4.3
The pith
GPTQ quantizes 175 billion parameter GPT models to 3 or 4 bits per weight in about four GPU hours with negligible accuracy loss.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The authors show that GPTQ, a one-shot weight quantization method based on approximate second-order information, can quantize GPT models with up to 175 billion parameters down to 3 or 4 bits per weight. The process takes approximately four GPU hours and causes negligible accuracy degradation relative to the uncompressed baseline. The approach more than doubles the compression gains of previous one-shot methods and makes single-GPU generative inference possible for a 175-billion-parameter model, with end-to-end speedups over FP16 of around 3.25x on NVIDIA A100 GPUs and 4.5x on NVIDIA A6000 GPUs.
What carries the argument
Approximate second-order information, specifically Hessian-based approximations, used to make layer-wise quantization decisions in a one-shot post-training process.
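For orientation, the layer-wise problem and the per-weight update that this style of second-order quantization builds on can be written roughly as below. This is a sketch in the notation commonly used in the Optimal Brain Quantization line of work, not a quotation from the paper: the symbols W (layer weights), X (calibration inputs to the layer), H (layer Hessian), F (set of not-yet-quantized weights in a row) and quant(·) (rounding to the quantization grid) are introduced here for illustration.

```latex
% Layer-wise objective: match the layer's output on calibration inputs X
\hat{W} \;=\; \arg\min_{\hat{W}} \;\bigl\lVert W X - \hat{W} X \bigr\rVert_2^2,
\qquad H \;=\; 2\, X X^{\top}

% After quantizing weight w_q, the remaining weights F are shifted to compensate
\delta_F \;=\; -\,\frac{w_q - \mathrm{quant}(w_q)}{\bigl[H_F^{-1}\bigr]_{qq}}
\,\bigl(H_F^{-1}\bigr)_{:,q}

% The inverse Hessian is then updated by eliminating index q
H_{-q}^{-1} \;=\; \Bigl( H^{-1} - \frac{1}{\bigl[H^{-1}\bigr]_{qq}}\,
H^{-1}_{:,q}\, H^{-1}_{q,:} \Bigr)_{-q}
```

The Hessian here depends only on the layer inputs, not on the weights, which is what makes a one-shot, retraining-free procedure on a small calibration set plausible.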
If this is right
- 175-billion-parameter models become runnable for generative inference on a single GPU.
- Accuracy is preserved at 3-4 bit quantization, more than doubling prior compression gains for one-shot methods.
- End-to-end inference speeds up over FP16 by approximately 3.25x on high-end GPUs such as the NVIDIA A100 and 4.5x on more cost-effective ones such as the NVIDIA A6000.
- Reasonable accuracy holds in extreme cases of 2-bit or ternary quantization.
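A quick back-of-the-envelope calculation shows why the bit width matters for the single-GPU claim. The sketch below counts weight storage only; it ignores quantization metadata (scales and zero points), activations, and the KV cache, and the 80 GB capacity is an assumption made here for illustration (the summary above does not state a memory size).

```python
def weight_memory_gb(num_params: float, bits_per_weight: float) -> float:
    """Memory for the weights alone, in decimal gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bits_per_weight / 8 / 1e9


def fits(num_params: float, bits_per_weight: float, gpu_memory_gb: float) -> bool:
    """True if the weights alone fit in the given GPU memory budget."""
    return weight_memory_gb(num_params, bits_per_weight) <= gpu_memory_gb


NUM_PARAMS = 175e9       # model size taken from the claims above
ASSUMED_GPU_GB = 80.0    # assumption: one GPU with 80 GB of memory

for bits in (16, 4, 3):
    size = weight_memory_gb(NUM_PARAMS, bits)
    ok = fits(NUM_PARAMS, bits, ASSUMED_GPU_GB)
    print(f"{bits:>2} bits/weight: {size:6.1f} GB, fits in {ASSUMED_GPU_GB:.0f} GB: {ok}")
```

Under these assumptions, FP16 weights need 350 GB, 4-bit weights 87.5 GB, and 3-bit weights about 65.6 GB, so only the 3-bit model leaves headroom on an 80 GB device.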
Where Pith is reading between the lines
- GPTQ's efficiency could make quantization standard practice for deploying very large language models on limited hardware.
- The one-shot nature suggests the approach may extend to other large transformer families with similar scale challenges.
- Observed speedups point to new possibilities for interactive or real-time use of generative models that previously required multiple GPUs.
Load-bearing premise
The Hessian-based approximate second-order information stays sufficiently accurate for all layers in 175B-scale models without accumulating errors that would necessitate retraining.
What would settle it
A test showing large accuracy degradation or much longer quantization time when applying the method to a 175B parameter GPT model would disprove the central performance claims.
Original abstract
Generative Pre-trained Transformer models, known as GPT or OPT, set themselves apart through breakthrough performance across complex language modelling tasks, but also by their extremely high computational and storage costs. Specifically, due to their massive size, even inference for large, highly-accurate GPT models may require multiple performant GPUs, which limits the usability of such models. While there is emerging work on relieving this pressure via model compression, the applicability and performance of existing compression techniques is limited by the scale and complexity of GPT models. In this paper, we address this challenge, and propose GPTQ, a new one-shot weight quantization method based on approximate second-order information, that is both highly-accurate and highly-efficient. Specifically, GPTQ can quantize GPT models with 175 billion parameters in approximately four GPU hours, reducing the bitwidth down to 3 or 4 bits per weight, with negligible accuracy degradation relative to the uncompressed baseline. Our method more than doubles the compression gains relative to previously-proposed one-shot quantization methods, preserving accuracy, allowing us for the first time to execute an 175 billion-parameter model inside a single GPU for generative inference. Moreover, we also show that our method can still provide reasonable accuracy in the extreme quantization regime, in which weights are quantized to 2-bit or even ternary quantization levels. We show experimentally that these improvements can be leveraged for end-to-end inference speedups over FP16, of around 3.25x when using high-end GPUs (NVIDIA A100) and 4.5x when using more cost-effective ones (NVIDIA A6000). The implementation is available at https://github.com/IST-DASLab/gptq.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper presents GPTQ, a one-shot post-training quantization method for large GPT and OPT models that uses approximate second-order (Hessian) information to quantize weights to 3-4 bits (and even 2-bit/ternary) while claiming negligible accuracy loss relative to FP16 baselines. It reports that a 175B-parameter model can be quantized in ~4 GPU hours, more than doubling prior one-shot compression ratios, enabling single-GPU generative inference, and delivering 3.25-4.5x end-to-end speedups on A100/A6000 hardware.
Significance. If the empirical claims hold, the work would be significant for efficient deployment of large language models: it offers a practical, retraining-free route to 3-4 bit inference on models previously requiring multiple high-end GPUs, with open-source code that could accelerate follow-on research in post-training compression.
major comments (3)
- [Method section (quantization procedure and Hessian approximation)] The central claim that the layer-wise approximate inverse Hessian (computed on a small calibration set via Cholesky updates) remains sufficiently accurate for 175B-scale models without compounding errors lacks any supporting analysis or bounds. No scaling study or error-propagation argument is given for layer dimensions of ~12k or for the depth of a 175B-parameter network.
- [Experiments section (OPT-175B results and ablations)] Experimental results on OPT-175B report low perplexity degradation at 3-4 bits, but the manuscript supplies no ablation on calibration-set size versus model scale, no variance across multiple calibration draws or random seeds, and no comparison of Hessian approximation quality at different model widths.
- [Introduction and Experiments] The claim of 'negligible accuracy degradation' and 'more than doubles the compression gains' relative to prior one-shot methods rests on the unverified assumption that second-order information stays reliable at this scale; without the missing analysis, the four-GPU-hour practicality claim cannot be generalized beyond the specific empirical runs shown.
minor comments (2)
- [Abstract] The abstract states that the method 'more than doubles the compression gains' but does not name the exact prior one-shot baselines or report the precise ratio in the summary paragraph.
- [Method] Notation for the per-layer Hessian update and error compensation step could be clarified with an explicit algorithm box or pseudocode to aid reproducibility.
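The per-layer update and error-compensation step that the last minor comment asks to see spelled out can be sketched roughly as follows. This is a simplified NumPy illustration of column-by-column quantization with second-order error compensation, not the authors' implementation: it inverts the Hessian directly instead of using a Cholesky-based formulation, processes one column at a time without batching, and uses a per-row min-max grid. All function names, the `damp` parameter, and the toy shapes are hypothetical.

```python
import numpy as np


def make_grid(W, bits):
    """Per-row asymmetric min-max grid (an illustrative choice)."""
    maxq = 2**bits - 1
    wmin = np.minimum(W.min(axis=1), 0.0)
    wmax = np.maximum(W.max(axis=1), 0.0)
    scale = (wmax - wmin) / maxq
    scale[scale == 0] = 1.0              # avoid division by zero for constant rows
    zero = np.round(-wmin / scale)
    return scale, zero, maxq


def round_to_grid(w, scale, zero, maxq):
    """Round-to-nearest onto the grid, returned as dequantized floats."""
    q = np.clip(np.round(w / scale) + zero, 0, maxq)
    return scale * (q - zero)


def quantize_with_compensation(W, X, bits=4, damp=0.01):
    """Quantize W (rows x d) column by column, given layer inputs X (d x n)."""
    W = np.array(W, dtype=np.float64)    # work on a copy
    d = W.shape[1]
    H = 2.0 * X @ X.T                    # layer Hessian, shared by all rows
    H += damp * np.mean(np.diag(H)) * np.eye(d)   # dampening for invertibility
    Hinv = np.linalg.inv(H)
    scale, zero, maxq = make_grid(W, bits)
    Q = np.zeros_like(W)
    for j in range(d):
        Q[:, j] = round_to_grid(W[:, j], scale, zero, maxq)
        # Scaled quantization error of column j, one entry per row.
        err = (W[:, j] - Q[:, j]) / Hinv[j, j]
        # Shift the not-yet-quantized columns to compensate for that error.
        W[:, j + 1:] -= np.outer(err, Hinv[j, j + 1:])
        # Eliminate index j from the inverse Hessian of the remaining columns.
        Hinv[j + 1:, j + 1:] -= (
            np.outer(Hinv[j + 1:, j], Hinv[j, j + 1:]) / Hinv[j, j]
        )
    return Q


# Toy comparison against plain round-to-nearest on the same grid.
rng = np.random.default_rng(0)
W = rng.normal(size=(4, 8))
X = rng.normal(size=(8, 64))
Q = quantize_with_compensation(W, X, bits=4)
scale, zero, maxq = make_grid(W, 4)
Q_rtn = round_to_grid(W, scale[:, None], zero[:, None], maxq)
print("layer output error, with compensation:", np.sum((W @ X - Q @ X) ** 2))
print("layer output error, plain rounding:   ", np.sum((W @ X - Q_rtn @ X) ** 2))
```

Because every row shares the same Hessian and the same column order, the inverse-Hessian update is done once per column rather than once per weight, which is the kind of saving that makes the procedure practical at large layer widths.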
Simulated Author's Rebuttal
We thank the referee for the constructive feedback and positive assessment of the work's potential significance. We address each major comment point by point below, with proposed revisions to the manuscript where appropriate.
-
Referee: [Method section (quantization procedure and Hessian approximation)] The central claim that the layer-wise approximate inverse Hessian (computed on a small calibration set via Cholesky updates) remains sufficiently accurate for 175B-scale models without compounding errors lacks any supporting analysis or bounds. No scaling study or error-propagation argument is given for layer dimensions of ~12k or for the depth of a 175B-parameter network.
Authors: We agree that the manuscript provides no formal theoretical bounds or error-propagation analysis for the layer-wise Hessian approximation. The method is presented as an efficient, practical approximation whose reliability is demonstrated empirically on models up to 175B parameters. Deriving rigorous bounds at this scale is a substantial theoretical undertaking that lies outside the paper's empirical focus. In the revision we will add a short discussion of the approximation's observed stability (based on per-layer quantization error and end-to-end perplexity) together with a scaling plot that compares performance from 1B to 175B models. revision: partial
-
Referee: [Experiments section (OPT-175B results and ablations)] Experimental results on OPT-175B report low perplexity degradation at 3-4 bits, but the manuscript supplies no ablation on calibration-set size versus model scale, no variance across multiple calibration draws or random seeds, and no comparison of Hessian approximation quality at different model widths.
Authors: We accept that these ablations are missing and would strengthen the experimental section. The revised manuscript will include: (i) perplexity versus calibration-set size for models of varying scale, (ii) mean and standard deviation of perplexity over at least three independent calibration draws, and (iii) a comparison of per-layer quantization error for layers of different widths. These additions will be placed in the experiments section and the associated appendix. revision: yes
-
Referee: [Introduction and Experiments] The claim of 'negligible accuracy degradation' and 'more than doubles the compression gains' relative to prior one-shot methods rests on the unverified assumption that second-order information stays reliable at this scale; without the missing analysis, the four-GPU-hour practicality claim cannot be generalized beyond the specific empirical runs shown.
Authors: The statements in the introduction and experiments are tied directly to the concrete empirical results reported for the tested models (including the 175B OPT model quantized in roughly four GPU hours). We do not assert that the second-order approximation is theoretically guaranteed to remain reliable at arbitrary scales. In the revision we will rephrase the relevant sentences to make the empirical basis explicit and to note that broader generalization remains an open question for future study. revision: partial
- Deriving formal theoretical bounds or an error-propagation argument for the approximate inverse-Hessian at 175B scale and layer widths of ~12k.
Circularity Check
No significant circularity; empirical method with independent validation
The paper presents GPTQ as a one-shot post-training quantization procedure that applies established approximate second-order (Hessian) information to weight quantization, with all performance claims (4-GPU-hour runtime, 3-4 bit accuracy on 175B models) resting on direct experimental measurements rather than any derivation that reduces by construction to the method's own fitted quantities or self-citations. No load-bearing step equates a claimed prediction to an input by definition, and the central accuracy result is externally falsifiable via perplexity and downstream metrics on held-out data.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: Approximate second-order information suffices to select quantization values that preserve model accuracy at 3-4 bits without retraining.
Lean theorems connected to this paper
-
Cost.FunctionalEquation washburn_uniqueness_aczel (tag: unclear)
unclear: Relation between the paper passage and the cited Recognition theorem.
GPTQ... based on approximate second-order information... quantize GPT models with 175 billion parameters in approximately four GPU hours, reducing the bitwidth down to 3 or 4 bits per weight, with negligible accuracy degradation
-
Foundation.AlphaCoordinateFixation costAlphaLog_fourth_deriv_at_zero (tag: unclear)
unclear: Relation between the paper passage and the cited Recognition theorem.
The approximate second-order information (Hessian-based) remains sufficiently accurate for guiding quantization decisions across all layers of 175B-scale models
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Forward citations
Cited by 60 Pith papers
-
HodgeCover: Higher-Order Topological Coverage Drives Compression of Sparse Mixture-of-Experts
HodgeCover isolates the harmonic kernel of a simplicial Laplacian on an expert 2-complex to identify irreducible merge cycles and selects experts for aggressive compression, matching or exceeding baselines on open-wei...
-
AIGaitor: Privacy-preserving and cloud-free motion analysis for everyone, using edge computing
The paper presents AIGaitor, a privacy-preserving on-device monocular motion analysis system that performs end-to-end pose estimation and deep learning gait analysis on consumer smartphones.
-
Llamas on the Web: Memory-Efficient, Performance-Portable, and Multi-Precision LLM Inference with WebGPU
LlamaWeb is a WebGPU backend for llama.cpp that uses static memory planning, tunable kernels, and templated multi-precision support to cut memory use by 29-33% and raise decode throughput by 45-69% versus prior browse...
-
LongLive-2.0: An NVFP4 Parallel Infrastructure for Long Video Generation
LongLive-2.0 delivers an NVFP4 parallel infrastructure that enables direct training of long multi-shot autoregressive diffusion video models and achieves up to 2.15x training and 1.84x inference speedups on Blackwell ...
-
When Bits Break Recourse: Counterfactual-Faithful Quantization
CFQ trains quantizer parameters and mixed-precision allocation to preserve counterfactual recourse validity, cost, and direction on Adult, German Credit, and COMPAS while matching accuracy of standard quantizers.
-
Not All Tasks Quantize Equally: Fisher-Guided Quantization for Visual Geometry Transformer
Fisher-Guided Quantization uses the diagonal Fisher information matrix to measure and protect task-, block-, and channel-specific sensitivities during post-training quantization of multi-task 3D transformers, yielding...
-
From Text to Voice: A Reproducible and Verifiable Framework for Evaluating Tool Calling LLM Agents
A dataset-agnostic framework converts text tool-calling benchmarks to paired audio versions via TTS and noise, showing model-dependent performance with small text-to-voice gaps of 1.8-4.8 points on Confetti and When2Call.
-
Knowledge Beyond Language: Bridging the Gap in Multilingual Machine Unlearning Evaluation
New metrics KSS and KPS are introduced to evaluate multilingual machine unlearning quality and cross-language consistency in LLMs, addressing limitations of single-language evaluation protocols.
-
Multi-Scale Dequant: Eliminating Dequantization Bottleneck via Activation Decomposition for Efficient LLM Inference
MSD eliminates dequantization from the GEMM path by decomposing BF16 activations into multiple low-precision parts that multiply directly with INT8 or MXFP4 weights, achieving near-16 effective bits for INT8 and 6.6 f...
-
LoopQ: Quantization for Recursive Transformers
LoopQ provides a loop-aware PTQ framework for recursive Transformers that mitigates distribution shift, state reuse, and recursive error accumulation, yielding 68.8% higher average accuracy and 87.7% lower perplexity ...
-
Layer Collapse in Diffusion Language Models
Early layers in diffusion language models like LLaDA-8B collapse into redundant representations around a critical super-outlier activation due to overtraining, making them more robust to quantization and sparsity than...
-
Layer Collapse in Diffusion Language Models
Diffusion language models develop early-layer collapse around an indispensable super-outlier due to overtraining, resulting in higher compressibility and reversed optimal sparsity patterns versus autoregressive models.
-
When Quantization Is Free: An int4 KV Cache That Outruns fp16 on Apple Silicon
A single fused int4 KV cache kernel on Apple Silicon outperforms fp16 in latency with 3x memory compression and near-zero quality loss on tested models.
-
Quantamination: Dynamic Quantization Leaks Your Data Across the Batch
Dynamic quantization creates side channels allowing partial or full recovery of other users' batched data in at least four popular ML frameworks.
-
Variance Is Not Importance: Structural Analysis of Transformer Compressibility Across Model Scales
High-variance activation directions are uncorrelated with predictions, transformer blocks grow more linear with depth, and single-block linear replacement yields 34x compression on Mistral's final block at a 1.71 perp...
-
When W4A4 Breaks Camouflaged Object Detection: Token-Group Dual-Constraint Activation Quantization
COD-TDQ uses token-group scaling and dual-constraint projection to fix 4-bit activation quantization for camouflaged object detection, delivering more than 0.12 higher Sα scores than prior methods on four benchmarks w...
-
When Flat Minima Fail: Characterizing INT4 Quantization Collapse After FP32 Convergence
FP32-converged language models enter a post-convergence phase where INT4 quantization error explodes while FP32 perplexity remains stable, with onset tied to fine convergence rather than learning rate decay.
-
Hidden Reliability Risks in Large Language Models: Systematic Identification of Precision-Induced Output Disagreements
PrecisionDiff is a differential testing framework that uncovers widespread precision-induced behavioral disagreements in aligned LLMs, including safety-critical jailbreak divergences across precision formats.
-
ENEC: A Lossless AI Model Compression Method Enabling Fast Inference on Ascend NPUs
ENEC delivers 3.43X higher throughput than DietGPU and 1.12X better compression ratio than nvCOMP for lossless model weight compression on Ascend NPUs, yielding up to 6.3X end-to-end inference speedup.
-
Why Smaller Is Slower? Dimensional Misalignment in Compressed LLMs
Dimensional misalignment slows compressed LLMs on GPUs; GAC uses knapsack optimization to achieve full alignment and up to 1.5x speedup on Llama-3-8B while preserving quality.
-
QuantVLA: Scale-Calibrated Post-Training Quantization for Vision-Language-Action Models
QuantVLA is the first post-training quantization framework for VLA models that quantizes the diffusion transformer action head and reports higher task success rates than full-precision baselines with roughly 70% memor...
-
CORP: Closed-Form One-shot Representation-Preserving Structured Pruning for Transformers
CORP performs one-shot structured pruning of Transformers by modeling removed components as affine functions of retained ones and solving closed-form ridge regressions on calibration data to fold compensation into wei...
-
Vec-LUT: Vector Table Lookup for Parallel Ultra-Low-Bit LLM Inference on Edge Devices
Vec-LUT delivers up to 4.2x speedup over prior LUT methods for parallel ultra-low-bit LLM inference on edge devices by unifying lookups across tokens and adding cache-aware tensor layouts.
-
Four Over Six: More Accurate NVFP4 Quantization with Adaptive Block Scaling
Four Over Six adaptively scales blocks in NVFP4 quantization to smaller FP4 values, making representable value distributions more uniform and reducing quantization error especially for near-maximal values.
-
Decomposed Trust: Privacy, Adversarial Robustness, Ethics, and Fairness in Low-Rank LLMs
Low-rank compression preserves training-data privacy and improves adversarial robustness but weakens personal-information protection, reduces ethical behavior in zero-shot use, and harms fairness.
-
SpinQuant: LLM quantization with learned rotations
SpinQuant learns optimal rotations to enable accurate 4-bit quantization of LLM weights, activations, and KV cache, reducing the zero-shot gap to full precision to 2.9 points on LLaMA-2 7B.
-
Medusa: Simple LLM Inference Acceleration Framework with Multiple Decoding Heads
Medusa augments LLMs with multiple decoding heads and tree-based attention to predict and verify several tokens in parallel, yielding 2.2-3.6x inference speedup via two fine-tuning regimes.
-
QLoRA: Efficient Finetuning of Quantized LLMs
QLoRA finetunes 4-bit quantized LLMs via LoRA adapters to match full-precision performance while using far less memory, enabling 65B-scale training on single GPUs and producing Guanaco models near ChatGPT level.
-
LLMs as Noisy Channels: A Shannon Perspective on Model Capacity and Scaling Laws
The Shannon Scaling Law treats LLM training as noisy-channel transmission and predicts U-shaped performance degradation when signal-to-noise ratio falls below a threshold, outperforming monotonic scaling laws on Pythi...
-
ThriftAttention: Selective Mixed Precision for Long-Context FP4 Attention
ThriftAttention recovers 89.1% of the FP16 quality gap versus pure FP4 attention by running only 5% of query-key blocks in FP16 on long-context benchmarks.
-
GEMQ: Global Expert-Level Mixed-Precision Quantization for MoE LLMs
GEMQ applies global LP-based expert importance estimation and router fine-tuning within progressive quantization to cut memory and speed inference in MoE LLMs with little accuracy loss.
-
Breaking Modality Heterogeneity in Low-Bit Quantization for Large Vision-Language Models
SplitQ improves low-bit PTQ for VLMs by isolating modality-specific outlier channels via MOCD and applying dual-branch adaptive calibration via ACC, outperforming prior methods on six datasets across W4A8 to W3A2 settings.
-
OScaR: The Occam's Razor for Extreme KV Cache Quantization in LLMs and Beyond
OScaR mitigates token norm imbalance via canalized rotation and omni-token scaling to enable near-lossless INT2 KV cache quantization with up to 3x decoding speedup and 5.3x memory reduction.
-
A Geometric Analysis of Sign-Magnitude Asymmetry in a ReLU + RMSNorm Block under Ternary Quantization
Sign-flip perturbations produce π/(π-2) ≈ 2.75 times more transverse output energy than equal-norm sign-preserving perturbations in a ReLU + RMSNorm block because ReLU creates directional asymmetry that RMSNorm's tran...
-
Dynamic Model Merging Made Slim
DiDi-Merging achieves dynamic model merging performance matching or exceeding prior methods while using only 1.24x to 1.4x the parameters of a single fine-tuned model.
-
OpenJarvis: Personal AI, On Personal Devices
OpenJarvis decomposes personal AI into Intelligence, Engine, Agents, Tools & Memory, and Learning primitives and applies LLM-guided spec search to produce on-device configurations that reach within 3.2 pp of cloud bas...
-
Not All Tasks Quantize Equally: Fisher-Guided Quantization for Visual Geometry Transformer
FGQ applies diagonal Fisher information to guide learnable affine transformations in PTQ for multi-task VGGT, yielding up to 39% relative gains over baselines at 4-bit quantization.
-
From Text to Voice: A Reproducible and Verifiable Framework for Evaluating Tool Calling LLM Agents
A dataset-agnostic framework converts text tool-calling benchmarks to paired audio evaluations via TTS, speaker variation and noise, then evaluates seven omni-modal models showing model- and task-dependent performance...
-
Nonlinear Bipolar Compensation: Handling Outliers in Post-Training Quantization
Nonlinear Bipolar Compensation with Bipolar Logarithmic Transformation reduces outlier effects in post-training quantization by performing compensation in a compressed transformed space.
-
Search Your Block Floating Point Scales!
ScaleSearch optimizes block floating point scales via fine-grained search to cut quantization error by 27% for NVFP4, improving PTQ by up to 15 points on MATH500 for Qwen3-8B and attention PPL by 0.77 on Llama 3.1 70B.
-
ADMM-Q: An Improved Hessian-based Weight Quantizer for Post-Training Quantization of Large Language Models
ADMM-Q is a new post-training quantization method using ADMM operator splitting that reduces WikiText-2 perplexity compared to GPTQ on Qwen3-8B across W3A16, W4A8, and W2A4KV4 settings.
-
Theory-optimal Quantization Based on Flatness
The paper introduces the Flatness metric, derives a theory-optimal quantization solution, and presents BDQ that uses bidirectional diagonal transformations to reduce outlier impact, achieving under 1% drop at W4A4 on ...
-
Fitting Is Not Enough: Smoothness in Extremely Quantized LLMs
Extremely quantized LLMs degrade in smoothness, sparsifying the decoding tree and hurting generation quality; a smoothness-preserving principle delivers gains beyond numerical fitting.
-
Different Prompts, Different Ranks: Prompt-aware Dynamic Rank Selection for SVD-based LLM Compression
PARSE trains a prompt-aware linear router on dense-model outputs to select dynamic SVD ranks, improving accuracy up to 10% at 0.6 compression ratio on LLaMA-7B while delivering 2.5x prefill and 2.4x decode speedups.
-
XtraMAC: An Efficient MAC Architecture for Mixed-Precision LLM Inference on FPGA
XtraMAC unifies mixed-precision MAC on FPGA via shared integer mantissa products, delivering 1.4-2.0x higher compute density and up to 1.9x better energy efficiency.
-
OSAQ: Outlier Self-Absorption for Accurate Low-bit LLM Quantization
OSAQ suppresses weight outliers in LLMs via a closed-form additive transformation from the Hessian's stable null space, improving 2-bit quantization perplexity by over 40% versus vanilla GPTQ with no inference overhead.
-
OSAQ: Outlier Self-Absorption for Accurate Low-bit LLM Quantization
OSAQ uses the low-rank structure of the Hessian to construct a closed-form additive weight transformation that suppresses outliers without changing task loss, enabling better low-bit LLM quantization.
-
Compress Then Adapt? No, Do It Together via Task-aware Union of Subspaces
JACTUS unifies low-rank compression and task adaptation via a task-aware union of subspaces and global rank allocation by marginal gain, outperforming 100% PEFT methods like DoRA on ViT-Base (89.2% avg) and Llama2-7B ...
-
WindowQuant: Mixed-Precision KV Cache Quantization based on Window-Level Similarity for VLMs Inference Optimization
WindowQuant performs window-adaptive mixed-precision KV cache quantization guided by similarity to the text prompt, with reordering to enable efficient inference in VLMs.
-
Sharpness-Aware Pretraining Mitigates Catastrophic Forgetting
Sharpness-aware pretraining and related flat-minima interventions reduce catastrophic forgetting by up to 80% after post-training across 20M-150M models and by 31-40% at 1B scale.
-
Technical Report: Activation Residual Hessian Quantization (ARHQ) for Low-Bit LLM Quantization
ARHQ isolates error-sensitive weight directions in LLMs via truncated SVD on the scaled matrix W G_x^{1/2} from activation residuals, improving SNR and preserving performance under aggressive low-bit quantization.
-
CoQuant: Joint Weight-Activation Subspace Projection for Mixed-Precision LLMs
CoQuant selects optimal high-precision subspaces for mixed-precision LLM quantization via a closed-form weighted PCA that balances weight and activation covariances derived from expected output error.
-
BitRL: Reinforcement Learning with 1-bit Quantized Language Models for Resource-Constrained Edge Deployment
BitRL enables on-device RL agents via 1-bit quantized language models, delivering 10-16x memory reduction and 3-5x energy efficiency gains with 85-98% retained performance.
-
FASQ: Flexible Accelerated Subspace Quantization for Calibration-Free LLM Compression
FASQ delivers calibration-free LLM compression with continuous size trade-offs via product quantization and custom CUDA kernels that accelerate decode beyond FP16 speeds on consumer hardware.
-
MCAP: Deployment-Time Layer Profiling for Memory-Constrained LLM Inference
MCAP uses load-time Monte Carlo profiling to estimate layer importance, enabling dynamic quantization (W4A8 vs W4A16) and memory tiering (GPU/RAM/SSD) that delivers 1.5-1.8x higher decode throughput than llama-cpp Q4_...
-
From Signal Degradation to Computation Collapse: Uncovering the Two Failure Modes of LLM Quantization
LLM 2-bit quantization fails via either cumulative signal degradation or early computation collapse in key components.
-
GRASPrune: Global Gating for Budgeted Structured Pruning of Large Language Models
GRASPrune removes 50% of parameters from LLaMA-2-7B via global gating and projected straight-through estimation, reaching 12.18 WikiText-2 perplexity and competitive zero-shot accuracy after four epochs on 512 calibra...
-
Are Large Language Models Economically Viable for Industry Deployment?
Small LLMs under 2B parameters achieve better economic break-even, energy efficiency, and hardware density than larger models on legacy GPUs for industrial tasks.
-
LBLLM: Lightweight Binarization of Large Language Models via Three-Stage Distillation
LBLLM achieves better accuracy than prior binarization methods for LLMs by decoupling weight and activation quantization through initialization, layer-wise distillation, and learnable activation scaling.
-
SAW-INT4: System-Aware 4-Bit KV-Cache Quantization for Real-World LLM Serving
Token-wise INT4 KV-cache quantization plus block-diagonal Hadamard rotation recovers nearly all accuracy lost by naive INT4 while adding zero end-to-end overhead under paged serving constraints.