GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers
Pith reviewed 2026-05-10 17:12 UTC · model grok-4.3
The pith
GPTQ quantizes 175 billion parameter GPT models to 3 or 4 bits per weight in about four GPU hours with negligible accuracy loss.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The authors show that their GPTQ method, based on approximate second-order information, can quantize GPT models with up to 175 billion parameters down to 3 or 4 bits per weight. This process takes approximately four GPU hours and results in negligible accuracy degradation relative to the uncompressed model. The approach more than doubles the compression gains of previous one-shot methods and enables single-GPU inference for these massive models, with observed speedups of 3.25x on A100 GPUs and 4.5x on A6000 GPUs.
What carries the argument
Approximate second-order information, specifically Hessian-based approximations, used to make layer-wise quantization decisions in a one-shot post-training process.
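As a hedged illustration of what "second-order information" means in this setting, layer-wise post-training quantization is commonly posed as a reconstruction problem over one linear layer at a time. The notation below is assumed for this sketch and does not appear in the summary above: W is the layer's weight matrix, X holds the calibration inputs to that layer (one column per sample), and Ŵ is the quantized weight matrix.

```latex
\hat{W} = \arg\min_{\hat{W}} \left\lVert W X - \hat{W} X \right\rVert_2^2,
\qquad
H = 2 X X^{\top}
```

Because the objective is quadratic in each row of Ŵ, its Hessian with respect to any single row is the same matrix H, which depends only on the calibration inputs and not on the weights. That is what makes a per-layer, one-shot use of curvature information tractable.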
If this is right
- 175 billion parameter models become runnable for generative inference inside a single GPU.
- Accuracy is preserved at 3-4 bit quantization, more than doubling prior compression gains for one-shot methods.
- End-to-end inference achieves speedups of approximately 3.25x on high-end GPUs like the NVIDIA A100 and 4.5x on cost-effective ones like the NVIDIA A6000.
- Reasonable accuracy holds in extreme cases of 2-bit or ternary quantization.
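The single-GPU consequence can be sanity-checked with back-of-the-envelope arithmetic. The sketch below counts weight storage only; it ignores quantization metadata (scales, zero points), activations, and the KV cache, and the 80 GB budget is an assumption used for illustration, not a figure from the text above.

```python
# Weights-only storage estimate for a 175B-parameter model.
PARAMS = 175e9          # parameter count quoted in the summary
GPU_BUDGET_GB = 80.0    # ASSUMPTION: memory of one high-end GPU


def weight_gb(bits_per_weight: float, params: float = PARAMS) -> float:
    """Storage for the weights alone, in gigabytes (1 GB = 1e9 bytes)."""
    return params * bits_per_weight / 8 / 1e9


sizes = {bits: weight_gb(bits) for bits in (16, 4, 3)}
for bits, gb in sizes.items():
    verdict = "fits" if gb <= GPU_BUDGET_GB else "does not fit"
    print(f"{bits:>2}-bit weights: {gb:6.1f} GB ({verdict} in {GPU_BUDGET_GB:.0f} GB)")
```

Under the assumed 80 GB budget, 16-bit weights (350 GB) and 4-bit weights (87.5 GB) exceed it, while 3-bit weights (about 65.6 GB) leave some headroom. Real deployments need extra memory beyond the weights, so these numbers are lower bounds.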
Where Pith is reading between the lines
- GPTQ's efficiency could make quantization standard practice for deploying very large language models on limited hardware.
- The one-shot nature suggests the approach may extend to other large transformer families with similar scale challenges.
- Observed speedups point to new possibilities for interactive or real-time use of generative models that previously required multiple GPUs.
Load-bearing premise
The Hessian-based approximate second-order information stays sufficiently accurate for all layers in 175B-scale models without accumulating errors that would necessitate retraining.
What would settle it
A test showing large accuracy degradation or much longer quantization time when applying the method to a 175B parameter GPT model would disprove the central performance claims.
Original abstract
Generative Pre-trained Transformer models, known as GPT or OPT, set themselves apart through breakthrough performance across complex language modelling tasks, but also by their extremely high computational and storage costs. Specifically, due to their massive size, even inference for large, highly-accurate GPT models may require multiple performant GPUs, which limits the usability of such models. While there is emerging work on relieving this pressure via model compression, the applicability and performance of existing compression techniques is limited by the scale and complexity of GPT models. In this paper, we address this challenge, and propose GPTQ, a new one-shot weight quantization method based on approximate second-order information, that is both highly-accurate and highly-efficient. Specifically, GPTQ can quantize GPT models with 175 billion parameters in approximately four GPU hours, reducing the bitwidth down to 3 or 4 bits per weight, with negligible accuracy degradation relative to the uncompressed baseline. Our method more than doubles the compression gains relative to previously-proposed one-shot quantization methods, preserving accuracy, allowing us for the first time to execute an 175 billion-parameter model inside a single GPU for generative inference. Moreover, we also show that our method can still provide reasonable accuracy in the extreme quantization regime, in which weights are quantized to 2-bit or even ternary quantization levels. We show experimentally that these improvements can be leveraged for end-to-end inference speedups over FP16, of around 3.25x when using high-end GPUs (NVIDIA A100) and 4.5x when using more cost-effective ones (NVIDIA A6000). The implementation is available at https://github.com/IST-DASLab/gptq.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper presents GPTQ, a one-shot post-training quantization method for large GPT and OPT models that uses approximate second-order (Hessian) information to quantize weights to 3-4 bits (and even 2-bit/ternary) while claiming negligible accuracy loss relative to FP16 baselines. It reports that a 175B-parameter model can be quantized in ~4 GPU hours, more than doubling prior one-shot compression ratios, enabling single-GPU generative inference, and delivering 3.25-4.5x end-to-end speedups on A100/A6000 hardware.
Significance. If the empirical claims hold, the work would be significant for efficient deployment of large language models: it offers a practical, retraining-free route to 3-4 bit inference on models previously requiring multiple high-end GPUs, with open-source code that could accelerate follow-on research in post-training compression.
major comments (3)
- [Method section (quantization procedure and Hessian approximation)] The central claim that the layer-wise approximate inverse-Hessian (computed on a small calibration set via Cholesky updates) remains sufficiently accurate for 175B-scale models without compounding errors lacks any supporting analysis or bounds. No scaling study or error-propagation argument is given for layer dimensions of ~12k or for the network depth of a 175B-parameter model.
- [Experiments section (OPT-175B results and ablations)] Experimental results on OPT-175B report low perplexity degradation at 3-4 bits, but the manuscript supplies no ablation on calibration-set size versus model scale, no variance across multiple calibration draws or random seeds, and no comparison of Hessian approximation quality at different model widths.
- [Introduction and Experiments] The claim of 'negligible accuracy degradation' and 'more than doubles the compression gains' relative to prior one-shot methods rests on the unverified assumption that second-order information stays reliable at this scale; without the missing analysis, the four-GPU-hour practicality claim cannot be generalized beyond the specific empirical runs shown.
minor comments (2)
- [Abstract] The abstract states that the method 'more than doubles the compression gains' but does not name the exact prior one-shot baselines or report the precise ratio in the summary paragraph.
- [Method] Notation for the per-layer Hessian update and error compensation step could be clarified with an explicit algorithm box or pseudocode to aid reproducibility.
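To make the second minor comment concrete, here is a simplified NumPy sketch of column-by-column quantization with inverse-Hessian error compensation. It is not the authors' implementation: it uses an explicit matrix inverse with a rank-one downdate instead of Cholesky factors, processes one column at a time with no blocking, and the function names, the per-row min-max grid, and the damping value are all assumptions made for illustration.

```python
import numpy as np


def make_grid(W, bits):
    """Per-row asymmetric min-max grid (an assumed, simple choice)."""
    maxq = 2**bits - 1
    lo = np.minimum(W.min(axis=1), 0.0)
    hi = np.maximum(W.max(axis=1), 0.0)
    scale = (hi - lo) / maxq
    scale[scale == 0] = 1.0  # guard against all-zero rows
    zero = np.round(-lo / scale)
    return scale, zero, maxq


def round_to_grid(w, scale, zero, maxq):
    """Round to the nearest grid point and return the dequantized value."""
    q = np.clip(np.round(w / scale) + zero, 0, maxq)
    return scale * (q - zero)


def quantize_layer(W, X, bits=4, damp=0.01):
    """Quantize W (rows x cols) using calibration inputs X (cols x samples)."""
    W = W.astype(np.float64).copy()
    n_cols = W.shape[1]
    H = 2.0 * X @ X.T                                  # layer-wise Hessian
    H += damp * np.mean(np.diag(H)) * np.eye(n_cols)   # ASSUMED damping
    Hinv = np.linalg.inv(H)
    scale, zero, maxq = make_grid(W, bits)
    Q = np.zeros_like(W)
    for i in range(n_cols):
        d = Hinv[i, i]
        Q[:, i] = round_to_grid(W[:, i], scale, zero, maxq)
        err = (W[:, i] - Q[:, i]) / d
        # Push the rounding error onto the not-yet-quantized columns.
        W[:, i + 1:] -= np.outer(err, Hinv[i, i + 1:])
        # Remove column i from the inverse Hessian of the remaining columns.
        Hinv[i + 1:, i + 1:] -= np.outer(Hinv[i + 1:, i], Hinv[i, i + 1:]) / d
    return Q


# Toy comparison against plain round-to-nearest on random data.
rng = np.random.default_rng(0)
W = rng.normal(size=(8, 16))
X = rng.normal(size=(16, 256))
Q = quantize_layer(W, X, bits=4)
scale, zero, maxq = make_grid(W, 4)
Q_rtn = round_to_grid(W, scale[:, None], zero[:, None], maxq)
print("compensated output error:", np.linalg.norm((W - Q) @ X))
print("round-to-nearest error:  ", np.linalg.norm((W - Q_rtn) @ X))
```

The quantity compared at the end is the layer output error, not the weight error, since that is what the compensation step targets. The greedy column order gives no guarantee of beating round-to-nearest on every input, so the two printed numbers should be read as a toy comparison only.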
Simulated Author's Rebuttal
We thank the referee for the constructive feedback and positive assessment of the work's potential significance. We address each major comment point by point below, with proposed revisions to the manuscript where appropriate.
Point-by-point responses
-
Referee: [Method section (quantization procedure and Hessian approximation)] The central claim that the layer-wise approximate inverse-Hessian (computed on a small calibration set via Cholesky updates) remains sufficiently accurate for 175B-scale models without compounding errors lacks any supporting analysis or bounds. No scaling study or error-propagation argument is given for layer dimensions of ~12k or for the network depth of a 175B-parameter model.
Authors: We agree that the manuscript provides no formal theoretical bounds or error-propagation analysis for the layer-wise Hessian approximation. The method is presented as an efficient, practical approximation whose reliability is demonstrated empirically on models up to 175B parameters. Deriving rigorous bounds at this scale is a substantial theoretical undertaking that lies outside the paper's empirical focus. In the revision we will add a short discussion of the approximation's observed stability (based on per-layer quantization error and end-to-end perplexity) together with a scaling plot that compares performance from 1B to 175B models. revision: partial
-
Referee: [Experiments section (OPT-175B results and ablations)] Experimental results on OPT-175B report low perplexity degradation at 3-4 bits, but the manuscript supplies no ablation on calibration-set size versus model scale, no variance across multiple calibration draws or random seeds, and no comparison of Hessian approximation quality at different model widths.
Authors: We accept that these ablations are missing and would strengthen the experimental section. The revised manuscript will include: (i) perplexity versus calibration-set size for models of varying scale, (ii) mean and standard deviation of perplexity over at least three independent calibration draws, and (iii) a comparison of per-layer quantization error for layers of different widths. These additions will be placed in the experiments section and the associated appendix. revision: yes
-
Referee: [Introduction and Experiments] The claim of 'negligible accuracy degradation' and 'more than doubles the compression gains' relative to prior one-shot methods rests on the unverified assumption that second-order information stays reliable at this scale; without the missing analysis, the four-GPU-hour practicality claim cannot be generalized beyond the specific empirical runs shown.
Authors: The statements in the introduction and experiments are tied directly to the concrete empirical results reported for the tested models (including the 175B OPT model quantized in roughly four GPU hours). We do not assert that the second-order approximation is theoretically guaranteed to remain reliable at arbitrary scales. In the revision we will rephrase the relevant sentences to make the empirical basis explicit and to note that broader generalization remains an open question for future study. revision: partial
- Deriving formal theoretical bounds or an error-propagation argument for the approximate inverse-Hessian at 175B scale and layer widths of ~12k.
Circularity Check
No significant circularity; empirical method with independent validation
full rationale
The paper presents GPTQ as a one-shot post-training quantization procedure that applies established approximate second-order (Hessian) information to weight quantization. All performance claims (roughly four GPU hours of runtime, 3-4 bit accuracy on 175B models) rest on direct experimental measurements, not on a derivation that reduces by construction to the method's own fitted quantities or to self-citations. No load-bearing step equates a claimed prediction to an input by definition, and the central accuracy result is externally falsifiable via perplexity and downstream metrics on held-out data.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: Approximate second-order information suffices to select quantization values that preserve model accuracy at 3-4 bits without retraining.
Lean theorems connected to this paper
-
Cost.FunctionalEquation · washburn_uniqueness_aczel · unclear: "GPTQ... based on approximate second-order information... quantize GPT models with 175 billion parameters in approximately four GPU hours, reducing the bitwidth down to 3 or 4 bits per weight, with negligible accuracy degradation"
-
Foundation.AlphaCoordinateFixation · costAlphaLog_fourth_deriv_at_zero · unclear: "The approximate second-order information (Hessian-based) remains sufficiently accurate for guiding quantization decisions across all layers of 175B-scale models"
Forward citations
Cited by 60 Pith papers
-
Layer Collapse in Diffusion Language Models
Diffusion language models develop early-layer collapse around an indispensable super-outlier due to overtraining, resulting in higher compressibility and reversed optimal sparsity patterns versus autoregressive models.
-
Layer Collapse in Diffusion Language Models
Early layers in diffusion language models like LLaDA-8B collapse into redundant representations around a critical super-outlier activation due to overtraining, making them more robust to quantization and sparsity than...
-
When Quantization Is Free: An int4 KV Cache That Outruns fp16 on Apple Silicon
A single fused int4 KV cache kernel on Apple Silicon outperforms fp16 in latency with 3x memory compression and near-zero quality loss on tested models.
-
Quantamination: Dynamic Quantization Leaks Your Data Across the Batch
Dynamic quantization creates side channels allowing partial or full recovery of other users' batched data in at least four popular ML frameworks.
-
Variance Is Not Importance: Structural Analysis of Transformer Compressibility Across Model Scales
High-variance activation directions are uncorrelated with predictions, transformer blocks grow more linear with depth, and single-block linear replacement yields 34x compression on Mistral's final block at a 1.71 perp...
-
When W4A4 Breaks Camouflaged Object Detection: Token-Group Dual-Constraint Activation Quantization
COD-TDQ uses token-group scaling and dual-constraint projection to fix 4-bit activation quantization for camouflaged object detection, delivering more than 0.12 higher Sα scores than prior methods on four benchmarks w...
-
When Flat Minima Fail: Characterizing INT4 Quantization Collapse After FP32 Convergence
FP32-converged language models enter a post-convergence phase where INT4 quantization error explodes while FP32 perplexity remains stable, with onset tied to fine convergence rather than learning rate decay.
-
Hidden Reliability Risks in Large Language Models: Systematic Identification of Precision-Induced Output Disagreements
PrecisionDiff is a differential testing framework that uncovers widespread precision-induced behavioral disagreements in aligned LLMs, including safety-critical jailbreak divergences across precision formats.
-
ENEC: A Lossless AI Model Compression Method Enabling Fast Inference on Ascend NPUs
ENEC delivers 3.43X higher throughput than DietGPU and 1.12X better compression ratio than nvCOMP for lossless model weight compression on Ascend NPUs, yielding up to 6.3X end-to-end inference speedup.
-
Medusa: Simple LLM Inference Acceleration Framework with Multiple Decoding Heads
Medusa augments LLMs with multiple decoding heads and tree-based attention to predict and verify several tokens in parallel, yielding 2.2-3.6x inference speedup via two fine-tuning regimes.
-
QLoRA: Efficient Finetuning of Quantized LLMs
QLoRA finetunes 4-bit quantized LLMs via LoRA adapters to match full-precision performance while using far less memory, enabling 65B-scale training on single GPUs and producing Guanaco models near ChatGPT level.
-
Search Your Block Floating Point Scales!
ScaleSearch optimizes block floating point scales via fine-grained search to cut quantization error by 27% for NVFP4, improving PTQ by up to 15 points on MATH500 for Qwen3-8B and attention PPL by 0.77 on Llama 3.1 70B.
-
ADMM-Q: An Improved Hessian-based Weight Quantizer for Post-Training Quantization of Large Language Models
ADMM-Q is a new post-training quantization method using ADMM operator splitting that reduces WikiText-2 perplexity compared to GPTQ on Qwen3-8B across W3A16, W4A8, and W2A4KV4 settings.
-
Fitting Is Not Enough: Smoothness in Extremely Quantized LLMs
Extremely quantized LLMs degrade in smoothness, sparsifying the decoding tree and hurting generation quality; a smoothness-preserving principle delivers gains beyond numerical fitting.
-
Different Prompts, Different Ranks: Prompt-aware Dynamic Rank Selection for SVD-based LLM Compression
PARSE trains a prompt-aware linear router on dense-model outputs to select dynamic SVD ranks, improving accuracy up to 10% at 0.6 compression ratio on LLaMA-7B while delivering 2.5x prefill and 2.4x decode speedups.
-
XtraMAC: An Efficient MAC Architecture for Mixed-Precision LLM Inference on FPGA
XtraMAC unifies mixed-precision MAC on FPGA via shared integer mantissa products, delivering 1.4-2.0x higher compute density and up to 1.9x better energy efficiency.
-
OSAQ: Outlier Self-Absorption for Accurate Low-bit LLM Quantization
OSAQ uses the low-rank structure of the Hessian to construct a closed-form additive weight transformation that suppresses outliers without changing task loss, enabling better low-bit LLM quantization.
-
OSAQ: Outlier Self-Absorption for Accurate Low-bit LLM Quantization
OSAQ suppresses weight outliers in LLMs via a closed-form additive transformation from the Hessian's stable null space, improving 2-bit quantization perplexity by over 40% versus vanilla GPTQ with no inference overhead.
-
Compress Then Adapt? No, Do It Together via Task-aware Union of Subspaces
JACTUS unifies low-rank compression and task adaptation via a task-aware union of subspaces and global rank allocation by marginal gain, outperforming 100% PEFT methods like DoRA on ViT-Base (89.2% avg) and Llama2-7B ...
-
WindowQuant: Mixed-Precision KV Cache Quantization based on Window-Level Similarity for VLMs Inference Optimization
WindowQuant performs window-adaptive mixed-precision KV cache quantization guided by similarity to the text prompt, with reordering to enable efficient inference in VLMs.
-
Sharpness-Aware Pretraining Mitigates Catastrophic Forgetting
Sharpness-aware pretraining and related flat-minima interventions reduce catastrophic forgetting by up to 80% after post-training across 20M-150M models and by 31-40% at 1B scale.
-
Technical Report: Activation Residual Hessian Quantization (ARHQ) for Low-Bit LLM Quantization
ARHQ isolates error-sensitive weight directions in LLMs via truncated SVD on the scaled matrix W G_x^{1/2} from activation residuals, improving SNR and preserving performance under aggressive low-bit quantization.
-
CoQuant: Joint Weight-Activation Subspace Projection for Mixed-Precision LLMs
CoQuant selects optimal high-precision subspaces for mixed-precision LLM quantization via a closed-form weighted PCA that balances weight and activation covariances derived from expected output error.
-
BitRL: Reinforcement Learning with 1-bit Quantized Language Models for Resource-Constrained Edge Deployment
BitRL enables on-device RL agents via 1-bit quantized language models, delivering 10-16x memory reduction and 3-5x energy efficiency gains with 85-98% retained performance.
-
FASQ: Flexible Accelerated Subspace Quantization for Calibration-Free LLM Compression
FASQ delivers calibration-free LLM compression with continuous size trade-offs via product quantization and custom CUDA kernels that accelerate decode beyond FP16 speeds on consumer hardware.
-
MCAP: Deployment-Time Layer Profiling for Memory-Constrained LLM Inference
MCAP uses load-time Monte Carlo profiling to estimate layer importance, enabling dynamic quantization (W4A8 vs W4A16) and memory tiering (GPU/RAM/SSD) that delivers 1.5-1.8x higher decode throughput than llama-cpp Q4_...
-
From Signal Degradation to Computation Collapse: Uncovering the Two Failure Modes of LLM Quantization
LLM 2-bit quantization fails via either cumulative signal degradation or early computation collapse in key components.
-
GRASPrune: Global Gating for Budgeted Structured Pruning of Large Language Models
GRASPrune removes 50% of parameters from LLaMA-2-7B via global gating and projected straight-through estimation, reaching 12.18 WikiText-2 perplexity and competitive zero-shot accuracy after four epochs on 512 calibra...
-
Are Large Language Models Economically Viable for Industry Deployment?
Small LLMs under 2B parameters achieve better economic break-even, energy efficiency, and hardware density than larger models on legacy GPUs for industrial tasks.
-
LBLLM: Lightweight Binarization of Large Language Models via Three-Stage Distillation
LBLLM achieves better accuracy than prior binarization methods for LLMs by decoupling weight and activation quantization through initialization, layer-wise distillation, and learnable activation scaling.
-
SAW-INT4: System-Aware 4-Bit KV-Cache Quantization for Real-World LLM Serving
Token-wise INT4 KV-cache quantization plus block-diagonal Hadamard rotation recovers nearly all accuracy lost by naive INT4 while adding zero end-to-end overhead under paged serving constraints.
-
Benchmarking System Dynamics AI Assistants: Cloud Versus Local LLMs on CLD Extraction and Discussion
Cloud LLMs reach 77-89% on CLD extraction while the best local model hits 77%; local models perform well on model-building steps but drop to 0-50% on error fixing due to long-context memory limits, with backend choice...
-
GSQ: Highly-Accurate Low-Precision Scalar Quantization for LLMs via Gumbel-Softmax Sampling
GSQ applies a Gumbel-Softmax relaxation to learn discrete grid assignments in scalar quantization, closing most of the accuracy gap to vector methods like QTIP on Llama-3.1 models at 2-3 bits while using only symmetri...
-
Depth Registers Unlock W4A4 on SwiGLU: A Reader/Generator Decomposition
Depth Registers plus hinge loss cut W4A4-induced perplexity collapse from 1727 to 119 in a 300M SwiGLU model by selectively taming reader-layer activations while leaving bilinear generator tails largely untouched.
-
Open-TQ-Metal: Fused Compressed-Domain Attention for Long-Context LLM Inference on Apple Silicon
Fused compressed-domain int4 attention on Apple Silicon delivers 48x speedup and 3.2x KV cache compression for 128K-context 70B models while matching FP16 token predictions.
-
MemExplorer: Navigating the Heterogeneous Memory Design Space for Agentic Inference NPUs
MemExplorer optimizes heterogeneous memory systems for agentic LLM inference on NPUs and reports up to 2.3x higher energy efficiency than baselines under fixed power budgets.
-
The Illusion of Equivalence: Systematic FP16 Divergence in KV-Cached Autoregressive Inference
FP16 KV caching in transformers causes deterministic token divergence versus cache-free inference due to non-associative floating-point accumulation orderings.
-
Robust Ultra Low-Bit Post-Training Quantization via Stable Diagonal Curvature Estimate
DASH-Q uses a stable diagonal curvature estimate and weighted least squares to achieve robust ultra-low-bit post-training quantization of LLMs, improving zero-shot accuracy by 7% on average over baselines.
-
Numerical Instability and Chaos: Quantifying the Unpredictability of Large Language Models
Large language models display three universal scale-dependent regimes of behavior—stable, chaotic, and signal-dominated—driven by floating-point rounding errors that produce an avalanche effect in early layers.
-
EdgeCIM: A Hardware-Software Co-Design for CIM-Based Acceleration of Small Language Models
A CIM-based hardware-software co-design in 65nm achieves up to 7.3x higher throughput and 49.59x better energy efficiency than NVIDIA Orin Nano for LLaMA3.2-1B, averaging 336 tokens/s and 173 tokens/J under INT4 acros...
-
Quantization Dominates Rank Reduction for KV-Cache Compression
Quantization of the KV cache beats rank reduction for matched storage budgets by 4-364 PPL, because dimension removal can flip attention token selection under softmax while bounded quantization noise usually preserves...
-
EdgeRazor: A Lightweight Framework for Large Language Models via Mixed-Precision Quantization-Aware Distillation
EdgeRazor delivers 1.58-1.88 bit quantized LLMs that outperform 2-3 bit baselines by up to 11.3 points while using 4-10x less training compute than leading QAT methods.
-
EdgeFlow: Fast Cold Starts for LLMs on Mobile Devices
EdgeFlow reduces mobile LLM cold-start latency up to 4.07x versus llama.cpp, MNN, and llm.npu by NPU-aware adaptive quantization, SIMD-friendly packing, and synergistic granular CPU-NPU pipelining at comparable accuracy.
-
MP-ISMoE: Mixed-Precision Interactive Side Mixture-of-Experts for Efficient Transfer Learning
MP-ISMoE uses Gaussian noise perturbed iterative quantization and interactive side mixture-of-experts to deliver higher accuracy than prior memory-efficient transfer learning methods while keeping similar parameter an...
-
DeFakeQ: Enabling Real-Time Deepfake Detection on Edge Devices via Adaptive Bidirectional Quantization
DeFakeQ introduces an adaptive bidirectional quantization method tailored for deepfake detectors that maintains detection accuracy while enabling real-time performance on resource-constrained edge devices.
-
Rethinking Residual Errors in Compensation-based LLM Quantization
Redefining residual errors to include compensation-aware discrepancies and realigning calibration to full-precision outputs improves GPTQ and GPTAQ performance on LLMs.
-
FP4 Explore, BF16 Train: Diffusion Reinforcement Learning via Efficient Rollout Scaling
Sol-RL decouples FP4-based candidate exploration from BF16 policy optimization in diffusion RL, delivering up to 4.64x faster convergence with maintained or superior alignment performance on models like FLUX.1 and SD3.5.
-
Your LLM Agent Can Leak Your Data: Data Exfiltration via Backdoored Tool Use
Back-Reveal shows that LLM agents with tool access can be backdoored via fine-tuning to exfiltrate stored user context through memory and retrieval tool calls, with multi-turn interactions enabling sustained leakage.
-
Semantic Trimming and Auxiliary Multi-step Prediction for Generative Recommendation
STAMP mitigates semantic dilution in SID-based generative recommendation via adaptive input pruning and densified output supervision, delivering 1.23-1.38x speedup and 17-55% VRAM savings with maintained or improved accuracy.
-
RUQuant: Towards Refining Uniform Quantization for Large Language Models
RUQuant uses block-wise composite orthogonal matrices from Householder reflections and Givens rotations plus a fine-tuned global reflection to achieve 99.8% full-precision accuracy at W6A6 and 97% at W4A4 for 13B LLMs...
-
Querying Structured Data Through Natural Language Using Language Models
Fine-tuning an 8B LLM with synthetic data enables accurate natural language querying of structured datasets like accessibility services in Spain, generalizing to new locations.
-
Compiling Code LLMs into Lightweight Executables
Ditto quantizes Code LLMs with K-Means codebooks and compiles inference via LLVM-BLAS replacement to deliver up to 10.5x faster, 6.4x smaller, and 10.5x lower-energy execution on commodity hardware while losing only 0...
-
NV-Embed: Improved Techniques for Training LLMs as Generalist Embedding Models
NV-Embed achieves first place on the MTEB leaderboard across 56 tasks by combining a latent attention layer, causal-mask removal, two-stage contrastive training, and data curation for LLM-based embedding models.
-
MiniCPM: Unveiling the Potential of Small Language Models with Scalable Training Strategies
MiniCPM 1.2B and 2.4B models reach parity with 7B-13B LLMs via model wind-tunnel scaling and a WSD scheduler that yields a higher optimal data-to-model ratio than Chinchilla scaling.
-
KIVI: A Tuning-Free Asymmetric 2bit Quantization for KV Cache
KIVI applies asymmetric 2-bit quantization to KV cache with per-channel keys and per-token values, reducing memory 2.6x and boosting throughput up to 3.47x with near-identical quality on Llama, Falcon, and Mistral.
-
GQA: Training Generalized Multi-Query Transformer Models from Multi-Head Checkpoints
Uptraining multi-head transformer checkpoints to grouped-query attention models achieves near multi-head quality at multi-query inference speeds using 5% additional compute.
-
Position: LLM Inference Should Be Evaluated as Energy-to-Token Production
LLM inference should be reframed and evaluated as energy-to-token production with a Token Production Function that accounts for power, cooling, and efficiency ceilings.
-
31.1 A 14.08-to-135.69Token/s ReRAM-on-Logic Stacked Outlier-Free Large-Language-Model Accelerator with Block-Clustered Weight-Compression and Adaptive Parallel-Speculative-Decoding
A ReRAM-on-logic stacked chip delivers 14.08-135.69 tokens/s LLM inference with block-clustered compression and adaptive parallel speculative decoding, yielding 4.46-7.17x speedup over standard methods.
-
HCInfer: An Efficient Inference System via Error Compensation for Resource-Constrained Devices
HCInfer recovers up to 5.2% accuracy over compressed LLMs and delivers 10.4x speedup versus full-precision models by offloading compensation parameters to CPU with async execution on resource-limited hardware.
-
Colinearity Decay: Training Quantization-Friendly ViTs with Outlier Decay
Colinearity-Decay regularizer trains ViTs that maintain or improve full-precision accuracy while delivering higher accuracy after low-bit quantization on ImageNet and COCO tasks.