arxiv: 2210.17323 · v2 · submitted 2022-10-31 · 💻 cs.LG

Recognition: 2 theorem links

GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers

Dan Alistarh, Elias Frantar, Saleh Ashkboos, Torsten Hoefler

Pith reviewed 2026-05-10 17:12 UTC · model grok-4.3

classification 💻 cs.LG

keywords post-training quantization · generative pre-trained transformers · GPT models · model compression · large language models · one-shot quantization · Hessian approximation · inference speedup

The pith

GPTQ quantizes 175 billion parameter GPT models to 3 or 4 bits per weight in about four GPU hours with negligible accuracy loss.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper presents GPTQ, a one-shot weight quantization method for large generative pre-trained transformer models. It relies on approximate second-order information to achieve high accuracy at low bit widths. The method is efficient enough to handle models as large as 175 billion parameters in roughly four hours on a GPU. This compression allows such models to run inference on a single GPU for the first time and delivers significant speedups over full precision versions.

Core claim

The authors show that their GPTQ method, based on approximate second-order information, can quantize GPT models with up to 175 billion parameters down to 3 or 4 bits per weight. This process takes approximately four GPU hours and results in negligible accuracy degradation relative to the uncompressed model. The approach more than doubles the compression gains of previous one-shot methods and enables single-GPU inference for these massive models, with observed speedups of 3.25x on A100 GPUs and 4.5x on A6000 GPUs.

What carries the argument

Approximate second-order information, specifically Hessian-based approximations, used to make layer-wise quantization decisions in a one-shot post-training process.
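
The mechanism named above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function names, the symmetric per-row grid, and the damping value are assumptions made for illustration, and optimizations such as blocked updates are left out. The sketch assumes a layer-wise Hessian proxy H = 2·X·Xᵀ built from calibration inputs X, takes the upper Cholesky factor of H⁻¹, and, after rounding each weight column, pushes the scaled rounding error onto the columns that have not been quantized yet.

```python
import numpy as np


def quantize_rtn(w, scale, qmax):
    """Round-to-nearest onto a symmetric uniform grid (one scale per row)."""
    return np.clip(np.round(w / scale), -qmax, qmax) * scale


def gptq_like_quantize(W, X, bits=4, damp=0.01):
    """Quantize W column by column with Hessian-based error compensation.

    W: (rows, cols) layer weights. X: (cols, n_samples) calibration inputs.
    All names and defaults here are illustrative assumptions.
    """
    W = W.astype(np.float64).copy()
    rows, cols = W.shape
    qmax = 2 ** (bits - 1) - 1
    scale = np.abs(W).max(axis=1) / qmax
    scale = np.where(scale == 0, 1.0, scale)  # guard against all-zero rows

    # Layer-wise Hessian proxy of the squared output error, plus damping.
    H = 2.0 * X @ X.T
    H += damp * np.mean(np.diag(H)) * np.eye(cols)

    # Upper Cholesky factor U of the inverse Hessian: inv(H) == U.T @ U.
    U = np.linalg.cholesky(np.linalg.inv(H)).T

    Q = np.zeros_like(W)
    for i in range(cols):
        q = quantize_rtn(W[:, i], scale, qmax)
        Q[:, i] = q
        err = (W[:, i] - q) / U[i, i]
        # Compensate: shift the remaining, not-yet-quantized columns.
        W[:, i + 1:] -= np.outer(err, U[i, i + 1:])
    return Q, scale
```

Calling `gptq_like_quantize(W, X, bits=4)` returns the quantized weights and the per-row scales. Comparing `np.linalg.norm(W @ X - Q @ X)` against plain round-to-nearest on the same grid is a simple way to check how much the compensation helps on a given layer.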

If this is right

175 billion parameter models become runnable for generative inference inside a single GPU.
Accuracy is preserved at 3-4 bit quantization, more than doubling prior compression gains for one-shot methods.
End-to-end inference achieves speedups of approximately 3.25x on high-end GPUs like the NVIDIA A100 and 4.5x on cost-effective ones like the NVIDIA A6000.
Reasonable accuracy holds in extreme cases of 2-bit or ternary quantization.
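
To put the single-GPU point in perspective, the raw weight storage at different bit widths can be worked out directly. This is a back-of-the-envelope sketch: the helper name is hypothetical, and the figures ignore quantization metadata (scales, zero points), any layers left unquantized, activations, and the KV cache, so they do not by themselves establish whether a model fits on a given GPU.

```python
def weight_storage_gb(n_params, bits_per_weight):
    """Raw weight storage in gigabytes (10^9 bytes), metadata excluded."""
    return n_params * bits_per_weight / 8 / 1e9


if __name__ == "__main__":
    for bits in (16, 4, 3):
        gb = weight_storage_gb(175e9, bits)
        print(f"{bits:>2}-bit weights: {gb:.1f} GB")
```

Under these assumptions, 175 billion weights take about 350 GB at 16 bits, 87.5 GB at 4 bits, and roughly 65.6 GB at 3 bits.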

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

GPTQ's efficiency could make quantization standard practice for deploying very large language models on limited hardware.
The one-shot nature suggests the approach may extend to other large transformer families with similar scale challenges.
Observed speedups point to new possibilities for interactive or real-time use of generative models that previously required multiple GPUs.

Load-bearing premise

The Hessian-based approximate second-order information stays sufficiently accurate for all layers in 175B-scale models without accumulating errors that would necessitate retraining.

What would settle it

A test showing large accuracy degradation or much longer quantization time when applying the method to a 175B parameter GPT model would disprove the central performance claims.
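
One way such a test could be scored is by comparing perplexity before and after quantization. The sketch below is an assumption about how a reader might set this up, not the paper's evaluation code: both function names are hypothetical, and the inputs are per-token natural-log probabilities from whichever model is being evaluated.

```python
import math


def perplexity(token_log_probs):
    """Perplexity = exp of the mean negative log-likelihood per token."""
    if not token_log_probs:
        raise ValueError("need at least one token log-probability")
    return math.exp(-sum(token_log_probs) / len(token_log_probs))


def relative_degradation(ppl_baseline, ppl_quantized):
    """Fractional perplexity increase of the quantized model over the baseline."""
    return (ppl_quantized - ppl_baseline) / ppl_baseline
```

What counts as "large" degradation is a judgment call that the threshold in any such test would have to make explicit.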

Original abstract

Generative Pre-trained Transformer models, known as GPT or OPT, set themselves apart through breakthrough performance across complex language modelling tasks, but also by their extremely high computational and storage costs. Specifically, due to their massive size, even inference for large, highly-accurate GPT models may require multiple performant GPUs, which limits the usability of such models. While there is emerging work on relieving this pressure via model compression, the applicability and performance of existing compression techniques is limited by the scale and complexity of GPT models. In this paper, we address this challenge, and propose GPTQ, a new one-shot weight quantization method based on approximate second-order information, that is both highly-accurate and highly-efficient. Specifically, GPTQ can quantize GPT models with 175 billion parameters in approximately four GPU hours, reducing the bitwidth down to 3 or 4 bits per weight, with negligible accuracy degradation relative to the uncompressed baseline. Our method more than doubles the compression gains relative to previously-proposed one-shot quantization methods, preserving accuracy, allowing us for the first time to execute an 175 billion-parameter model inside a single GPU for generative inference. Moreover, we also show that our method can still provide reasonable accuracy in the extreme quantization regime, in which weights are quantized to 2-bit or even ternary quantization levels. We show experimentally that these improvements can be leveraged for end-to-end inference speedups over FP16, of around 3.25x when using high-end GPUs (NVIDIA A100) and 4.5x when using more cost-effective ones (NVIDIA A6000). The implementation is available at https://github.com/IST-DASLab/gptq.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note

private letter to a colleague

GPTQ shows a practical Hessian-based one-shot method that quantizes 175B models to 3-4 bits in hours with small perplexity loss and real speedups, but the approximation's behavior at scale rests mostly on empirics.

GPTQ gives a workable route to drop 175B-scale GPT models to 3 or 4 bits per weight in roughly four GPU hours while keeping perplexity close to the full-precision baseline. The result is single-GPU inference and measured speedups of 3.25x on A100s and 4.5x on A6000s, which is the part that matters for deployment right now. They also push into 2-bit and ternary regimes with usable accuracy, though those are secondary.

The core technical step is adapting an approximate inverse-Hessian update, computed layer-wise on a small calibration set with Cholesky factorization and error compensation, to decide quantization choices. This is more than a simple extension of older Optimal Brain Surgeon ideas; the combination is tuned for transformer weight matrices at this size and delivers more than double the compression of prior one-shot baselines on the models they test. Releasing the code helps anyone who wants to reproduce or extend it. The experiments cover OPT models up to 175B and report concrete numbers, which is the main strength.

The soft spot is exactly the one the stress test flags: there is no derivation of error bounds on the Hessian approximation, no scaling study of how the Cholesky updates hold up when hidden dimensions reach roughly 12k, and limited ablation on calibration-set size versus model depth. The paper shows the method works empirically on the models tried, but if the second-order proxy drifts systematically at larger scales or with different data, the “negligible degradation” claim would need re-checking. That gap is real but not fatal given the results they do provide.

This paper is for people who need to run or serve large generative models under tight memory or latency constraints. Practitioners in efficient inference and quantization researchers will get immediate value from the numbers and the implementation. It shows clear engagement with the existing literature and produces falsifiable claims backed by experiments, so it deserves a serious referee.
I would send it for peer review.

Referee Report

3 major / 2 minor

Summary. The paper presents GPTQ, a one-shot post-training quantization method for large GPT and OPT models that uses approximate second-order (Hessian) information to quantize weights to 3-4 bits (and even 2-bit/ternary) while claiming negligible accuracy loss relative to FP16 baselines. It reports that a 175B-parameter model can be quantized in ~4 GPU hours, more than doubling prior one-shot compression ratios, enabling single-GPU generative inference, and delivering 3.25-4.5x end-to-end speedups on A100/A6000 hardware.

Significance. If the empirical claims hold, the work would be significant for efficient deployment of large language models: it offers a practical, retraining-free route to 3-4 bit inference on models previously requiring multiple high-end GPUs, with open-source code that could accelerate follow-on research in post-training compression.

major comments (3)

[Method section (quantization procedure and Hessian approximation)] The central claim that the layer-wise approximate inverse-Hessian (computed on a small calibration set via Cholesky updates) remains sufficiently accurate for 175B-scale models without compounding errors lacks any supporting analysis or bounds. No scaling study or error-propagation argument is given for layer dimensions of ~12k or for the depth of a 175B-parameter network.
[Experiments section (OPT-175B results and ablations)] Experimental results on OPT-175B report low perplexity degradation at 3-4 bits, but the manuscript supplies no ablation on calibration-set size versus model scale, no variance across multiple calibration draws or random seeds, and no comparison of Hessian approximation quality at different model widths.
[Introduction and Experiments] The claim of 'negligible accuracy degradation' and 'more than doubles the compression gains' relative to prior one-shot methods rests on the unverified assumption that second-order information stays reliable at this scale; without the missing analysis, the four-GPU-hour practicality claim cannot be generalized beyond the specific empirical runs shown.

minor comments (2)

[Abstract] The abstract states that the method 'more than doubles the compression gains' but does not name the exact prior one-shot baselines or report the precise ratio in the summary paragraph.
[Method] Notation for the per-layer Hessian update and error compensation step could be clarified with an explicit algorithm box or pseudocode to aid reproducibility.

Simulated Author's Rebuttal

3 responses · 1 unresolved

We thank the referee for the constructive feedback and positive assessment of the work's potential significance. We address each major comment point by point below, with proposed revisions to the manuscript where appropriate.

Referee: [Method section (quantization procedure and Hessian approximation)] The central claim that the layer-wise approximate inverse-Hessian (computed on a small calibration set via Cholesky updates) remains sufficiently accurate for 175B-scale models without compounding errors lacks any supporting analysis or bounds. No scaling study or error-propagation argument is given for layer dimensions of ~12k or for the depth of a 175B-parameter network.

Authors: We agree that the manuscript provides no formal theoretical bounds or error-propagation analysis for the layer-wise Hessian approximation. The method is presented as an efficient, practical approximation whose reliability is demonstrated empirically on models up to 175B parameters. Deriving rigorous bounds at this scale is a substantial theoretical undertaking that lies outside the paper's empirical focus. In the revision we will add a short discussion of the approximation's observed stability (based on per-layer quantization error and end-to-end perplexity) together with a scaling plot that compares performance from 1B to 175B models.

revision: partial

Referee: [Experiments section (OPT-175B results and ablations)] Experimental results on OPT-175B report low perplexity degradation at 3-4 bits, but the manuscript supplies no ablation on calibration-set size versus model scale, no variance across multiple calibration draws or random seeds, and no comparison of Hessian approximation quality at different model widths.

Authors: We accept that these ablations are missing and would strengthen the experimental section. The revised manuscript will include: (i) perplexity versus calibration-set size for models of varying scale, (ii) mean and standard deviation of perplexity over at least three independent calibration draws, and (iii) a comparison of per-layer quantization error for layers of different widths. These additions will be placed in the experiments section and the associated appendix.

revision: yes

Referee: [Introduction and Experiments] The claim of 'negligible accuracy degradation' and 'more than doubles the compression gains' relative to prior one-shot methods rests on the unverified assumption that second-order information stays reliable at this scale; without the missing analysis, the four-GPU-hour practicality claim cannot be generalized beyond the specific empirical runs shown.

Authors: The statements in the introduction and experiments are tied directly to the concrete empirical results reported for the tested models (including the 175B OPT model quantized in roughly four GPU hours). We do not assert that the second-order approximation is theoretically guaranteed to remain reliable at arbitrary scales. In the revision we will rephrase the relevant sentences to make the empirical basis explicit and to note that broader generalization remains an open question for future study.

revision: partial

standing simulated objections not resolved

Deriving formal theoretical bounds or an error-propagation argument for the approximate inverse-Hessian at 175B scale and layer widths of ~12k.

Circularity Check

0 steps flagged

No significant circularity; empirical method with independent validation

The paper presents GPTQ as a one-shot post-training quantization procedure that applies established approximate second-order (Hessian) information to weight quantization. All performance claims (roughly four GPU hours of runtime, 3-4 bit accuracy on 175B models) rest on direct experimental measurements rather than on any derivation that reduces by construction to the method's own fitted quantities or self-citations. No load-bearing step equates a claimed prediction to an input by definition, and the central accuracy result is externally falsifiable via perplexity and downstream metrics on held-out data.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central claim rests on the effectiveness of approximate second-order information for one-shot quantization at extreme scale; no free parameters or invented entities are explicitly introduced in the abstract.

axioms (1)

domain assumption: Approximate second-order information suffices to select quantization values that preserve model accuracy at 3-4 bits without retraining.
This assumption underpins the entire one-shot procedure described.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

Cost.FunctionalEquation washburn_uniqueness_aczel unclear
GPTQ... based on approximate second-order information... quantize GPT models with 175 billion parameters in approximately four GPU hours, reducing the bitwidth down to 3 or 4 bits per weight, with negligible accuracy degradation
Foundation.AlphaCoordinateFixation costAlphaLog_fourth_deriv_at_zero unclear
The approximate second-order information (Hessian-based) remains sufficiently accurate for guiding quantization decisions across all layers of 175B-scale models

Forward citations

Cited by 60 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Layer Collapse in Diffusion Language Models
cs.LG 2026-05 unverdicted novelty 7.0

Diffusion language models develop early-layer collapse around an indispensable super-outlier due to overtraining, resulting in higher compressibility and reversed optimal sparsity patterns versus autoregressive models.
Layer Collapse in Diffusion Language Models
cs.LG 2026-05 conditional novelty 7.0

Early layers in diffusion language models like LLaDA-8B collapse into redundant representations around a critical super-outlier activation due to overtraining, making them more robust to quantization and sparsity than...
When Quantization Is Free: An int4 KV Cache That Outruns fp16 on Apple Silicon
cs.PF 2026-05 unverdicted novelty 7.0

A single fused int4 KV cache kernel on Apple Silicon outperforms fp16 in latency with 3x memory compression and near-zero quality loss on tested models.
Quantamination: Dynamic Quantization Leaks Your Data Across the Batch
cs.CR 2026-04 conditional novelty 7.0

Dynamic quantization creates side channels allowing partial or full recovery of other users' batched data in at least four popular ML frameworks.
Variance Is Not Importance: Structural Analysis of Transformer Compressibility Across Model Scales
cs.LG 2026-04 unverdicted novelty 7.0

High-variance activation directions are uncorrelated with predictions, transformer blocks grow more linear with depth, and single-block linear replacement yields 34x compression on Mistral's final block at a 1.71 perp...
When W4A4 Breaks Camouflaged Object Detection: Token-Group Dual-Constraint Activation Quantization
cs.CV 2026-04 unverdicted novelty 7.0

COD-TDQ uses token-group scaling and dual-constraint projection to fix 4-bit activation quantization for camouflaged object detection, delivering more than 0.12 higher Sα scores than prior methods on four benchmarks w...
When Flat Minima Fail: Characterizing INT4 Quantization Collapse After FP32 Convergence
cs.LG 2026-04 conditional novelty 7.0

FP32-converged language models enter a post-convergence phase where INT4 quantization error explodes while FP32 perplexity remains stable, with onset tied to fine convergence rather than learning rate decay.
Hidden Reliability Risks in Large Language Models: Systematic Identification of Precision-Induced Output Disagreements
cs.AI 2026-04 unverdicted novelty 7.0

PrecisionDiff is a differential testing framework that uncovers widespread precision-induced behavioral disagreements in aligned LLMs, including safety-critical jailbreak divergences across precision formats.
ENEC: A Lossless AI Model Compression Method Enabling Fast Inference on Ascend NPUs
cs.AR 2026-03 unverdicted novelty 7.0

ENEC delivers 3.43X higher throughput than DietGPU and 1.12X better compression ratio than nvCOMP for lossless model weight compression on Ascend NPUs, yielding up to 6.3X end-to-end inference speedup.
Medusa: Simple LLM Inference Acceleration Framework with Multiple Decoding Heads
cs.LG 2024-01 conditional novelty 7.0

Medusa augments LLMs with multiple decoding heads and tree-based attention to predict and verify several tokens in parallel, yielding 2.2-3.6x inference speedup via two fine-tuning regimes.
QLoRA: Efficient Finetuning of Quantized LLMs
cs.LG 2023-05 conditional novelty 7.0

QLoRA finetunes 4-bit quantized LLMs via LoRA adapters to match full-precision performance while using far less memory, enabling 65B-scale training on single GPUs and producing Guanaco models near ChatGPT level.
Search Your Block Floating Point Scales!
cs.LG 2026-05 unverdicted novelty 6.0

ScaleSearch optimizes block floating point scales via fine-grained search to cut quantization error by 27% for NVFP4, improving PTQ by up to 15 points on MATH500 for Qwen3-8B and attention PPL by 0.77 on Llama 3.1 70B.
ADMM-Q: An Improved Hessian-based Weight Quantizer for Post-Training Quantization of Large Language Models
cs.LG 2026-05 unverdicted novelty 6.0

ADMM-Q is a new post-training quantization method using ADMM operator splitting that reduces WikiText-2 perplexity compared to GPTQ on Qwen3-8B across W3A16, W4A8, and W2A4KV4 settings.
Fitting Is Not Enough: Smoothness in Extremely Quantized LLMs
cs.CL 2026-05 unverdicted novelty 6.0

Extremely quantized LLMs degrade in smoothness, sparsifying the decoding tree and hurting generation quality; a smoothness-preserving principle delivers gains beyond numerical fitting.
Different Prompts, Different Ranks: Prompt-aware Dynamic Rank Selection for SVD-based LLM Compression
cs.LG 2026-05 unverdicted novelty 6.0

PARSE trains a prompt-aware linear router on dense-model outputs to select dynamic SVD ranks, improving accuracy up to 10% at 0.6 compression ratio on LLaMA-7B while delivering 2.5x prefill and 2.4x decode speedups.
XtraMAC: An Efficient MAC Architecture for Mixed-Precision LLM Inference on FPGA
cs.AR 2026-05 conditional novelty 6.0

XtraMAC unifies mixed-precision MAC on FPGA via shared integer mantissa products, delivering 1.4-2.0x higher compute density and up to 1.9x better energy efficiency.
OSAQ: Outlier Self-Absorption for Accurate Low-bit LLM Quantization
cs.LG 2026-05 unverdicted novelty 6.0

OSAQ uses the low-rank structure of the Hessian to construct a closed-form additive weight transformation that suppresses outliers without changing task loss, enabling better low-bit LLM quantization.
OSAQ: Outlier Self-Absorption for Accurate Low-bit LLM Quantization
cs.LG 2026-05 unverdicted novelty 6.0

OSAQ suppresses weight outliers in LLMs via a closed-form additive transformation from the Hessian's stable null space, improving 2-bit quantization perplexity by over 40% versus vanilla GPTQ with no inference overhead.
Compress Then Adapt? No, Do It Together via Task-aware Union of Subspaces
cs.AI 2026-05 unverdicted novelty 6.0

JACTUS unifies low-rank compression and task adaptation via a task-aware union of subspaces and global rank allocation by marginal gain, outperforming 100% PEFT methods like DoRA on ViT-Base (89.2% avg) and Llama2-7B ...
WindowQuant: Mixed-Precision KV Cache Quantization based on Window-Level Similarity for VLMs Inference Optimization
cs.CV 2026-05 unverdicted novelty 6.0

WindowQuant performs window-adaptive mixed-precision KV cache quantization guided by similarity to the text prompt, with reordering to enable efficient inference in VLMs.
Sharpness-Aware Pretraining Mitigates Catastrophic Forgetting
cs.LG 2026-05 unverdicted novelty 6.0

Sharpness-aware pretraining and related flat-minima interventions reduce catastrophic forgetting by up to 80% after post-training across 20M-150M models and by 31-40% at 1B scale.
Technical Report: Activation Residual Hessian Quantization (ARHQ) for Low-Bit LLM Quantization
cs.LG 2026-04 unverdicted novelty 6.0

ARHQ isolates error-sensitive weight directions in LLMs via truncated SVD on the scaled matrix W G_x^{1/2} from activation residuals, improving SNR and preserving performance under aggressive low-bit quantization.
CoQuant: Joint Weight-Activation Subspace Projection for Mixed-Precision LLMs
cs.LG 2026-04 unverdicted novelty 6.0

CoQuant selects optimal high-precision subspaces for mixed-precision LLM quantization via a closed-form weighted PCA that balances weight and activation covariances derived from expected output error.
BitRL: Reinforcement Learning with 1-bit Quantized Language Models for Resource-Constrained Edge Deployment
cs.LG 2026-04 unverdicted novelty 6.0

BitRL enables on-device RL agents via 1-bit quantized language models, delivering 10-16x memory reduction and 3-5x energy efficiency gains with 85-98% retained performance.
FASQ: Flexible Accelerated Subspace Quantization for Calibration-Free LLM Compression
cs.LG 2026-04 unverdicted novelty 6.0

FASQ delivers calibration-free LLM compression with continuous size trade-offs via product quantization and custom CUDA kernels that accelerate decode beyond FP16 speeds on consumer hardware.
MCAP: Deployment-Time Layer Profiling for Memory-Constrained LLM Inference
cs.LG 2026-04 unverdicted novelty 6.0

MCAP uses load-time Monte Carlo profiling to estimate layer importance, enabling dynamic quantization (W4A8 vs W4A16) and memory tiering (GPU/RAM/SSD) that delivers 1.5-1.8x higher decode throughput than llama-cpp Q4_...
From Signal Degradation to Computation Collapse: Uncovering the Two Failure Modes of LLM Quantization
cs.CL 2026-04 unverdicted novelty 6.0

LLM 2-bit quantization fails via either cumulative signal degradation or early computation collapse in key components.
GRASPrune: Global Gating for Budgeted Structured Pruning of Large Language Models
cs.AI 2026-04 unverdicted novelty 6.0

GRASPrune removes 50% of parameters from LLaMA-2-7B via global gating and projected straight-through estimation, reaching 12.18 WikiText-2 perplexity and competitive zero-shot accuracy after four epochs on 512 calibra...
Are Large Language Models Economically Viable for Industry Deployment?
cs.CL 2026-04 unverdicted novelty 6.0

Small LLMs under 2B parameters achieve better economic break-even, energy efficiency, and hardware density than larger models on legacy GPUs for industrial tasks.
LBLLM: Lightweight Binarization of Large Language Models via Three-Stage Distillation
cs.LG 2026-04 unverdicted novelty 6.0

LBLLM achieves better accuracy than prior binarization methods for LLMs by decoupling weight and activation quantization through initialization, layer-wise distillation, and learnable activation scaling.
SAW-INT4: System-Aware 4-Bit KV-Cache Quantization for Real-World LLM Serving
cs.LG 2026-04 unverdicted novelty 6.0

Token-wise INT4 KV-cache quantization plus block-diagonal Hadamard rotation recovers nearly all accuracy lost by naive INT4 while adding zero end-to-end overhead under paged serving constraints.
Benchmarking System Dynamics AI Assistants: Cloud Versus Local LLMs on CLD Extraction and Discussion
cs.AI 2026-04 unverdicted novelty 6.0

Cloud LLMs reach 77-89% on CLD extraction while the best local model hits 77%; local models perform well on model-building steps but drop to 0-50% on error fixing due to long-context memory limits, with backend choice...
GSQ: Highly-Accurate Low-Precision Scalar Quantization for LLMs via Gumbel-Softmax Sampling
cs.CL 2026-04 unverdicted novelty 6.0

GSQ applies a Gumbel-Softmax relaxation to learn discrete grid assignments in scalar quantization, closing most of the accuracy gap to vector methods like QTIP on Llama-3.1 models at 2-3 bits while using only symmetri...
Depth Registers Unlock W4A4 on SwiGLU: A Reader/Generator Decomposition
cs.CL 2026-04 unverdicted novelty 6.0

Depth Registers plus hinge loss cut W4A4-induced perplexity collapse from 1727 to 119 in a 300M SwiGLU model by selectively taming reader-layer activations while leaving bilinear generator tails largely untouched.
Open-TQ-Metal: Fused Compressed-Domain Attention for Long-Context LLM Inference on Apple Silicon
cs.LG 2026-04 unverdicted novelty 6.0

Fused compressed-domain int4 attention on Apple Silicon delivers 48x speedup and 3.2x KV cache compression for 128K-context 70B models while matching FP16 token predictions.
MemExplorer: Navigating the Heterogeneous Memory Design Space for Agentic Inference NPUs
cs.AR 2026-04 unverdicted novelty 6.0

MemExplorer optimizes heterogeneous memory systems for agentic LLM inference on NPUs and reports up to 2.3x higher energy efficiency than baselines under fixed power budgets.
The Illusion of Equivalence: Systematic FP16 Divergence in KV-Cached Autoregressive Inference
cs.LG 2026-04 unverdicted novelty 6.0

FP16 KV caching in transformers causes deterministic token divergence versus cache-free inference due to non-associative floating-point accumulation orderings.
Robust Ultra Low-Bit Post-Training Quantization via Stable Diagonal Curvature Estimate
cs.LG 2026-04 unverdicted novelty 6.0

DASH-Q uses a stable diagonal curvature estimate and weighted least squares to achieve robust ultra-low-bit post-training quantization of LLMs, improving zero-shot accuracy by 7% on average over baselines.
Numerical Instability and Chaos: Quantifying the Unpredictability of Large Language Models
cs.AI 2026-04 unverdicted novelty 6.0

Large language models display three universal scale-dependent regimes of behavior—stable, chaotic, and signal-dominated—driven by floating-point rounding errors that produce an avalanche effect in early layers.
EdgeCIM: A Hardware-Software Co-Design for CIM-Based Acceleration of Small Language Models
cs.AR 2026-04 unverdicted novelty 6.0

A CIM-based hardware-software co-design in 65nm achieves up to 7.3x higher throughput and 49.59x better energy efficiency than NVIDIA Orin Nano for LLaMA3.2-1B, averaging 336 tokens/s and 173 tokens/J under INT4 acros...
Quantization Dominates Rank Reduction for KV-Cache Compression
cs.LG 2026-04 conditional novelty 6.0

Quantization of the KV cache beats rank reduction for matched storage budgets by 4-364 PPL, because dimension removal can flip attention token selection under softmax while bounded quantization noise usually preserves...
EdgeRazor: A Lightweight Framework for Large Language Models via Mixed-Precision Quantization-Aware Distillation
cs.LG 2026-04 unverdicted novelty 6.0

EdgeRazor delivers 1.58-1.88 bit quantized LLMs that outperform 2-3 bit baselines by up to 11.3 points while using 4-10x less training compute than leading QAT methods.
EdgeFlow: Fast Cold Starts for LLMs on Mobile Devices
cs.OS 2026-04 unverdicted novelty 6.0

EdgeFlow reduces mobile LLM cold-start latency up to 4.07x versus llama.cpp, MNN, and llm.npu by NPU-aware adaptive quantization, SIMD-friendly packing, and synergistic granular CPU-NPU pipelining at comparable accuracy.
MP-ISMoE: Mixed-Precision Interactive Side Mixture-of-Experts for Efficient Transfer Learning
cs.LG 2026-04 unverdicted novelty 6.0

MP-ISMoE uses Gaussian noise perturbed iterative quantization and interactive side mixture-of-experts to deliver higher accuracy than prior memory-efficient transfer learning methods while keeping similar parameter an...
DeFakeQ: Enabling Real-Time Deepfake Detection on Edge Devices via Adaptive Bidirectional Quantization
cs.CV 2026-04 unverdicted novelty 6.0

DeFakeQ introduces an adaptive bidirectional quantization method tailored for deepfake detectors that maintains detection accuracy while enabling real-time performance on resource-constrained edge devices.
Rethinking Residual Errors in Compensation-based LLM Quantization
cs.LG 2026-04 conditional novelty 6.0

Redefining residual errors to include compensation-aware discrepancies and realigning calibration to full-precision outputs improves GPTQ and GPTAQ performance on LLMs.
FP4 Explore, BF16 Train: Diffusion Reinforcement Learning via Efficient Rollout Scaling
cs.LG 2026-04 unverdicted novelty 6.0

Sol-RL decouples FP4-based candidate exploration from BF16 policy optimization in diffusion RL, delivering up to 4.64x faster convergence with maintained or superior alignment performance on models like FLUX.1 and SD3.5.
Your LLM Agent Can Leak Your Data: Data Exfiltration via Backdoored Tool Use
cs.CR 2026-04 unverdicted novelty 6.0

Back-Reveal shows that LLM agents with tool access can be backdoored via fine-tuning to exfiltrate stored user context through memory and retrieval tool calls, with multi-turn interactions enabling sustained leakage.
Semantic Trimming and Auxiliary Multi-step Prediction for Generative Recommendation
cs.IR 2026-04 unverdicted novelty 6.0

STAMP mitigates semantic dilution in SID-based generative recommendation via adaptive input pruning and densified output supervision, delivering 1.23-1.38x speedup and 17-55% VRAM savings with maintained or improved accuracy.
RUQuant: Towards Refining Uniform Quantization for Large Language Models
cs.CL 2026-04 unverdicted novelty 6.0

RUQuant uses block-wise composite orthogonal matrices from Householder reflections and Givens rotations plus a fine-tuned global reflection to achieve 99.8% full-precision accuracy at W6A6 and 97% at W4A4 for 13B LLMs...
Querying Structured Data Through Natural Language Using Language Models
cs.CL 2026-04 conditional novelty 6.0

Fine-tuning an 8B LLM with synthetic data enables accurate natural language querying of structured datasets like accessibility services in Spain, generalizing to new locations.
Compiling Code LLMs into Lightweight Executables
cs.SE 2026-03 conditional novelty 6.0

Ditto quantizes Code LLMs with K-Means codebooks and compiles inference via LLVM-BLAS replacement to deliver up to 10.5x faster, 6.4x smaller, and 10.5x lower-energy execution on commodity hardware while losing only 0...
NV-Embed: Improved Techniques for Training LLMs as Generalist Embedding Models
cs.CL 2024-05 accept novelty 6.0

NV-Embed achieves first place on the MTEB leaderboard across 56 tasks by combining a latent attention layer, causal-mask removal, two-stage contrastive training, and data curation for LLM-based embedding models.
MiniCPM: Unveiling the Potential of Small Language Models with Scalable Training Strategies
cs.CL 2024-04 conditional novelty 6.0

MiniCPM 1.2B and 2.4B models reach parity with 7B-13B LLMs via model wind-tunnel scaling and a WSD scheduler that yields a higher optimal data-to-model ratio than Chinchilla scaling.
KIVI: A Tuning-Free Asymmetric 2bit Quantization for KV Cache
cs.CL 2024-02 conditional novelty 6.0

KIVI applies asymmetric 2-bit quantization to KV cache with per-channel keys and per-token values, reducing memory 2.6x and boosting throughput up to 3.47x with near-identical quality on Llama, Falcon, and Mistral.
GQA: Training Generalized Multi-Query Transformer Models from Multi-Head Checkpoints
cs.CL 2023-05 unverdicted novelty 6.0

Uptraining multi-head transformer checkpoints to grouped-query attention models achieves near multi-head quality at multi-query inference speeds using 5% additional compute.
Position: LLM Inference Should Be Evaluated as Energy-to-Token Production
cs.CE 2026-05 unverdicted novelty 5.0

LLM inference should be reframed and evaluated as energy-to-token production with a Token Production Function that accounts for power, cooling, and efficiency ceilings.
31.1 A 14.08-to-135.69Token/s ReRAM-on-Logic Stacked Outlier-Free Large-Language-Model Accelerator with Block-Clustered Weight-Compression and Adaptive Parallel-Speculative-Decoding
cs.AR 2026-05 unverdicted novelty 5.0

A ReRAM-on-logic stacked chip delivers 14.08-135.69 tokens/s LLM inference with block-clustered compression and adaptive parallel speculative decoding, yielding 4.46-7.17x speedup over standard methods.
HCInfer: An Efficient Inference System via Error Compensation for Resource-Constrained Devices
cs.LG 2026-05 unverdicted novelty 5.0

HCInfer recovers up to 5.2% accuracy over compressed LLMs and delivers 10.4x speedup versus full-precision models by offloading compensation parameters to CPU with async execution on resource-limited hardware.
Colinearity Decay: Training Quantization-Friendly ViTs with Outlier Decay
cs.CV 2026-05 unverdicted novelty 5.0

Colinearity-Decay regularizer trains ViTs that maintain or improve full-precision accuracy while delivering higher accuracy after low-bit quantization on ImageNet and COCO tasks.

Reference graph

Works this paper leans on

20 extracted references · 20 canonical work pages · cited by 77 Pith papers · 4 internal anchors

[1]

On the dangers of stochastic parrots: Can language models be too big?

Emily M Bender, Timnit Gebru, Angelina McMillan-Major, and Shmargaret Shmitchell. On the dangers of stochastic parrots: Can language models be too big? In 2021 ACM Conference on Fairness, Accountability, and Transparency,

work page 2021
[2]

A systematic classification of knowledge, reasoning, and context within the ARC dataset

Michael Boratko, Harshit Padigela, Divyendra Mikkilineni, Pritish Yuvraj, Rajarshi Das, Andrew McCallum, Maria Chang, Achille Fokoue-Nkoutche, Pavan Kapanipathi, Nicholas Mattei, et al. A systematic classification of knowledge, reasoning, and context within the ARC dataset. arXiv preprint arXiv:1806.00358, 2018.

work page arXiv
[3]

FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness

Tri Dao, Daniel Y Fu, Stefano Ermon, Atri Rudra, and Christopher Ré. FlashAttention: Fast and memory-efficient exact attention with IO-awareness. arXiv preprint arXiv:2205.14135,

work page internal anchor Pith review arXiv
[4]

LLM.int8(): 8-bit Matrix Multiplication for Transformers at Scale

10 Published as a conference paper at ICLR 2023 Tim Dettmers, Mike Lewis, Younes Belkada, and Luke Zettlemoyer. LLM.int8(): 8-bit matrix multiplication for transformers at scale. arXiv preprint arXiv:2208.07339,

work page internal anchor Pith review arXiv 2023
[5]

Optimal Brain Compression: A framework for accurate post-training quantization and pruning

Elias Frantar, Sidak Pal Singh, and Dan Alistarh. Optimal Brain Compression: A framework for accurate post-training quantization and pruning. arXiv preprint arXiv:2208.11580. Accepted to NeurIPS 2022, to appear.

work page arXiv
[6]

A survey of quantization methods for efficient neural network inference

Amir Gholami, Sehoon Kim, Zhen Dong, Zhewei Yao, Michael W Mahoney, and Kurt Keutzer. A survey of quantization methods for efficient neural network inference. arXiv preprint arXiv:2103.13630,

work page arXiv 2022
[7]

Sparsity in deep learning: Pruning and growth for efficient inference and training in neural networks

Torsten Hoefler, Dan Alistarh, Tal Ben-Nun, Nikoli Dryden, and Alexandra Peste. Sparsity in deep learning: Pruning and growth for efficient inference and training in neural networks. arXiv preprint arXiv:2102.00554,

work page arXiv
[8]

Improving post training neural quantization: Layer-wise calibration and integer programming

Itay Hubara, Yury Nahshan, Yair Hanani, Ron Banner, and Daniel Soudry. Improving post training neural quantization: Layer-wise calibration and integer programming. arXiv preprint arXiv:2006.10518,

work page arXiv 2006
[9]

The penn treebank: Annotating predicate argument structure

Mitch Marcus, Grace Kim, Mary Ann Marcinkiewicz, Robert MacIntyre, Ann Bies, Mark Ferguson, Karen Katz, and Britta Schasberger. The penn treebank: Annotating predicate argument structure. In Human Language Technology: Proceedings of a Workshop held at Plainsboro, New Jersey, March 8-11, 1994,

work page 1994
[10]

Pointer Sentinel Mixture Models

Stephen Merity, Caiming Xiong, James Bradbury, and Richard Socher. Pointer sentinel mixture models. arXiv preprint arXiv:1609.07843,

work page internal anchor Pith review arXiv
[11]

A white paper on neural network quantization

Markus Nagel, Marios Fournarakis, Rana Ali Amjad, Yelysei Bondarenko, Mart van Baalen, and Tijmen Blankevoort. A white paper on neural network quantization. arXiv preprint arXiv:2106.08295,

work page arXiv
[12]

The LAMBADA dataset: Word prediction requiring a broad discourse context

Denis Paperno, Germán Kruszewski, Angeliki Lazaridou, Quan Ngoc Pham, Raffaella Bernardi, Sandro Pezzelle, Marco Baroni, Gemma Boleda, and Raquel Fernández. The LAMBADA dataset: Word prediction requiring a broad discourse context. arXiv preprint arXiv:1606.06031,

work page Pith review arXiv
[13]

nuQmm: Quantized matmul for efficient inference of large-scale generative language models

Gunho Park, Baeseong Park, Se Jung Kwon, Byeongwook Kim, Youngjoo Lee, and Dongsoo Lee. nuQmm: Quantized matmul for efficient inference of large-scale generative language models. arXiv preprint arXiv:2206.09557,

work page arXiv 2023
[14]

Extreme compression for pre-trained transformers made simple and efficient

Xiaoxia Wu, Zhewei Yao, Minjia Zhang, Conglong Li, and Yuxiong He. Extreme compression for pre-trained transformers made simple and efficient. arXiv preprint arXiv:2206.01859,

work page arXiv
[15]

ZeroQuant: Efficient and affordable post-training quantization for large-scale transformers

Zhewei Yao, Reza Yazdani Aminabadi, Minjia Zhang, Xiaoxia Wu, Conglong Li, and Yuxiong He. ZeroQuant: Efficient and affordable post-training quantization for large-scale transformers. arXiv preprint arXiv:2206.01861,

work page arXiv
[16]

OPT: Open Pre-trained Transformer Language Models

Susan Zhang, Stephen Roller, Naman Goyal, Mikel Artetxe, Moya Chen, Shuohui Chen, Christopher Dewan, Mona Diab, Xian Li, Xi Victoria Lin, et al. OPT: Open pre-trained transformer language models. arXiv preprint arXiv:2205.01068,

work page internal anchor Pith review arXiv
[17]

Alpa: Automating inter- and intra-operator parallelism for distributed deep learning

Lianmin Zheng, Zhuohan Li, Hao Zhang, Yonghao Zhuang, Zhifeng Chen, Yanping Huang, Yida Wang, Yuanzhong Xu, Danyang Zhuo, Joseph E Gonzalez, et al. Alpa: Automating inter- and intra-operator parallelism for distributed deep learning. arXiv preprint arXiv:2201.12023,

work page arXiv
[18]

(2016) and OPT-125M/WikiText2, which is one of the largest models to which OBQ can be reasonably applied

A APPENDIX. A.1 ADDITIONAL COMPARISON WITH OBQ. We now provide an additional comparison between GPTQ and OBQ on BERT-base/SQuAD Rajpurkar et al. (2016) and OPT-125M/WikiText2, which is one of the largest models to which OBQ can be reasonably applied. Method BERT-base OPT-125M 88.53 F1↑ 27.66 PPL↓ 4bit 3bit...

work page 2023
[19]

For non-generative and large-batch applications, operations may be compute- rather than memory-bound and our kernels thus not directly applicable

where the underlying (close to) matrix-vector products are memory-bound. For non-generative and large-batch applications, operations may be compute- rather than memory-bound and our kernels thus not directly applicable. Instead, one could simply decompress the matrix before performing the corresponding matrix-matrix calculations: this takes < 1.5ms on an...

work page 2022
[20]

We note that the calibration data used by GPTQ is sampled from the C4 training set, this task is thus not fully zero-shot

GPTQ 3 35.78 28.83 25.34 21.25 17.67 12.27 Table 12: BLOOM perplexity results for C4. We note that the calibration data used by GPTQ is sampled from the C4 training set, this task is thus not fully zero-shot. A.4 ADDITIONAL ZEROSHOT RESULTS. This section contains additional results for zero-shot tasks. OPT B...

work page 2023