
SpinQuant: LLM quantization with learned rotations

Zechun Liu, Changsheng Zhao, Igor Fedorov, Bilge Soran, Dhruv Choudhary, Raghuraman Krishnamoorthi · 2024 · cs.LG · arXiv 2405.16406

Canonical reference. 80% of citing Pith papers cite this work as background.

49 Pith papers citing it

Background 80% of classified citations


abstract

Post-training quantization (PTQ) techniques applied to weights, activations, and the KV cache greatly reduce memory usage, latency, and power consumption of Large Language Models (LLMs), but may lead to large quantization errors when outliers are present. Rotating activation or weight matrices helps remove outliers and benefits quantization. In this work, we identify a collection of applicable rotation parameterizations that lead to identical outputs in full-precision Transformer architectures while enhancing quantization accuracy. In addition, we find that some random rotations lead to much better quantization than others, with an up to 13 points difference in downstream zero-shot reasoning performance. As a result, we propose SpinQuant, a novel approach that incorporates learned rotation matrices for optimal quantized network accuracy. With 4-bit quantization of weight, activation, and KV-cache, SpinQuant narrows the accuracy gap on zero-shot reasoning tasks with full precision to merely 2.9 points on the LLaMA-2 7B model, surpassing LLM-QAT by 19.1 points and SmoothQuant by 25.0 points. Furthermore, SpinQuant also outperforms concurrent work QuaRot, which applies random rotations to remove outliers. In particular, for LLaMA-3 8B models that are hard to quantize, SpinQuant reduces the gap to full precision by up to 45.1% relative to QuaRot. Code is available at https://github.com/facebookresearch/SpinQuant.
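The two claims at the start of the abstract (a rotation leaves full-precision outputs unchanged, and it spreads outliers so quantization loses less) can be checked on a toy linear layer. The sketch below is only an illustration under stated assumptions: it uses a random orthogonal matrix rather than the learned rotations SpinQuant proposes, the data is synthetic with one artificially inflated channel, and the helper names `random_rotation` and `fake_quant` are hypothetical and do not come from the SpinQuant codebase.

```python
import numpy as np


def random_rotation(dim, rng):
    # QR of a Gaussian matrix yields an orthogonal Q; multiplying each column
    # by the sign of R's diagonal removes the sign ambiguity of the factorization.
    q, r = np.linalg.qr(rng.standard_normal((dim, dim)))
    return q * np.sign(np.diag(r))


def fake_quant(x, bits=4):
    # Symmetric per-row round-to-nearest quantization followed by dequantization
    # (a simple stand-in for per-token activation quantization).
    qmax = 2 ** (bits - 1) - 1
    scale = np.abs(x).max(axis=-1, keepdims=True) / qmax
    return np.round(x / scale).clip(-qmax - 1, qmax) * scale


rng = np.random.default_rng(0)
dim = 64
x = rng.standard_normal((32, dim))   # synthetic activations (assumption)
x[:, 3] *= 20.0                      # one artificially inflated outlier channel
w = rng.standard_normal((dim, dim))  # synthetic weights (assumption)
r = random_rotation(dim, rng)

# Rotate activations by R and weights by R^T: (x R)(R^T w) = x w in full precision.
x_rot, w_rot = x @ r, r.T @ w
full_precision_gap = np.abs(x @ w - x_rot @ w_rot).max()

# Quantize only the activations to 4 bits and compare output error with and without rotation.
err_plain = np.linalg.norm(fake_quant(x) @ w - x @ w)
err_rot = np.linalg.norm(fake_quant(x_rot) @ w_rot - x @ w)

print(f"full-precision gap: {full_precision_gap:.2e}")
print(f"4-bit output error without rotation: {err_plain:.2f}")
print(f"4-bit output error with rotation:    {err_rot:.2f}")
```

In this toy setting the rotated layer matches the original up to floating-point error, and its 4-bit output error is smaller because the outlier channel no longer dictates the per-row quantization scale. How much a given rotation helps depends on the rotation itself, which is the variation the abstract reports and the motivation for learning it.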


citation-role summary

background 4 · baseline 1

citation-polarity summary

background 4 · baseline 1

representative citing papers

GPTQ-intrinsic LoRA: A Near-optimal Algorithm for Low-precision Quantization with Low-rank Adaptation

cs.LG · 2026-05-31 · unverdicted · novelty 8.0

GPTQ-intrinsic LoRA augments GPTQ with intrinsic low-rank compensation via Hessian modification to achieve layer-wise reconstruction bounds that match information-theoretic lower bounds under structural assumptions.

Every Eval Ever: A Unifying Schema and Community Repository for AI Evaluation Results

cs.AI · 2026-06-12 · unverdicted · novelty 7.0

Introduces the first community-governed unified JSON schema and crowdsourced repository for AI evaluation results, with converters and a database spanning 22,235 models and 2,273 benchmarks.

Rotation-Aligned Key Channel Pruning for Efficient Vision-Language Model Inference

cs.CV · 2026-05-19 · unverdicted · novelty 7.0

RotateK uses online PCA-based rotation to align token-dependent key channel importance into a shared subspace, enabling accurate head-wise structured pruning and faster decoding in VLMs compared to prior token or channel methods.

When Quantization Is Free: An int4 KV Cache That Outruns fp16 on Apple Silicon

cs.PF · 2026-05-07 · unverdicted · novelty 7.0

A single fused int4 KV cache kernel on Apple Silicon outperforms fp16 in latency with 3x memory compression and near-zero quality loss on tested models.

Coverage-Based Calibration for Post-Training Quantization via Weighted Set Cover over Outlier Channels

cs.LG · 2026-04-27 · conditional · novelty 7.0

COVERCAL selects PTQ calibration samples via weighted set cover over outlier channels, with a stylized clipping model showing missed coverage upper-bounds surrogate loss, yielding gains over random and other baselines on LLaMA and Mistral models.

Variance Is Not Importance: Structural Analysis of Transformer Compressibility Across Model Scales

cs.LG · 2026-04-22 · unverdicted · novelty 7.0

High-variance activation directions are uncorrelated with predictions, transformer blocks grow more linear with depth, and single-block linear replacement yields 34x compression on Mistral's final block at a 1.71 perplexity cost.

Four Over Six: More Accurate NVFP4 Quantization with Adaptive Block Scaling

cs.CL · 2025-12-01 · conditional · novelty 7.0

Four Over Six adaptively scales blocks in NVFP4 quantization to smaller FP4 values, making representable value distributions more uniform and reducing quantization error especially for near-maximal values.

SharQ: Bridging Activation Sparsity and FP4 Quantization for LLM Inference

cs.LG · 2026-06-25 · unverdicted · novelty 6.0

SharQ combines input-adaptive N:M sparsity and FP4 quantization via sparse backbone plus dense residual, recovering 43-63% of the NVFP4-to-FP16 accuracy gap on Llama and Qwen models without calibration or retraining.

MonaVec: A Training-Free Embedded Vector Search Kernel for Edge and Offline AI Systems

cs.IR · 2026-06-17 · unverdicted · novelty 6.0

MonaVec provides a training-free 4-bit vector quantization and deterministic search kernel using Randomized Hadamard Transform and ChaCha20 seeding for embedded and offline use.

dMX: Differentiable Mixed-Precision Assignment for Low-Precision Floating-Point Formats

cs.LG · 2026-06-02 · unverdicted · novelty 6.0

dMX is a differentiable mixed-precision framework that learns per-layer MXFP bit-width assignments for LLMs and outperforms KL-based heuristics on perplexity and zero-shot accuracy under bit-width budgets.

LASER: Loss-Aware Singular-value Decomposition and Rank Allocation for Efficient Low-Precision Vision-Language Models

cs.LG · 2026-05-30 · unverdicted · novelty 6.0

LASER introduces curvature-weighted SVD from second-order loss approximation and loss-aware rank allocation to compress VLMs, reporting over 2.3x decoding speedup under low-precision settings.

DREAM-S: Speculative Decoding with Searchable Drafting and Target-Aware Refinement for Multimodal Generation

cs.LG · 2026-05-30 · unverdicted · novelty 6.0

DREAM-S combines neural architecture search, target-aware supernet training, and attention-entropy-guided distillation to accelerate speculative decoding in VLMs, reporting up to 3.85x speedup over standard methods.

ProactiveLLM: Learning Active Interaction for Streaming Large Language Models

cs.CL · 2026-05-30 · unverdicted · novelty 6.0

ProactiveLLM enables active interaction in streaming LLMs by learning semantic sufficiency cues from partial inputs through mask-based modeling and synchronized privileged self-distillation without external supervision.

Quantized Reasoning Models Think They Need to Think Longer, but They Do Not

cs.LG · 2026-05-29 · unverdicted · novelty 6.0

Post-training quantization increases overthinking errors in reasoning models; a logit penalty on curated overthinking markers reduces CoT length 12-23% without accuracy loss.

Breaking Modality Heterogeneity in Low-Bit Quantization for Large Vision-Language Models

cs.CV · 2026-05-19 · unverdicted · novelty 6.0

SplitQ improves low-bit PTQ for VLMs by isolating modality-specific outlier channels via MOCD and applying dual-branch adaptive calibration via ACC, outperforming prior methods on six datasets across W4A8 to W3A2 settings.

OSCAR: Offline Spectral Covariance-Aware Rotation for 2-bit KV Cache Quantization

cs.LG · 2026-05-18 · unverdicted · novelty 6.0

OSCAR achieves near-BF16 accuracy for 2-bit KV cache quantization by using offline spectral covariance-aware rotations aligned with attention, plus a custom deployable INT2 kernel compatible with paged serving.

Not All Tasks Quantize Equally: Fisher-Guided Quantization for Visual Geometry Transformer

cs.CV · 2026-05-15 · unverdicted · novelty 6.0 · 2 refs

FGQ applies diagonal Fisher information to guide learnable affine transformations in PTQ for multi-task VGGT, yielding up to 39% relative gains over baselines at 4-bit quantization.

Theory-optimal Quantization Based on Flatness

cs.LG · 2026-05-11 · unverdicted · novelty 6.0

The paper introduces the Flatness metric, derives a theory-optimal quantization solution, and presents BDQ that uses bidirectional diagonal transformations to reduce outlier impact, achieving under 1% drop at W4A4 on LLaMA-3-8B.

OSAQ: Outlier Self-Absorption for Accurate Low-bit LLM Quantization

cs.LG · 2026-05-06 · unverdicted · novelty 6.0 · 2 refs

OSAQ suppresses weight outliers in LLMs via a closed-form additive transformation from the Hessian's stable null space, improving 2-bit quantization perplexity by over 40% versus vanilla GPTQ with no inference overhead.

Statistically-Lossless Quantization of Large Language Models

cs.LG · 2026-05-04 · unverdicted · novelty 6.0

SLQ achieves task-lossless LLM quantization below 4 bits per parameter and distribution-lossless at 5-6 bits on average, with 1.7-3.6x speedups over FP16.

Technical Report: Activation Residual Hessian Quantization (ARHQ) for Low-Bit LLM Quantization

cs.LG · 2026-04-30 · unverdicted · novelty 6.0

ARHQ isolates error-sensitive weight directions in LLMs via truncated SVD on the scaled matrix W G_x^{1/2} from activation residuals, improving SNR and preserving performance under aggressive low-bit quantization.

CoQuant: Joint Weight-Activation Subspace Projection for Mixed-Precision LLMs

cs.LG · 2026-04-29 · unverdicted · novelty 6.0

CoQuant selects optimal high-precision subspaces for mixed-precision LLM quantization via a closed-form weighted PCA that balances weight and activation covariances derived from expected output error.

QuantClaw: Precision Where It Matters for OpenClaw

cs.AI · 2026-04-24 · unverdicted · novelty 6.0

QuantClaw dynamically routes precision in agent workflows to cut cost by up to 21.4% and latency by 15.7% while keeping or improving task performance.

MCAP: Deployment-Time Layer Profiling for Memory-Constrained LLM Inference

cs.LG · 2026-04-22 · unverdicted · novelty 6.0

MCAP uses load-time Monte Carlo profiling to estimate layer importance, enabling dynamic quantization (W4A8 vs W4A16) and memory tiering (GPU/RAM/SSD) that delivers 1.5-1.8x higher decode throughput than llama-cpp Q4_0 on NVIDIA T4 while fitting models into previously infeasible memory budgets.

citing papers explorer

Showing 49 of 49 citing papers.

GPTQ-intrinsic LoRA: A Near-optimal Algorithm for Low-precision Quantization with Low-rank Adaptation cs.LG · 2026-05-31 · unverdicted · none · ref 47 · internal anchor
GPTQ-intrinsic LoRA augments GPTQ with intrinsic low-rank compensation via Hessian modification to achieve layer-wise reconstruction bounds that match information-theoretic lower bounds under structural assumptions.
Every Eval Ever: A Unifying Schema and Community Repository for AI Evaluation Results cs.AI · 2026-06-12 · unverdicted · none · ref 61 · internal anchor
Introduces the first community-governed unified JSON schema and crowdsourced repository for AI evaluation results, with converters and a database spanning 22,235 models and 2,273 benchmarks.
Rotation-Aligned Key Channel Pruning for Efficient Vision-Language Model Inference cs.CV · 2026-05-19 · unverdicted · none · ref 39 · internal anchor
RotateK uses online PCA-based rotation to align token-dependent key channel importance into a shared subspace, enabling accurate head-wise structured pruning and faster decoding in VLMs compared to prior token or channel methods.
When Quantization Is Free: An int4 KV Cache That Outruns fp16 on Apple Silicon cs.PF · 2026-05-07 · unverdicted · none · ref 21 · internal anchor
A single fused int4 KV cache kernel on Apple Silicon outperforms fp16 in latency with 3x memory compression and near-zero quality loss on tested models.
Coverage-Based Calibration for Post-Training Quantization via Weighted Set Cover over Outlier Channels cs.LG · 2026-04-27 · conditional · none · ref 16 · internal anchor
COVERCAL selects PTQ calibration samples via weighted set cover over outlier channels, with a stylized clipping model showing missed coverage upper-bounds surrogate loss, yielding gains over random and other baselines on LLaMA and Mistral models.
Variance Is Not Importance: Structural Analysis of Transformer Compressibility Across Model Scales cs.LG · 2026-04-22 · unverdicted · none · ref 5 · internal anchor
High-variance activation directions are uncorrelated with predictions, transformer blocks grow more linear with depth, and single-block linear replacement yields 34x compression on Mistral's final block at a 1.71 perplexity cost.
Four Over Six: More Accurate NVFP4 Quantization with Adaptive Block Scaling cs.CL · 2025-12-01 · conditional · none · ref 36 · internal anchor
Four Over Six adaptively scales blocks in NVFP4 quantization to smaller FP4 values, making representable value distributions more uniform and reducing quantization error especially for near-maximal values.
SharQ: Bridging Activation Sparsity and FP4 Quantization for LLM Inference cs.LG · 2026-06-25 · unverdicted · none · ref 27 · internal anchor
SharQ combines input-adaptive N:M sparsity and FP4 quantization via sparse backbone plus dense residual, recovering 43-63% of the NVFP4-to-FP16 accuracy gap on Llama and Qwen models without calibration or retraining.
MonaVec: A Training-Free Embedded Vector Search Kernel for Edge and Offline AI Systems cs.IR · 2026-06-17 · unverdicted · none · ref 16 · internal anchor
MonaVec provides a training-free 4-bit vector quantization and deterministic search kernel using Randomized Hadamard Transform and ChaCha20 seeding for embedded and offline use.
dMX: Differentiable Mixed-Precision Assignment for Low-Precision Floating-Point Formats cs.LG · 2026-06-02 · unverdicted · none · ref 19 · internal anchor
dMX is a differentiable mixed-precision framework that learns per-layer MXFP bit-width assignments for LLMs and outperforms KL-based heuristics on perplexity and zero-shot accuracy under bit-width budgets.
LASER: Loss-Aware Singular-value Decomposition and Rank Allocation for Efficient Low-Precision Vision-Language Models cs.LG · 2026-05-30 · unverdicted · none · ref 33 · internal anchor
LASER introduces curvature-weighted SVD from second-order loss approximation and loss-aware rank allocation to compress VLMs, reporting over 2.3x decoding speedup under low-precision settings.
DREAM-S: Speculative Decoding with Searchable Drafting and Target-Aware Refinement for Multimodal Generation cs.LG · 2026-05-30 · unverdicted · none · ref 44 · internal anchor
DREAM-S combines neural architecture search, target-aware supernet training, and attention-entropy-guided distillation to accelerate speculative decoding in VLMs, reporting up to 3.85x speedup over standard methods.
ProactiveLLM: Learning Active Interaction for Streaming Large Language Models cs.CL · 2026-05-30 · unverdicted · none · ref 88 · internal anchor
ProactiveLLM enables active interaction in streaming LLMs by learning semantic sufficiency cues from partial inputs through mask-based modeling and synchronized privileged self-distillation without external supervision.
Quantized Reasoning Models Think They Need to Think Longer, but They Do Not cs.LG · 2026-05-29 · unverdicted · none · ref 35 · internal anchor
Post-training quantization increases overthinking errors in reasoning models; a logit penalty on curated overthinking markers reduces CoT length 12-23% without accuracy loss.
Breaking Modality Heterogeneity in Low-Bit Quantization for Large Vision-Language Models cs.CV · 2026-05-19 · unverdicted · none · ref 32 · internal anchor
SplitQ improves low-bit PTQ for VLMs by isolating modality-specific outlier channels via MOCD and applying dual-branch adaptive calibration via ACC, outperforming prior methods on six datasets across W4A8 to W3A2 settings.
OSCAR: Offline Spectral Covariance-Aware Rotation for 2-bit KV Cache Quantization cs.LG · 2026-05-18 · unverdicted · none · ref 11 · internal anchor
OSCAR achieves near-BF16 accuracy for 2-bit KV cache quantization by using offline spectral covariance-aware rotations aligned with attention, plus a custom deployable INT2 kernel compatible with paged serving.
Not All Tasks Quantize Equally: Fisher-Guided Quantization for Visual Geometry Transformer cs.CV · 2026-05-15 · unverdicted · none · ref 5 · 2 links · internal anchor
FGQ applies diagonal Fisher information to guide learnable affine transformations in PTQ for multi-task VGGT, yielding up to 39% relative gains over baselines at 4-bit quantization.
Theory-optimal Quantization Based on Flatness cs.LG · 2026-05-11 · unverdicted · none · ref 10 · internal anchor
The paper introduces the Flatness metric, derives a theory-optimal quantization solution, and presents BDQ that uses bidirectional diagonal transformations to reduce outlier impact, achieving under 1% drop at W4A4 on LLaMA-3-8B.
OSAQ: Outlier Self-Absorption for Accurate Low-bit LLM Quantization cs.LG · 2026-05-06 · unverdicted · none · ref 12 · 2 links · internal anchor
OSAQ suppresses weight outliers in LLMs via a closed-form additive transformation from the Hessian's stable null space, improving 2-bit quantization perplexity by over 40% versus vanilla GPTQ with no inference overhead.
Statistically-Lossless Quantization of Large Language Models cs.LG · 2026-05-04 · unverdicted · none · ref 6 · internal anchor
SLQ achieves task-lossless LLM quantization below 4 bits per parameter and distribution-lossless at 5-6 bits on average, with 1.7-3.6x speedups over FP16.
Technical Report: Activation Residual Hessian Quantization (ARHQ) for Low-Bit LLM Quantization cs.LG · 2026-04-30 · unverdicted · none · ref 9 · internal anchor
ARHQ isolates error-sensitive weight directions in LLMs via truncated SVD on the scaled matrix W G_x^{1/2} from activation residuals, improving SNR and preserving performance under aggressive low-bit quantization.
CoQuant: Joint Weight-Activation Subspace Projection for Mixed-Precision LLMs cs.LG · 2026-04-29 · unverdicted · none · ref 7 · internal anchor
CoQuant selects optimal high-precision subspaces for mixed-precision LLM quantization via a closed-form weighted PCA that balances weight and activation covariances derived from expected output error.
QuantClaw: Precision Where It Matters for OpenClaw cs.AI · 2026-04-24 · unverdicted · none · ref 25 · internal anchor
QuantClaw dynamically routes precision in agent workflows to cut cost by up to 21.4% and latency by 15.7% while keeping or improving task performance.
MCAP: Deployment-Time Layer Profiling for Memory-Constrained LLM Inference cs.LG · 2026-04-22 · unverdicted · none · ref 7 · internal anchor
MCAP uses load-time Monte Carlo profiling to estimate layer importance, enabling dynamic quantization (W4A8 vs W4A16) and memory tiering (GPU/RAM/SSD) that delivers 1.5-1.8x higher decode throughput than llama-cpp Q4_0 on NVIDIA T4 while fitting models into previously infeasible memory budgets.
From Signal Degradation to Computation Collapse: Uncovering the Two Failure Modes of LLM Quantization cs.CL · 2026-04-21 · unverdicted · none · ref 29 · internal anchor
LLM 2-bit quantization fails via either cumulative signal degradation or early computation collapse in key components.
GSQ: Highly-Accurate Low-Precision Scalar Quantization for LLMs via Gumbel-Softmax Sampling cs.CL · 2026-04-20 · unverdicted · none · ref 23 · 2 links · internal anchor
GSQ uses Gumbel-Softmax to optimize scalar quantization grids for LLMs, closing most of the accuracy gap to vector methods like QTIP at 2-3 bits per parameter while using symmetric scalar grids compatible with existing kernels.
Rethinking Residual Errors in Compensation-based LLM Quantization cs.LG · 2026-04-09 · conditional · none · ref 11 · internal anchor
Redefining residual errors to include compensation-aware discrepancies and realigning calibration to full-precision outputs improves GPTQ and GPTAQ performance on LLMs.
AdaHOP: Fast and Accurate Low-Precision Training via Outlier-Pattern-Aware Rotation cs.LG · 2026-04-02 · unverdicted · none · ref 23 · internal anchor
AdaHOP applies pattern-aware Hadamard transforms and selective outlier extraction to enable from-scratch MXFP4 training of LLMs at BF16 quality with up to 3.6X memory compression and 1.46X speedup.
CoreQ: Learning-Free Mismatch Correction and Successive Rounding for Quantization cs.LG · 2026-02-05 · unverdicted · none · ref 13 · internal anchor
CoreQ delivers adaptive mismatch correction via closed-form geometric coefficient and successive rounding to improve PTQ accuracy for large language models.
OmniZip: Audio-Guided Dynamic Token Compression for Fast Omnimodal Large Language Models cs.CV · 2025-11-18 · conditional · none · ref 30 · internal anchor
OmniZip introduces an audio-guided dynamic token compression framework that achieves 3.42X inference speedup and 1.4X memory reduction for omnimodal LLMs without any training.
SpecQuant: Spectral Decomposition and Adaptive Truncation for Ultra-Low-Bit LLMs Quantization cs.LG · 2025-11-11 · unverdicted · none · ref 3 · internal anchor
SpecQuant uses outlier smoothing into weights followed by channel-wise low-frequency Fourier truncation to achieve 4-bit quantization of LLaMA-3 8B with only 1.5% zero-shot accuracy loss versus full precision.
BTC-LLM: Efficient Sub-1-Bit LLM Quantization via Learnable Transformation and Binary Codebook cs.LG · 2025-05-24 · conditional · none · ref 20 · internal anchor
BTC-LLM uses a binary codebook for pattern clustering and a learnable transformation to achieve 0.7-1.11 bit LLM quantization while limiting accuracy loss to a few percent on LLaMA and Qwen models.
SigmaScale: LLM Compression with SVD-based Low-Rank Decomposition and Learned Scaling Matrices cs.CL · 2026-06-05 · unverdicted · none · ref 33 · internal anchor
Learned diagonal scaling matrices optimized with activation-aware loss reduce effective rank in LLM weight matrices and yield competitive perplexity and zero-shot results versus prior SVD methods on Llama 3.1 8B and Qwen3-8B.
MorphoQuant: Modality-Aware Quantization for Omni-modal Large Language Models cs.CV · 2026-06-03 · unverdicted · none · ref 34 · internal anchor
MorphoQuant proposes DABC and MDQFO for 4-bit quantization of omni-modal LLMs, claiming superior performance over SOTA W4A4 methods and even W4A16 baselines on benchmarks like ScienceQA.
Does Compression Preserve Uncertainty? A Unified Benchmark for Quantized and Sparse LLMs via Conformal Prediction cs.AI · 2026-06-01 · unverdicted · none · ref 31 · internal anchor
Compression of LLMs often decouples accuracy from uncertainty, with larger models absorbing the effect better and inflation occurring in a threshold-like manner.
QVGGT: Post-Training Quantized Visual Geometry Grounded Transformer cs.CV · 2026-05-29 · unverdicted · none · ref 24 · internal anchor
QVGGT uses per-block mixed-precision analysis, outlier token filtering with PCA compensation, and task-aware scale search to achieve near-lossless W4A16 quantization of VGGT with 3-4.9x memory savings and 2.8x speedup.
MX-SAFE: Versatile Inference- and Training-Proof Microscaling Format with On-the-Fly Exponent and Mantissa Bit Allocation cs.AR · 2026-05-23 · unverdicted · none · ref 1 · internal anchor
MX-SAFE proposes a versatile MXFP format with on-the-fly bit allocation and tile-based design that reports small accuracy gains over prior MX formats and an accelerator using 24.9% less energy than BF16 while matching its accuracy.
MARR: Module-Adaptive Residual Reconstruction for Low-Bit Post-Training Quantization cs.LG · 2026-05-18 · unverdicted · none · ref 24 · internal anchor
MARR uses per-module adaptive residual scaling updated by PID feedback to balance error correction against Hessian-approximation bias in low-bit PTQ.
31.1 A 14.08-to-135.69Token/s ReRAM-on-Logic Stacked Outlier-Free Large-Language-Model Accelerator with Block-Clustered Weight-Compression and Adaptive Parallel-Speculative-Decoding cs.AR · 2026-05-10 · unverdicted · none · ref 12 · internal anchor
A ReRAM-on-logic stacked chip delivers 14.08-135.69 tokens/s LLM inference with block-clustered compression and adaptive parallel speculative decoding, yielding 4.46-7.17x speedup over standard methods.
Fitting Is Not Enough: Smoothness in Extremely Quantized LLMs cs.CL · 2026-05-09 · unverdicted · none · ref 25 · 2 links · internal anchor
Extremely quantized LLMs exhibit systematic smoothness degradation that reduces effective token candidates and degrades generation; a smoothness-preserving principle in PTQ and QAT delivers gains beyond numerical accuracy.
Revisiting the Effectiveness of LLM Pruning for Test-Time Scaling cs.AI · 2026-04-28 · unverdicted · none · ref 6 · 2 links · internal anchor
Unstructured pruning augments test-time scaling performance in reasoning LLMs on four benchmarks, outperforming structured pruning and at times the unpruned models.
ConFu: Contemplate the Future for Better Speculative Sampling cs.CL · 2026-03-09 · unverdicted · none · ref 13 · internal anchor
ConFu boosts speculative decoding acceptance rates 8-20% over EAGLE-3 by letting draft models use contemplate tokens and MoE to anticipate future generation direction.
High-Rate Quantized Matrix Multiplication I cs.IT · 2026-01-23 · unverdicted · none · ref 18 · internal anchor
High-rate quantization theory yields accurate approximations for the distortion of absmax INT and FP schemes in generic weight-plus-activation matrix multiplication.
MixLLM: LLM Quantization with Global Mixed-precision between Output-features and Highly-efficient System Design cs.LG · 2024-12-19 · unverdicted · none · ref 28 · internal anchor
MixLLM uses global output-feature importance to set mixed bit-widths for LLM quantization and adds two-step dequantization plus software pipelining for system efficiency.
GoQuant: Geometric Orthogonal Residual Projection for Multiplier-Free Power-of-Two Transformer Quantization cs.LG · 2026-05-25 · unverdicted · none · ref 12 · internal anchor
GoQuant formulates quantization as dual-basis geometric projection to create higher-resolution residual lattices for 3-bit PoT transformer quantization using only shift-and-add, reporting 6.10 perplexity on LLaMA-2-7B.
Influence-Inspired Spectral Rotations for Extreme Low-Bit LLM Quantization cs.LG · 2026-05-24 · unverdicted · none · ref 4 · internal anchor
A WHT rotation plus per-coordinate activation-energy rescaling before auto-round quantization lowers WikiText-2 perplexity 15-58% versus vanilla auto-round at W2A16 on models from 135M to 1.5B parameters.
On the Quantization Robustness of Diffusion Language Models in Coding Benchmarks cs.LG · 2026-04-22 · unverdicted · none · ref 19 · internal anchor
Diffusion coding model CoDA shows smaller accuracy drops than Qwen3-1.7B under 2-4 bit quantization on HumanEval and MBPP.
A Survey on Efficient Inference for Large Language Models cs.CL · 2024-04-22 · accept · none · ref 218 · internal anchor
The paper surveys techniques to speed up and reduce the resource needs of LLM inference, organized by data-level, model-level, and system-level changes, with comparative experiments on representative methods.
You Had One Job: Per-Task Quantization Using LLMs' Hidden Representations cs.CL · 2025-11-09 · unreviewed · ref 41 · internal anchor
