GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers

Dan Alistarh, Elias Frantar, Saleh Ashkboos, Torsten Hoefler · 2022 · cs.LG · arXiv 2210.17323

Mixed citation behavior. The most common role is background (68% of classified citations).

Cited by 228 Pith papers.

abstract

Generative Pre-trained Transformer models, known as GPT or OPT, set themselves apart through breakthrough performance across complex language modelling tasks, but also by their extremely high computational and storage costs. Specifically, due to their massive size, even inference for large, highly-accurate GPT models may require multiple performant GPUs, which limits the usability of such models. While there is emerging work on relieving this pressure via model compression, the applicability and performance of existing compression techniques is limited by the scale and complexity of GPT models. In this paper, we address this challenge, and propose GPTQ, a new one-shot weight quantization method based on approximate second-order information, that is both highly-accurate and highly-efficient. Specifically, GPTQ can quantize GPT models with 175 billion parameters in approximately four GPU hours, reducing the bitwidth down to 3 or 4 bits per weight, with negligible accuracy degradation relative to the uncompressed baseline. Our method more than doubles the compression gains relative to previously-proposed one-shot quantization methods, preserving accuracy, allowing us for the first time to execute an 175 billion-parameter model inside a single GPU for generative inference. Moreover, we also show that our method can still provide reasonable accuracy in the extreme quantization regime, in which weights are quantized to 2-bit or even ternary quantization levels. We show experimentally that these improvements can be leveraged for end-to-end inference speedups over FP16, of around 3.25x when using high-end GPUs (NVIDIA A100) and 4.5x when using more cost-effective ones (NVIDIA A6000). The implementation is available at https://github.com/IST-DASLab/gptq.
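The storage claim in the abstract can be checked with simple arithmetic. The sketch below is illustrative only: it counts the bytes needed for the weights alone and ignores quantization metadata (scales, zero points), activations, and the KV cache, so real footprints will be somewhat larger. The function name `weight_footprint_gb` is hypothetical and not taken from the GPTQ repository.

```python
def weight_footprint_gb(num_params: float, bits_per_weight: float) -> float:
    """Storage for the weights alone, in gigabytes (1 GB = 1e9 bytes).

    Assumption: every parameter uses exactly `bits_per_weight` bits and
    no per-group scales or zero points are counted.
    """
    return num_params * bits_per_weight / 8 / 1e9


PARAMS = 175e9  # parameter count quoted in the abstract

# FP16 baseline versus the 4-bit and 3-bit settings mentioned in the abstract.
footprints = {bits: weight_footprint_gb(PARAMS, bits) for bits in (16, 4, 3)}

for bits, gb in footprints.items():
    print(f"{bits:>2}-bit weights: {gb:6.1f} GB")
```

Under these assumptions the FP16 weights need about 350 GB, while 3-bit weights need roughly 66 GB, which is consistent with the abstract's statement that a 175 billion-parameter model can run inside a single GPU once quantized.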

citation-role summary

background 20 · baseline 3 · method 3 · dataset 1 · other 1

citation-polarity summary

background 19 · baseline 3 · use method 3 · unclear 2 · use dataset 1

authors

Dan Alistarh, Elias Frantar, Saleh Ashkboos, Torsten Hoefler

representative citing papers

Do Activation Monitors Survive Model Updates? Benchmarking, Predicting, and Repairing Activation-Monitor Staleness

cs.LG · 2026-06-14 · unverdicted · novelty 8.0

Fine-tuning updates frequently render activation monitors for language model safety stale, while quantization does not; the degradation is predictable and repairable via label-free realignment.

Alignment Collapse Under KV Cache Quantization: Diagnosis and Mitigation

cs.LG · 2026-06-01 · unverdicted · novelty 8.0

KV cache quantization silently erodes LLM safety alignment via vulnerable low-dimensional subspaces, diagnosed by Per-Channel Reduction into three failure modes and mitigated training-free with up to 97% recovery.

GPTQ-intrinsic LoRA: A Near-optimal Algorithm for Low-precision Quantization with Low-rank Adaptation

cs.LG · 2026-05-31 · unverdicted · novelty 8.0

GPTQ-intrinsic LoRA augments GPTQ with intrinsic low-rank compensation via Hessian modification to achieve layer-wise reconstruction bounds that match information-theoretic lower bounds under structural assumptions.

HodgeCover: Higher-Order Topological Coverage Drives Compression of Sparse Mixture-of-Experts

cs.LG · 2026-05-13 · unverdicted · novelty 8.0

HodgeCover isolates the harmonic kernel of a simplicial Laplacian on an expert 2-complex to identify irreducible merge cycles and selects experts for aggressive compression, matching or exceeding baselines on open-weight MoE models.

OrbitQuant: Data-Agnostic Quantization for Image and Video Diffusion Transformers

cs.CV · 2026-07-02 · unverdicted · novelty 7.0

OrbitQuant is a data-agnostic PTQ technique for DiTs that uses RPBH rotation in a normalized basis to enable a single codebook across all inputs, achieving SOTA low-bit performance on FLUX.1, CogVideoX and similar models.

Lynx: Progressive Speculative Quantization for accelerating KV Transfer in Long-Context Inference

cs.DC · 2026-07-02 · unverdicted · novelty 7.0

Lynx partitions KV cache bits into anchor and residual streams for progressive transfer, enabling speculative decoding on partial data followed by verification to match BF16 accuracy at 4-bit-like TTFT.

Neural Network Quantization by Learning Low-Loss Subspaces

cs.CV · 2026-06-23 · unverdicted · novelty 7.0

Learning quantization-aware linear paths in weight space yields a midpoint whose direct quantization matches quantization-aware training performance without using straight-through estimators.

Mix-QVLA: Task-Evidence-Aware Mixed-Precision Quantization of Vision-Language-Action Models

cs.CV · 2026-06-17 · unverdicted · novelty 7.0

Mix-QVLA is a task-evidence-aware mixed-precision PTQ framework for VLA models that preserves task-relevant evidence via evidence-mass and attribution-distribution metrics to guide bit allocation under memory and BitOps constraints.

APEX4: Efficient Pure W4A4 LLM Inference via Intra-SM Compute Rebalancing

cs.DC · 2026-06-07 · conditional · novelty 7.0 · 2 refs

APEX4 co-designs pure INT4 GEMM kernels with ρ-aware granularity adaptation to deliver up to 2.09× end-to-end speedup on GPUs with low ρ while keeping LLaMA-2-70B perplexity within 0.63 of FP16.

Calibration Data Trade-offs Across Capability Dimensions: Why Multi-Source Mixing Matters for High-Sparsity LLM Pruning

cs.LG · 2026-06-02 · unverdicted · novelty 7.0

Analysis of 15 calibration sources shows opposite-sign Spearman correlations between perplexity and retention across General vs. Math/Code dimensions in LLM pruning, and multi-source mixing via IGSP raises total retention from 40-50% to 58.8%.

From Layers to Submodules: Rethinking Granularity in Replacement-Based LLM Compression

cs.CL · 2026-06-01 · unverdicted · novelty 7.0

SubFit enables better LLM compression by fitting residual bypasses to non-contiguously selected submodules, outperforming layer-granularity baselines in accuracy-perplexity trade-offs at 12.5-37.5% sparsity.

Not All Errors Are Equal: A Systematic Study of Error Propagation in Large Language Model Inference

cs.DC · 2026-06-01 · unverdicted · novelty 7.0

A new fault-injection framework enables a systematic empirical study that produces 17 takeaways on error propagation in LLM inference and four software-only mitigation directions.

Ω-QVLA: Robust Quantization for Vision-Language-Action Models via Composite Rotation and Per-step Scaling

cs.CV · 2026-05-27 · unverdicted · novelty 7.0

Omega-QVLA is a post-training quantization framework achieving uniform W4A4 for VLA models' LLM backbone and DiT action head via composite SVD-Hadamard rotation and per-step scaling, matching FP16 success rates on LIBERO.

AIGaitor: Privacy-preserving and cloud-free motion analysis for everyone, using edge computing

cs.CV · 2026-05-20 · unverdicted · novelty 7.0 · 2 refs

AIGaitor is claimed to be the first end-to-end on-device monocular motion-capture and deep-learning gait analysis pipeline demonstrated on consumer smartphones.

Llamas on the Web: Memory-Efficient, Performance-Portable, and Multi-Precision LLM Inference with WebGPU

cs.DC · 2026-05-20 · conditional · novelty 7.0

LlamaWeb is a WebGPU backend for llama.cpp that uses static memory planning, tunable kernels, and templated multi-precision support to cut memory use by 29-33% and raise decode throughput by 45-69% versus prior browser frameworks on tested hardware.

LongLive-2.0: An NVFP4 Parallel Infrastructure for Long Video Generation

cs.CV · 2026-05-18 · unverdicted · novelty 7.0

LongLive-2.0 delivers an NVFP4 parallel infrastructure that enables direct training of long multi-shot autoregressive diffusion video models and achieves up to 2.15x training and 1.84x inference speedups on Blackwell and other GPUs.

When Bits Break Recourse: Counterfactual-Faithful Quantization

cs.LG · 2026-05-16 · unverdicted · novelty 7.0

CFQ trains quantizer parameters and mixed-precision allocation to preserve counterfactual recourse validity, cost, and direction on Adult, German Credit, and COMPAS while matching accuracy of standard quantizers.

Widening the Gap: Exploiting LLM Quantization via Outlier Injection

cs.LG · 2026-05-14 · conditional · novelty 7.0

The paper introduces an outlier-injection attack that induces targeted weight collapse in LLMs under advanced quantization schemes including AWQ, GPTQ, and GGUF I-quants.

Knowledge Beyond Language: Bridging the Gap in Multilingual Machine Unlearning Evaluation

cs.CL · 2026-05-14 · unverdicted · novelty 7.0

New metrics KSS and KPS are introduced to evaluate multilingual machine unlearning quality and cross-language consistency in LLMs, addressing limitations of single-language evaluation protocols.

Multi-Scale Dequant: Eliminating Dequantization Bottleneck via Activation Decomposition for Efficient LLM Inference

stat.ML · 2026-05-13 · unverdicted · novelty 7.0

MSD eliminates dequantization from the GEMM path by decomposing BF16 activations into multiple low-precision parts that multiply directly with INT8 or MXFP4 weights, achieving near-16 effective bits for INT8 and 6.6 for MXFP4 with reduced HBM traffic.

LoopQ: Quantization for Recursive Transformers

cs.LG · 2026-05-08 · unverdicted · novelty 7.0

LoopQ provides a loop-aware PTQ framework for recursive Transformers that mitigates distribution shift, state reuse, and recursive error accumulation, yielding 68.8% higher average accuracy and 87.7% lower perplexity under W4A4 versus static baselines.

Layer Collapse in Diffusion Language Models

cs.LG · 2026-05-07 · unverdicted · novelty 7.0 · 2 refs

Diffusion language models develop early-layer collapse around an indispensable super-outlier due to overtraining, resulting in higher compressibility and reversed optimal sparsity patterns versus autoregressive models.

When Quantization Is Free: An int4 KV Cache That Outruns fp16 on Apple Silicon

cs.PF · 2026-05-07 · unverdicted · novelty 7.0

A single fused int4 KV cache kernel on Apple Silicon outperforms fp16 in latency with 3x memory compression and near-zero quality loss on tested models.

Quantamination: Dynamic Quantization Leaks Your Data Across the Batch

cs.CR · 2026-04-29 · conditional · novelty 7.0

Dynamic quantization creates side channels allowing partial or full recovery of other users' batched data in at least four popular ML frameworks.

citing papers explorer

Showing 50 of 228 citing papers.

Do Activation Monitors Survive Model Updates? Benchmarking, Predicting, and Repairing Activation-Monitor Staleness cs.LG · 2026-06-14 · unverdicted · none · ref 84 · internal anchor
Fine-tuning updates frequently render activation monitors for language model safety stale, while quantization does not; the degradation is predictable and repairable via label-free realignment.
Alignment Collapse Under KV Cache Quantization: Diagnosis and Mitigation cs.LG · 2026-06-01 · unverdicted · none · ref 12 · internal anchor
KV cache quantization silently erodes LLM safety alignment via vulnerable low-dimensional subspaces, diagnosed by Per-Channel Reduction into three failure modes and mitigated training-free with up to 97% recovery.
GPTQ-intrinsic LoRA: A Near-optimal Algorithm for Low-precision Quantization with Low-rank Adaptation cs.LG · 2026-05-31 · unverdicted · none · ref 23 · internal anchor
GPTQ-intrinsic LoRA augments GPTQ with intrinsic low-rank compensation via Hessian modification to achieve layer-wise reconstruction bounds that match information-theoretic lower bounds under structural assumptions.
HodgeCover: Higher-Order Topological Coverage Drives Compression of Sparse Mixture-of-Experts cs.LG · 2026-05-13 · unverdicted · none · ref 16 · internal anchor
HodgeCover isolates the harmonic kernel of a simplicial Laplacian on an expert 2-complex to identify irreducible merge cycles and selects experts for aggressive compression, matching or exceeding baselines on open-weight MoE models.
OrbitQuant: Data-Agnostic Quantization for Image and Video Diffusion Transformers cs.CV · 2026-07-02 · unverdicted · none · ref 12 · internal anchor
OrbitQuant is a data-agnostic PTQ technique for DiTs that uses RPBH rotation in a normalized basis to enable a single codebook across all inputs, achieving SOTA low-bit performance on FLUX.1, CogVideoX and similar models.
Lynx: Progressive Speculative Quantization for accelerating KV Transfer in Long-Context Inference cs.DC · 2026-07-02 · unverdicted · none · ref 13 · internal anchor
Lynx partitions KV cache bits into anchor and residual streams for progressive transfer, enabling speculative decoding on partial data followed by verification to match BF16 accuracy at 4-bit-like TTFT.
Neural Network Quantization by Learning Low-Loss Subspaces cs.CV · 2026-06-23 · unverdicted · none · ref 14 · internal anchor
Learning quantization-aware linear paths in weight space yields a midpoint whose direct quantization matches quantization-aware training performance without using straight-through estimators.
Mix-QVLA: Task-Evidence-Aware Mixed-Precision Quantization of Vision-Language-Action Models cs.CV · 2026-06-17 · unverdicted · none · ref 9 · internal anchor
Mix-QVLA is a task-evidence-aware mixed-precision PTQ framework for VLA models that preserves task-relevant evidence via evidence-mass and attribution-distribution metrics to guide bit allocation under memory and BitOps constraints.
APEX4: Efficient Pure W4A4 LLM Inference via Intra-SM Compute Rebalancing cs.DC · 2026-06-07 · conditional · none · ref 13 · 2 links · internal anchor
APEX4 co-designs pure INT4 GEMM kernels with ρ-aware granularity adaptation to deliver up to 2.09× end-to-end speedup on GPUs with low ρ while keeping LLaMA-2-70B perplexity within 0.63 of FP16.
Calibration Data Trade-offs Across Capability Dimensions: Why Multi-Source Mixing Matters for High-Sparsity LLM Pruning cs.LG · 2026-06-02 · unverdicted · none · ref 9 · internal anchor
Analysis of 15 calibration sources shows opposite-sign Spearman correlations between perplexity and retention across General vs. Math/Code dimensions in LLM pruning, and multi-source mixing via IGSP raises total retention from 40-50% to 58.8%.
From Layers to Submodules: Rethinking Granularity in Replacement-Based LLM Compression cs.CL · 2026-06-01 · unverdicted · none · ref 82 · internal anchor
SubFit enables better LLM compression by fitting residual bypasses to non-contiguously selected submodules, outperforming layer-granularity baselines in accuracy-perplexity trade-offs at 12.5-37.5% sparsity.
Not All Errors Are Equal: A Systematic Study of Error Propagation in Large Language Model Inference cs.DC · 2026-06-01 · unverdicted · none · ref 26 · internal anchor
A new fault-injection framework enables a systematic empirical study that produces 17 takeaways on error propagation in LLM inference and four software-only mitigation directions.
Ω-QVLA: Robust Quantization for Vision-Language-Action Models via Composite Rotation and Per-step Scaling cs.CV · 2026-05-27 · unverdicted · none · ref 3 · internal anchor
Omega-QVLA is a post-training quantization framework achieving uniform W4A4 for VLA models' LLM backbone and DiT action head via composite SVD-Hadamard rotation and per-step scaling, matching FP16 success rates on LIBERO.
AIGaitor: Privacy-preserving and cloud-free motion analysis for everyone, using edge computing cs.CV · 2026-05-20 · unverdicted · none · ref 83 · 2 links · internal anchor
AIGaitor is claimed to be the first end-to-end on-device monocular motion-capture and deep-learning gait analysis pipeline demonstrated on consumer smartphones.
Llamas on the Web: Memory-Efficient, Performance-Portable, and Multi-Precision LLM Inference with WebGPU cs.DC · 2026-05-20 · conditional · none · ref 14 · internal anchor
LlamaWeb is a WebGPU backend for llama.cpp that uses static memory planning, tunable kernels, and templated multi-precision support to cut memory use by 29-33% and raise decode throughput by 45-69% versus prior browser frameworks on tested hardware.
LongLive-2.0: An NVFP4 Parallel Infrastructure for Long Video Generation cs.CV · 2026-05-18 · unverdicted · none · ref 20 · internal anchor
LongLive-2.0 delivers an NVFP4 parallel infrastructure that enables direct training of long multi-shot autoregressive diffusion video models and achieves up to 2.15x training and 1.84x inference speedups on Blackwell and other GPUs.
When Bits Break Recourse: Counterfactual-Faithful Quantization cs.LG · 2026-05-16 · unverdicted · none · ref 10 · internal anchor
CFQ trains quantizer parameters and mixed-precision allocation to preserve counterfactual recourse validity, cost, and direction on Adult, German Credit, and COMPAS while matching accuracy of standard quantizers.
Widening the Gap: Exploiting LLM Quantization via Outlier Injection cs.LG · 2026-05-14 · conditional · none · ref 10 · internal anchor
The paper introduces an outlier-injection attack that induces targeted weight collapse in LLMs under advanced quantization schemes including AWQ, GPTQ, and GGUF I-quants.
Knowledge Beyond Language: Bridging the Gap in Multilingual Machine Unlearning Evaluation cs.CL · 2026-05-14 · unverdicted · none · ref 34 · internal anchor
New metrics KSS and KPS are introduced to evaluate multilingual machine unlearning quality and cross-language consistency in LLMs, addressing limitations of single-language evaluation protocols.
Multi-Scale Dequant: Eliminating Dequantization Bottleneck via Activation Decomposition for Efficient LLM Inference stat.ML · 2026-05-13 · unverdicted · none · ref 3 · internal anchor
MSD eliminates dequantization from the GEMM path by decomposing BF16 activations into multiple low-precision parts that multiply directly with INT8 or MXFP4 weights, achieving near-16 effective bits for INT8 and 6.6 for MXFP4 with reduced HBM traffic.
LoopQ: Quantization for Recursive Transformers cs.LG · 2026-05-08 · unverdicted · none · ref 9 · internal anchor
LoopQ provides a loop-aware PTQ framework for recursive Transformers that mitigates distribution shift, state reuse, and recursive error accumulation, yielding 68.8% higher average accuracy and 87.7% lower perplexity under W4A4 versus static baselines.
Layer Collapse in Diffusion Language Models cs.LG · 2026-05-07 · unverdicted · none · ref 7 · 2 links · internal anchor
Diffusion language models develop early-layer collapse around an indispensable super-outlier due to overtraining, resulting in higher compressibility and reversed optimal sparsity patterns versus autoregressive models.
When Quantization Is Free: An int4 KV Cache That Outruns fp16 on Apple Silicon cs.PF · 2026-05-07 · unverdicted · none · ref 15 · internal anchor
A single fused int4 KV cache kernel on Apple Silicon outperforms fp16 in latency with 3x memory compression and near-zero quality loss on tested models.
Quantamination: Dynamic Quantization Leaks Your Data Across the Batch cs.CR · 2026-04-29 · conditional · none · ref 9 · internal anchor
Dynamic quantization creates side channels allowing partial or full recovery of other users' batched data in at least four popular ML frameworks.
Variance Is Not Importance: Structural Analysis of Transformer Compressibility Across Model Scales cs.LG · 2026-04-22 · unverdicted · none · ref 1 · internal anchor
High-variance activation directions are uncorrelated with predictions, transformer blocks grow more linear with depth, and single-block linear replacement yields 34x compression on Mistral's final block at a 1.71 perplexity cost.
When W4A4 Breaks Camouflaged Object Detection: Token-Group Dual-Constraint Activation Quantization cs.CV · 2026-04-18 · unverdicted · none · ref 8 · internal anchor
COD-TDQ uses token-group scaling and dual-constraint projection to fix 4-bit activation quantization for camouflaged object detection, delivering more than 0.12 higher Sα scores than prior methods on four benchmarks without retraining.
When Flat Minima Fail: Characterizing INT4 Quantization Collapse After FP32 Convergence cs.LG · 2026-04-16 · conditional · none · ref 5 · internal anchor
FP32-converged language models enter a post-convergence phase where INT4 quantization error explodes while FP32 perplexity remains stable, with onset tied to fine convergence rather than learning rate decay.
Hidden Reliability Risks in Large Language Models: Systematic Identification of Precision-Induced Output Disagreements cs.AI · 2026-04-02 · unverdicted · none · ref 14 · internal anchor
PrecisionDiff is a differential testing framework that uncovers widespread precision-induced behavioral disagreements in aligned LLMs, including safety-critical jailbreak divergences across precision formats.
ENEC: A Lossless AI Model Compression Method Enabling Fast Inference on Ascend NPUs cs.AR · 2026-03-28 · unverdicted · none · ref 20 · internal anchor
ENEC delivers 3.43X higher throughput than DietGPU and 1.12X better compression ratio than nvCOMP for lossless model weight compression on Ascend NPUs, yielding up to 6.3X end-to-end inference speedup.
Why Smaller Is Slower? Dimensional Misalignment in Compressed LLMs cs.DC · 2026-03-05 · unverdicted · none · ref 6 · internal anchor
Dimensional misalignment slows compressed LLMs on GPUs; GAC uses knapsack optimization to achieve full alignment and up to 1.5x speedup on Llama-3-8B while preserving quality.
QuantVLA: Scale-Calibrated Post-Training Quantization for Vision-Language-Action Models cs.LG · 2026-02-23 · unverdicted · none · ref 10 · internal anchor
QuantVLA is the first post-training quantization framework for VLA models that quantizes the diffusion transformer action head and reports higher task success rates than full-precision baselines with roughly 70% memory savings on the quantized components.
CORP: Closed-Form One-shot Representation-Preserving Structured Pruning for Transformers cs.LG · 2026-02-05 · unverdicted · none · ref 4 · internal anchor
CORP performs one-shot structured pruning of Transformers by modeling removed components as affine functions of retained ones and solving closed-form ridge regressions on calibration data to fold compensation into weights, retaining 83.27% Top-1 accuracy on DeiT-Huge after 50% pruning.
Vec-LUT: Vector Table Lookup for Parallel Ultra-Low-Bit LLM Inference on Edge Devices cs.DC · 2025-12-06 · conditional · none · ref 15 · internal anchor
Vec-LUT delivers up to 4.2x speedup over prior LUT methods for parallel ultra-low-bit LLM inference on edge devices by unifying lookups across tokens and adding cache-aware tensor layouts.
Four Over Six: More Accurate NVFP4 Quantization with Adaptive Block Scaling cs.CL · 2025-12-01 · conditional · none · ref 13 · internal anchor
Four Over Six adaptively scales blocks in NVFP4 quantization to smaller FP4 values, making representable value distributions more uniform and reducing quantization error especially for near-maximal values.
Decomposed Trust: Privacy, Adversarial Robustness, Ethics, and Fairness in Low-Rank LLMs cs.LG · 2025-11-27 · unverdicted · none · ref 3 · internal anchor
Low-rank compression preserves training-data privacy and improves adversarial robustness but weakens personal-information protection, reduces ethical behavior in zero-shot use, and harms fairness.
SpinQuant: LLM quantization with learned rotations cs.LG · 2024-05-26 · conditional · none · ref 5 · internal anchor
SpinQuant learns optimal rotations to enable accurate 4-bit quantization of LLM weights, activations, and KV cache, reducing the zero-shot gap to full precision to 2.9 points on LLaMA-2 7B.
Medusa: Simple LLM Inference Acceleration Framework with Multiple Decoding Heads cs.LG · 2024-01-19 · conditional · none · ref 73 · internal anchor
Medusa augments LLMs with multiple decoding heads and tree-based attention to predict and verify several tokens in parallel, yielding 2.2-3.6x inference speedup via two fine-tuning regimes.
QLoRA: Efficient Finetuning of Quantized LLMs cs.LG · 2023-05-23 · conditional · none · ref 18 · internal anchor
QLoRA finetunes 4-bit quantized LLMs via LoRA adapters to match full-precision performance while using far less memory, enabling 65B-scale training on single GPUs and producing Guanaco models near ChatGPT level.
SAB-LVLM: Significance-Aware Binarization for Large Vision-Language Models cs.CV · 2026-07-02 · unverdicted · none · ref 4 · internal anchor
SAB-LVLM proposes a significance-aware binarization technique for LVLMs that uses modality-guided Hessian-based maps to reweight binarization errors and improve performance under 1-bit constraints.
MxGLUT: A Reconfigurable LUT-Centric Broadcast Dataflow Accelerator for Mixed-Precision GEMM cs.AR · 2026-07-02 · unverdicted · none · ref 1 · internal anchor
MxGLUT introduces a reconfigurable LUT-centric broadcast dataflow accelerator with mixed-precision LUT-based PEs that unifies FP8-INT4 and FP8-FP8 GEMM without separate FP datapaths, reporting up to 2.16x prefill speedup and 0.492 TFLOPS/mm² area efficiency in 28nm synthesis.
OmniPilot: An Uncertainty-Aware LLM Inference Advisor for Heterogeneous GPU Clusters cs.DC · 2026-07-02 · unverdicted · none · ref 7 · internal anchor
OmniPilot combines conformal quantile regression with OOD detection to rank LLM serving configurations on mixed GPUs, reporting 6.2% MAPE throughput prediction and 95% top-1 accuracy on 460 benchmark runs while abstaining on unsupported cases.
When AI Reviews Its Own Code: Recursive Self-Training Collapse in Code LLMs cs.SE · 2026-06-26 · unverdicted · none · ref 99 · internal anchor
Experiments across code LLMs show no-review collapses fastest, human-gated filters slow collapse, and AI self-gates lose effect over time, degenerating to ungated self-training under self-confirming acceptance as proven via gated distributional reweighting and spectral analysis.
SharQ: Bridging Activation Sparsity and FP4 Quantization for LLM Inference cs.LG · 2026-06-25 · unverdicted · none · ref 12 · internal anchor
SharQ combines input-adaptive N:M sparsity and FP4 quantization via sparse backbone plus dense residual, recovering 43-63% of the NVFP4-to-FP16 accuracy gap on Llama and Qwen models without calibration or retraining.
BitNet Text Embeddings cs.CL · 2026-06-24 · unverdicted · none · ref 19 · internal anchor
BITEMBED converts LLM backbones to ternary BitNet-style encoders, adapts them with contrastive pre-training and teacher distillation, and produces text embeddings at multiple precisions that perform comparably to full-precision baselines on MMTEB.
GRINQH: Graded Input-based Quantization Hierarchy for Efficient LLM Generation cs.LG · 2026-06-22 · unverdicted · none · ref 14 · internal anchor
GRINQH introduces a graded input-based quantization hierarchy that dynamically assigns multi-precision weights using activation magnitudes as importance proxy, unifying quantization with sparsification to improve LLM decoding speed and quality trade-offs on Llama3 and Qwen3 models.
HyperQuant: A Rate-Distortion-Optimal Quantization Pipeline for Large Language and Diffusion Models cs.LG · 2026-06-22 · unverdicted · none · ref 14 · internal anchor
HyperQuant unifies Hadamard transform, optimal lattice quantization, and entropy coding to outperform prior schemes on LLM weight and KV cache quantization down to 1.7 bits per scalar while preserving quality on a 19B DiT model.
TWLA: Achieving Ternary Weights and Low-Bit Activations for LLMs via Post-Training Quantization cs.LG · 2026-06-11 · unverdicted · none · ref 8 · internal anchor
TWLA is a PTQ method using E2M-ATQ, KOTMS, and ILA-AMP to enable W1.58A4 quantization for LLMs with maintained accuracy.
DynamicPTQ: Mitigating Activation Quantization Collapse via Residual-Stream Dynamics cs.LG · 2026-06-10 · unverdicted · none · ref 7 · internal anchor
DynamicPTQ uses new metrics of residual-stream dynamics to apply 8-bit activation precision only to quantization-sensitive layers in W4A4KV4 LLM inference, improving perplexity and QA performance over static smoothing baselines.
LC-QAT: Data-Efficient 2-Bit QAT for LLMs via Linear-Constrained Vector Quantization cs.CL · 2026-06-09 · unverdicted · none · ref 6 · 2 links · internal anchor
LC-QAT achieves data-efficient 2-bit weight-only QAT for LLMs by representing quantized weights as a learned affine transform over discrete vectors, supporting end-to-end optimization from a high-quality PTQ start.
Quality Is Not a Safety Proxy Under Quantization cs.LG · 2026-06-08 · conditional · none · ref 14 · internal anchor
Across 51 quantized checkpoints, quality metrics fail to predict safety drops in 36 pairings and 10 hidden-danger cases, while a new RTSI screen routes all 10 dangerous rows to testing at matched bucket size.
