pith. machine review for the scientific record.

arxiv: 2306.00978 · v6 · submitted 2023-06-01 · 💻 cs.CL

Recognition: unknown

AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration

Authors on Pith: no claims yet
classification 💻 cs.CL
keywords: quantization, weight, salient, channels, reduce, weights, activation, activation-aware
Original abstract

Large language models (LLMs) have transformed numerous AI applications. On-device LLM is becoming increasingly important: running LLMs locally on edge devices can reduce the cloud computing cost and protect users' privacy. However, the astronomical model size and the limited hardware resources pose significant deployment challenges. We propose Activation-aware Weight Quantization (AWQ), a hardware-friendly approach for LLM low-bit weight-only quantization. AWQ finds that not all weights in an LLM are equally important. Protecting only 1% of salient weights can greatly reduce quantization error. To identify salient weight channels, we should refer to the activation distribution, not the weights. To avoid hardware-inefficient mixed-precision quantization, we mathematically derive that scaling up the salient channels can reduce the quantization error. AWQ employs an equivalent transformation to scale the salient weight channels to protect them. The scale is determined by collecting the activation statistics offline. AWQ does not rely on any backpropagation or reconstruction, so it generalizes to different domains and modalities without overfitting the calibration set. AWQ outperforms existing work on various language modeling and domain-specific benchmarks (coding and math). Thanks to better generalization, it achieves excellent quantization performance for instruction-tuned LMs and, for the first time, multi-modal LMs. Alongside AWQ, we implement TinyChat, an efficient and flexible inference framework tailored for 4-bit on-device LLM/VLMs. With kernel fusion and platform-aware weight packing, TinyChat offers more than 3x speedup over the Hugging Face FP16 implementation on both desktop and mobile GPUs. It also democratizes the deployment of the 70B Llama-2 model on mobile GPUs.
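The core idea in the abstract — scale up salient weight channels (chosen by offline activation statistics) before quantizing, then fold the inverse scale back so no mixed-precision storage is needed — can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the round-to-nearest quantizer, the `alpha` exponent, and the scale normalization are simplifying assumptions for illustration.

```python
import numpy as np

def pseudo_quantize(w, n_bits=4):
    """Symmetric per-output-channel round-to-nearest quantization (illustrative)."""
    q_max = 2 ** (n_bits - 1) - 1
    scale = np.abs(w).max(axis=0, keepdims=True) / q_max + 1e-8
    return np.round(w / scale) * scale

def awq_style_quantize(w, act_mean_abs, alpha=0.5, n_bits=4):
    """Scale salient input channels up before quantizing, fold the inverse back.

    w:            [in_features, out_features] weight matrix
    act_mean_abs: mean |activation| per input channel, collected offline
    alpha:        heuristic exponent for the per-channel scale (assumption)
    """
    s = act_mean_abs ** alpha
    s = s / np.sqrt(s.max() * s.min())   # center the scales around 1 (heuristic)
    w_q = pseudo_quantize(w * s[:, None], n_bits)
    return w_q / s[:, None]              # offline, 1/s can fold into the previous op
```

Because channels with large activations dominate the output error, shrinking their effective quantization step (at the cost of slightly coarser steps elsewhere) typically lowers `‖x(Ŵ − W)‖` versus plain round-to-nearest, without keeping any weights in higher precision.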

This paper has not been read by Pith yet.

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Forward citations

Cited by 28 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Multi-Scale Dequant: Eliminating Dequantization Bottleneck via Activation Decomposition for Efficient LLM Inference

    stat.ML 2026-05 unverdicted novelty 7.0

    MSD eliminates dequantization from the GEMM path by decomposing BF16 activations into multiple low-precision parts that multiply directly with INT8 or MXFP4 weights, achieving near-16 effective bits for INT8 and 6.6 f...

  2. When Quantization Is Free: An int4 KV Cache That Outruns fp16 on Apple Silicon

    cs.PF 2026-05 unverdicted novelty 7.0

    A single fused int4 KV cache kernel on Apple Silicon outperforms fp16 in latency with 3x memory compression and near-zero quality loss on tested models.

  3. Variance Is Not Importance: Structural Analysis of Transformer Compressibility Across Model Scales

    cs.LG 2026-04 unverdicted novelty 7.0

    High-variance activation directions are uncorrelated with predictions, transformer blocks grow more linear with depth, and single-block linear replacement yields 34x compression on Mistral's final block at a 1.71 perp...

  4. Why and When Visual Token Pruning Fails? A Study on Relevant Visual Information Shift in MLLMs Decoding

    cs.CV 2026-04 unverdicted novelty 7.0

    Visual token pruning in MLLMs fails on complex reasoning due to Relevant Visual Information Shift during decoding, but the DSTP framework fixes it training-free across models.

  5. Medusa: Simple LLM Inference Acceleration Framework with Multiple Decoding Heads

    cs.LG 2024-01 conditional novelty 7.0

    Medusa augments LLMs with multiple decoding heads and tree-based attention to predict and verify several tokens in parallel, yielding 2.2-3.6x inference speedup via two fine-tuning regimes.

  6. Code-Centric Detection of Vulnerability-Fixing Commits: A Unified Benchmark and Empirical Study

    cs.SE 2026-05 accept novelty 6.0

    Code language models show no transferable security understanding from code diffs alone, rely on commit messages, miss over 93% of fixes at 0.5% false positive rate, and suffer large drops under group or temporal splits.

  7. OSAQ: Outlier Self-Absorption for Accurate Low-bit LLM Quantization

    cs.LG 2026-05 unverdicted novelty 6.0

    OSAQ uses the low-rank structure of the Hessian to construct a closed-form additive weight transformation that suppresses outliers without changing task loss, enabling better low-bit LLM quantization.

  8. OSAQ: Outlier Self-Absorption for Accurate Low-bit LLM Quantization

    cs.LG 2026-05 unverdicted novelty 6.0

    OSAQ suppresses weight outliers in LLMs via a closed-form additive transformation from the Hessian's stable null space, improving 2-bit quantization perplexity by over 40% versus vanilla GPTQ with no inference overhead.

  9. Technical Report: Activation Residual Hessian Quantization (ARHQ) for Low-Bit LLM Quantization

    cs.LG 2026-04 unverdicted novelty 6.0

    ARHQ isolates error-sensitive weight directions in LLMs via truncated SVD on the scaled matrix W G_x^{1/2} from activation residuals, improving SNR and preserving performance under aggressive low-bit quantization.

  10. BitRL: Reinforcement Learning with 1-bit Quantized Language Models for Resource-Constrained Edge Deployment

    cs.LG 2026-04 unverdicted novelty 6.0

    BitRL enables on-device RL agents via 1-bit quantized language models, delivering 10-16x memory reduction and 3-5x energy efficiency gains with 85-98% retained performance.

  11. MCAP: Deployment-Time Layer Profiling for Memory-Constrained LLM Inference

    cs.LG 2026-04 unverdicted novelty 6.0

    MCAP uses load-time Monte Carlo profiling to estimate layer importance, enabling dynamic quantization (W4A8 vs W4A16) and memory tiering (GPU/RAM/SSD) that delivers 1.5-1.8x higher decode throughput than llama-cpp Q4_...

  12. Parcae: Scaling Laws For Stable Looped Language Models

    cs.LG 2026-04 unverdicted novelty 6.0

    Parcae stabilizes looped LLMs via spectral norm constraints on injection parameters, enabling power-law scaling for training FLOPs and saturating exponential scaling at test time that improves quality over fixed-depth...

  13. Quantization Dominates Rank Reduction for KV-Cache Compression

    cs.LG 2026-04 conditional novelty 6.0

    Quantization of the KV cache beats rank reduction for matched storage budgets by 4-364 PPL, because dimension removal can flip attention token selection under softmax while bounded quantization noise usually preserves...

  14. Rethinking Residual Errors in Compensation-based LLM Quantization

    cs.LG 2026-04 conditional novelty 6.0

    Redefining residual errors to include compensation-aware discrepancies and realigning calibration to full-precision outputs improves GPTQ and GPTAQ performance on LLMs.

  15. FP4 Explore, BF16 Train: Diffusion Reinforcement Learning via Efficient Rollout Scaling

    cs.LG 2026-04 unverdicted novelty 6.0

    Sol-RL decouples FP4-based candidate exploration from BF16 policy optimization in diffusion RL, delivering up to 4.64x faster convergence with maintained or superior alignment performance on models like FLUX.1 and SD3.5.

  16. RUQuant: Towards Refining Uniform Quantization for Large Language Models

    cs.CL 2026-04 unverdicted novelty 6.0

    RUQuant uses block-wise composite orthogonal matrices from Householder reflections and Givens rotations plus a fine-tuned global reflection to achieve 99.8% full-precision accuracy at W6A6 and 97% at W4A4 for 13B LLMs...

  17. Seed-TTS: A Family of High-Quality Versatile Speech Generation Models

    eess.AS 2024-06 unverdicted novelty 6.0

    Seed-TTS models produce speech matching human naturalness and speaker similarity, with added controllability via self-distillation and reinforcement learning.

  18. KIVI: A Tuning-Free Asymmetric 2bit Quantization for KV Cache

    cs.CL 2024-02 conditional novelty 6.0

    KIVI applies asymmetric 2-bit quantization to KV cache with per-channel keys and per-token values, reducing memory 2.6x and boosting throughput up to 3.47x with near-identical quality on Llama, Falcon, and Mistral.

  19. SGLang: Efficient Execution of Structured Language Model Programs

    cs.AI 2023-12 conditional novelty 6.0

    SGLang is a new system that speeds up structured LLM programs by up to 6.4x using RadixAttention for KV cache reuse and compressed finite state machines for output decoding.

  20. Position: LLM Inference Should Be Evaluated as Energy-to-Token Production

    cs.CE 2026-05 unverdicted novelty 5.0

    LLM inference should be reframed and evaluated as energy-to-token production with a Token Production Function that accounts for power, cooling, and efficiency ceilings.

  21. RadLite: Multi-Task LoRA Fine-Tuning of Small Language Models for CPU-Deployable Radiology AI

    cs.CL 2026-05 unverdicted novelty 5.0

    LoRA fine-tuning of 3-4B SLMs on 162K multi-task radiology data yields strong performance deployable on consumer CPUs at 4-8 tokens/second.

  22. Three Roles, One Model: Role Orchestration at Inference Time to Close the Performance Gap Between Small and Large Agents

    cs.AI 2026-04 unverdicted novelty 5.0

    Orchestrating one 8B model in three roles at inference time doubles task completion on AppWorld from 5.4% to 8.9%, surpassing a 33B baseline.

  23. Fast NF4 Dequantization Kernels for Large Language Model Inference

    cs.LG 2026-04 unverdicted novelty 5.0

    A lightweight shared-memory technique for NF4 dequantization kernels yields 2.0-2.2x kernel speedup and 1.54x end-to-end gains on models up to 70B parameters while using only 64 bytes of shared memory per block.

  24. DuQuant++: Fine-grained Rotation Enhances Microscaling FP4 Quantization

    cs.CV 2026-04 unverdicted novelty 4.0

    DuQuant++ adapts outlier-aware fine-grained rotation to MXFP4 by matching block size to the 32-element microscaling group, enabling a single rotation that smooths distributions and achieves SOTA performance on LLaMA-3...

  25. Pushing the Limits of On-Device Streaming ASR: A Compact, High-Accuracy English Model for Low-Latency Inference

    cs.AI 2026-04 unverdicted novelty 4.0

    A quantized int4 version of Nemotron ASR runs faster than real-time on CPU at 8.20% WER and 0.67 GB size, setting a new efficiency point for on-device streaming speech recognition.

  26. Secure On-Premise Deployment of Open-Weights Large Language Models in Radiology: An Isolation-First Architecture with Prospective Pilot Evaluation

    cs.CY 2026-03 conditional novelty 4.0

    An isolation-first on-premise architecture for open-weights LLMs in radiology achieved regulatory approval for processing PHI and showed good utility for text-anchored tasks in a one-week pilot with 22 users.

  27. Qwen Goes Brrr: Off-the-Shelf RAG for Ukrainian Multi-Domain Document Understanding

    cs.CL 2026-05 unverdicted novelty 3.0

    A RAG pipeline with contextual PDF chunking, question-and-answer-aware retrieval and reranking using Qwen3 models reaches 0.96 accuracy on a Ukrainian multi-domain document QA shared task.

  28. A Survey on Efficient Inference for Large Language Models

    cs.CL 2024-04 accept novelty 3.0

    The paper surveys techniques to speed up and reduce the resource needs of LLM inference, organized by data-level, model-level, and system-level changes, with comparative experiments on representative methods.