TWLA is a PTQ method using E2M-ATQ, KOTMS, and ILA-AMP to enable W1.58A4 quantization for LLMs with maintained accuracy.
Pb-llm: Partially binarized large language models
8 Pith papers cite this work. Polarity classification is still indexing.
citation-role summary
citation-polarity summary
roles
background 1polarities
background 1representative citing papers
SAFE-SVD introduces a sensitivity-aware fidelity-enforcing SVD framework for compressing physics foundation models that maintains higher accuracy than standard methods at greater compression ratios.
LBLLM achieves better accuracy than prior binarization methods for LLMs by decoupling weight and activation quantization through initialization, layer-wise distillation, and learnable activation scaling.
GSQ uses Gumbel-Softmax to optimize scalar quantization grids for LLMs, closing most of the accuracy gap to vector methods like QTIP at 2-3 bits per parameter while using symmetric scalar grids compatible with existing kernels.
A progressive training scheme with binary-aware initialization and dual-scaling allows pre-trained LLMs to be converted to high-performance 1-bit models without training from scratch.
SAGE-PTQ is a graph-guided ultra-low-bit PTQ framework that achieves 1.03 average weight bits and 0.004 scaling bits per matrix on LLMs while reporting lower perplexity and memory use than BiLLM and PB-LLM.
GAMMA is a post-training framework that learns stable module sensitivity rankings for mixed-precision LLM quantization and projects them to exact bit budgets via integer programming, enabling reuse across arbitrary memory targets.
BWTA achieves near full-precision accuracy on BERT and LLMs using binary weights and ternary activations, with 16-24x kernel speedups via specialized CUDA kernels.
citing papers explorer
-
TWLA: Achieving Ternary Weights and Low-Bit Activations for LLMs via Post-Training Quantization
TWLA is a PTQ method using E2M-ATQ, KOTMS, and ILA-AMP to enable W1.58A4 quantization for LLMs with maintained accuracy.
-
SAFE-SVD: Sensitivity-Aware Fidelity-Enforcing SVD for Physics Foundation Models
SAFE-SVD introduces a sensitivity-aware fidelity-enforcing SVD framework for compressing physics foundation models that maintains higher accuracy than standard methods at greater compression ratios.
-
LBLLM: Lightweight Binarization of Large Language Models via Three-Stage Distillation
LBLLM achieves better accuracy than prior binarization methods for LLMs by decoupling weight and activation quantization through initialization, layer-wise distillation, and learnable activation scaling.
-
GAMMA: Global Bit Allocation for Mixed-Precision Models under Arbitrary Budgets
GAMMA is a post-training framework that learns stable module sensitivity rankings for mixed-precision LLM quantization and projects them to exact bit budgets via integer programming, enabling reuse across arbitrary memory targets.
-
BWTA: Accurate and Efficient Binarized Transformer by Algorithm-Hardware Co-design
BWTA achieves near full-precision accuracy on BERT and LLMs using binary weights and ternary activations, with 16-24x kernel speedups via specialized CUDA kernels.