PB-LLM: Partially Binarized Large Language Models
3 Pith papers cite this work. Polarity classification is still indexing.
Representative citing papers (3):
- LBLLM: Lightweight Binarization of Large Language Models via Three-Stage Distillation
  LBLLM achieves better accuracy than prior binarization methods for LLMs by decoupling weight and activation quantization through initialization, layer-wise distillation, and learnable activation scaling. (A minimal sketch of such a binarized layer appears after this list.)
- GSQ: Highly-Accurate Low-Precision Scalar Quantization for LLMs via Gumbel-Softmax Sampling
  GSQ applies a Gumbel-Softmax relaxation to learn discrete grid assignments in scalar quantization, closing most of the accuracy gap to vector methods like QTIP on Llama-3.1 models at 2-3 bits while using only symmetric scalar grids. (See the Gumbel-Softmax sketch after this list.)
- BWTA: Accurate and Efficient Binarized Transformer by Algorithm-Hardware Co-design
  BWTA achieves near full-precision accuracy on BERT and LLMs using binary weights and ternary activations, with 16-24x kernel speedups via specialized CUDA kernels. (See the binary-weight/ternary-activation sketch after this list.)
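
The LBLLM summary above names three ingredients: binarized weights, layer-wise distillation, and a learnable activation scale. The PyTorch sketch below is only an illustration of how those pieces commonly fit together; `BinaryLinear`, `act_scale`, and `layerwise_distillation_loss` are hypothetical names and not the paper's code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class BinaryLinear(nn.Module):
    """Linear layer with {-1, +1} weights and a learnable activation scale (illustrative only)."""

    def __init__(self, in_features: int, out_features: int):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(out_features, in_features) * 0.02)
        # Learnable per-input-channel activation scaling, decoupled from the weight scale.
        self.act_scale = nn.Parameter(torch.ones(in_features))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Straight-through estimator: sign() in the forward pass, identity gradient backward.
        w_bin = self.weight + (torch.sign(self.weight) - self.weight).detach()
        # Per-output-channel scale recovers the magnitude lost by binarization.
        alpha = self.weight.abs().mean(dim=1, keepdim=True)
        return F.linear(x * self.act_scale, w_bin * alpha)


def layerwise_distillation_loss(student_out: torch.Tensor, teacher_out: torch.Tensor) -> torch.Tensor:
    # Layer-wise distillation: match each binarized layer's output to its full-precision counterpart.
    return F.mse_loss(student_out, teacher_out)
```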
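GSQ's one-line description is that it learns which scalar grid level each weight maps to via a Gumbel-Softmax relaxation. The snippet below sketches that idea under assumptions of our own (a symmetric 2-bit grid, per-weight logits, a nearest-level initialization); it is not the paper's implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


def gumbel_softmax_quantize(logits: torch.Tensor, grid: torch.Tensor,
                            tau: float = 1.0, hard: bool = True) -> torch.Tensor:
    """Map each weight to one level of a symmetric scalar grid via a relaxed one-hot assignment.

    logits: (N, K) learnable assignment scores, one row per weight
    grid:   (K,) symmetric quantization levels, e.g. the four levels of a 2-bit grid
    """
    # hard=True snaps to a single grid level in the forward pass while keeping
    # gradients from the soft, differentiable sample.
    assign = F.gumbel_softmax(logits, tau=tau, hard=hard, dim=-1)
    return assign @ grid


# Toy usage: learn assignments of five weights onto a symmetric 2-bit grid.
w = torch.randn(5)
grid = torch.tensor([-1.5, -0.5, 0.5, 1.5])
# Initialize logits so each weight prefers its nearest grid level.
logits = nn.Parameter(-(w.unsqueeze(-1) - grid) ** 2)
w_quantized = gumbel_softmax_quantize(logits, grid)
```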
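The BWTA summary only states that weights are binary and activations ternary. The sketch below shows one common way to realize that split; the function names, scaling choices, and the 0.05 threshold are assumptions for illustration, not the paper's algorithm or kernels.

```python
import torch


def binarize_weights(w: torch.Tensor) -> torch.Tensor:
    # {-1, +1} weights, with a per-output-channel scale to preserve magnitude.
    alpha = w.abs().mean(dim=1, keepdim=True)
    return torch.sign(w) * alpha


def ternarize_activations(x: torch.Tensor, sparsity_threshold: float = 0.05) -> torch.Tensor:
    # {-1, 0, +1} activations: values near zero are dropped, the rest keep only their sign.
    t = sparsity_threshold * x.abs().max()
    scale = x.abs().mean()
    return torch.where(x.abs() < t, torch.zeros_like(x), torch.sign(x)) * scale


# On real hardware the binary x ternary product maps to bitwise kernels (hence the
# reported kernel speedups); here we simply emulate it with a float matmul.
w = torch.randn(8, 16)
x = torch.randn(2, 16)
y = ternarize_activations(x) @ binarize_weights(w).t()
```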