LLM.int8(): 8-bit Matrix Multiplication for Transformers at Scale

LLM.int8() performs 8-bit inference for transformers with up to 175B parameters and no loss of accuracy by combining vector-wise quantization for most features with 16-bit mixed-precision handling of the systematic outlier feature dimensions.
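As a rough illustration of the method, here is a minimal NumPy sketch of the decomposition, assuming the paper's setup: an activation-magnitude threshold (6.0, the paper's default) selects outlier feature dimensions, which are kept in higher precision (simulated with float32 below), while the remaining features go through vector-wise int8 quantization with one scale per activation row and one per weight column. The function name and structure are illustrative, not the bitsandbytes API.

```python
import numpy as np

def llm_int8_matmul(X, W, outlier_threshold=6.0):
    """Sketch of LLM.int8(): approximate X @ W with int8 for regular
    features and higher precision for outlier feature dimensions.
    X: (tokens, features) activations; W: (features, out_features) weights."""
    # 1. Outlier feature dimensions: columns of X containing any large value.
    outlier_cols = np.any(np.abs(X) > outlier_threshold, axis=0)
    regular_cols = ~outlier_cols

    # 2. The (few) outlier columns are multiplied in full precision.
    out_hi = X[:, outlier_cols] @ W[outlier_cols, :]
    if not regular_cols.any():  # degenerate case: every column is an outlier
        return out_hi

    # 3. Vector-wise quantization of the regular part: one scale per row
    #    of X and one scale per column of W, mapping each max to +/-127.
    Xr, Wr = X[:, regular_cols], W[regular_cols, :]
    sx = np.maximum(np.abs(Xr).max(axis=1, keepdims=True), 1e-12) / 127.0
    sw = np.maximum(np.abs(Wr).max(axis=0, keepdims=True), 1e-12) / 127.0
    Xq = np.round(Xr / sx).astype(np.int8)
    Wq = np.round(Wr / sw).astype(np.int8)

    # 4. Int8 matmul with int32 accumulation, dequantized by the outer
    #    product of row and column scales, then recombined with the outliers.
    acc = Xq.astype(np.int32) @ Wq.astype(np.int32)
    return out_hi + acc * (sx * sw)

# Quick check against the full-precision product on data with one
# injected outlier dimension (as appears systematically in large LMs).
rng = np.random.default_rng(0)
X = rng.normal(size=(4, 64)).astype(np.float32)
X[:, 3] *= 20.0
W = rng.normal(size=(64, 16)).astype(np.float32)
print("max abs error:", np.abs(llm_int8_matmul(X, W) - X @ W).max())
```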
3 Pith papers cite this work; polarity classification is still indexing.

Fields: cs.LG

Representative citing papers:
- LoKA: Low-precision Kernel Applications for Recommendation Models At Scale
LoKA enables practical FP8 use in numerically sensitive large recommendation models via profiling, model adaptations, and runtime kernel orchestration.
- On the Quantization Robustness of Diffusion Language Models in Coding Benchmarks
Diffusion coding model CoDA shows smaller accuracy drops than Qwen3-1.7B under 2-4 bit quantization on HumanEval and MBPP.