FP4 All the Way: Fully Quantized Training of LLMs
Allowing each quantization group to select among multiple 4-bit grids improves accuracy over single-grid FP4 for both post-training and pre-training of LLMs.
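To make the headline idea concrete, here is a minimal NumPy sketch of per-group grid selection, assuming an E2M1-style FP4 grid and a uniform INT4 grid as the candidates and MSE as the selection criterion; the paper's actual grids and criterion may differ.

```python
import numpy as np

def make_grids():
    # E2M1 FP4 magnitudes: 0, 0.5, 1, 1.5, 2, 3, 4, 6 (signed below).
    fp4 = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])
    fp4 = np.concatenate([-fp4[:0:-1], fp4])   # 15 signed levels
    int4 = np.arange(-7.0, 8.0)                # uniform 4-bit grid
    return [fp4, int4]                          # illustrative candidates

def quantize_group(x, grids):
    """Quantize one group against each candidate grid; keep the best."""
    best, best_err = None, np.inf
    for g in grids:
        s = np.abs(x).max() / np.abs(g).max()   # per-group absmax scale
        s = s if s > 0 else 1.0
        idx = np.abs(x[:, None] / s - g[None, :]).argmin(axis=1)
        xq = g[idx] * s                         # dequantized values
        err = np.square(x - xq).mean()
        if err < best_err:
            best, best_err = xq, err
    return best

x = np.random.randn(32)                         # one quantization group
xq = quantize_group(x, make_grids())
print("group MSE:", np.square(x - xq).mean())
```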
7 papers cite this work.
[Chart: citing papers by year, through 2026]
7 representative citing papers
ScaleSearch optimizes block floating-point scales via fine-grained search, cutting quantization error by 27% for NVFP4 and improving PTQ accuracy by up to 15 points on MATH500 for Qwen3-8B and attention PPL by 0.77 on Llama 3.1 70B (a scale-search sketch follows this list).
LoKA enables practical FP8 use in numerically sensitive large recommendation models via profiling, model adaptations, and runtime kernel orchestration.
Weight-gradient quantization is the main driver of instability in full-pipeline FP4 LLM training; it is mitigated by deterministic Hadamard rotations rather than added stochasticity (see the rotation sketch after this list).
LOCALUT delivers a 1.82× geometric-mean speedup for quantized DNN inference on real UPMEM DRAM-PIM devices by using operation-packed LUTs with canonicalization, reordering, and slice streaming.
AdaHOP applies pattern-aware Hadamard transforms and selective outlier extraction to enable from-scratch MXFP4 training of LLMs at BF16 quality, with up to 3.6× memory compression and 1.46× speedup.
HiFloat4, an FP4 format paired with stabilization techniques, trains dense and MoE language models on Ascend NPUs with relative error within 1% of full-precision baselines.
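For the ScaleSearch entry above, a minimal sketch of fine-grained block-scale search, assuming a uniform 4-bit grid, shrink-factor candidates in [0.3, 1.0], and MSE as the objective; none of these specifics are taken from the paper.

```python
import numpy as np

def best_block_scale(x, n_candidates=64):
    """Search shrunken scales for one block instead of using plain absmax."""
    s_max = max(np.abs(x).max() / 7.0, 1e-12)      # baseline absmax scale
    best_s, best_err = s_max, np.inf
    for f in np.linspace(0.3, 1.0, n_candidates):  # candidate shrink factors
        s = f * s_max
        xq = np.round(x / s).clip(-7, 7) * s       # quantize / dequantize
        err = np.square(x - xq).mean()
        if err < best_err:
            best_s, best_err = s, err
    return best_s, best_err

block = np.random.randn(32)
s, err = best_block_scale(block)
print(f"chosen scale {s:.4f}, MSE {err:.6f}")
```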
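For the weight-gradient entry above, a minimal sketch of a deterministic Hadamard rotation applied before quantization; the toy q4 quantizer, block shape, and injected outlier are illustrative assumptions, not the paper's setup. A related transform idea appears in the AdaHOP entry.

```python
import numpy as np
from scipy.linalg import hadamard

def rotate_quantize(g_block, quantize):
    """Quantize a gradient block in a Hadamard-rotated basis.

    The orthonormal rotation spreads per-channel outliers across the
    block, so a low-precision grid wastes less range on rare spikes.
    """
    n = g_block.shape[-1]                               # power of two
    H = hadamard(n).astype(g_block.dtype) / np.sqrt(n)  # orthonormal
    q = quantize(g_block @ H)                           # quantize rotated
    return q @ H.T                                      # rotate back

def q4(x):
    # Toy 4-bit uniform quantizer with a single absmax scale.
    s = np.abs(x).max() / 7.0
    return np.round(x / s).clip(-7, 7) * s

g = np.random.randn(8, 64)
g[:, 3] *= 50                                           # channel outlier
print("plain quantization MSE:  ", np.square(g - q4(g)).mean())
print("rotated quantization MSE:", np.square(g - rotate_quantize(g, q4)).mean())
```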