Allowing each quantization group to select among multiple 4-bit grids improves accuracy over single-grid FP4 for both post-training and pre-training of LLMs.
Training llms with mxfp4
8 Pith papers cite this work. Polarity classification is still indexing.
citation-role summary
citation-polarity summary
roles
background 1polarities
background 1representative citing papers
Exact reduction yields unconditional lower bounds TB = Omega(d) and T = Omega((sigma^2 d / eps^2) max{1, d/B}) for B-bit stochastic first-order oracles.
Four Over Six adaptively scales blocks in NVFP4 quantization to smaller FP4 values, making representable value distributions more uniform and reducing quantization error especially for near-maximal values.
Low-precision Flash Attention fails due to similar low-rank attention representations combined with biased rounding errors that accumulate and corrupt weight updates; a minimal fix to reduce rounding bias stabilizes training.
MXFP4 quantization error decomposes into scale bias, deadzone truncation, and grid noise; mode-targeted corrections recover BF16 accuracy within 0.7% on Qwen2.5-3B and exceed it by 1.0% on Qwen3-30B-A3B.
ScaleSearch optimizes block floating point scales via fine-grained search to cut quantization error by 27% for NVFP4, improving PTQ by up to 15 points on MATH500 for Qwen3-8B and attention PPL by 0.77 on Llama 3.1 70B.
Stochastic rounding lifts clusters of small singular values even in constant aspect ratio matrices, extending its role as a spectral regularizer.
VFA optimizes Flash Attention by pre-computing global max approximations from key blocks and reordering traversal to reduce vector bottlenecks while preserving exact computation.
citing papers explorer
-
Grid Games: The Power of Multiple Grids for Quantizing Large Language Models
Allowing each quantization group to select among multiple 4-bit grids improves accuracy over single-grid FP4 for both post-training and pre-training of LLMs.
-
Information-Theoretic Lower Bounds for Bit-Constrained Stochastic Optimization via a Reduction to Compressed Gaussian Mean Estimation
Exact reduction yields unconditional lower bounds TB = Omega(d) and T = Omega((sigma^2 d / eps^2) max{1, d/B}) for B-bit stochastic first-order oracles.
-
Four Over Six: More Accurate NVFP4 Quantization with Adaptive Block Scaling
Four Over Six adaptively scales blocks in NVFP4 quantization to smaller FP4 values, making representable value distributions more uniform and reducing quantization error especially for near-maximal values.
-
Why Low-Precision Transformer Training Fails: An Analysis on Flash Attention
Low-precision Flash Attention fails due to similar low-rank attention representations combined with biased rounding errors that accumulate and corrupt weight updates; a minimal fix to reduce rounding bias stabilizes training.
-
Decomposing MXFP4 quantization error for LLM reinforcement learning: reducible bias, recoverable deadzone, and an irreducible floor
MXFP4 quantization error decomposes into scale bias, deadzone truncation, and grid noise; mode-targeted corrections recover BF16 accuracy within 0.7% on Qwen2.5-3B and exceed it by 1.0% on Qwen3-30B-A3B.
-
Search Your Block Floating Point Scales!
ScaleSearch optimizes block floating point scales via fine-grained search to cut quantization error by 27% for NVFP4, improving PTQ by up to 15 points on MATH500 for Qwen3-8B and attention PPL by 0.77 on Llama 3.1 70B.
-
Stochastic Rounding Increases Small Singular Values
Stochastic rounding lifts clusters of small singular values even in constant aspect ratio matrices, extending its role as a spectral regularizer.
-
VFA: Relieving Vector Operations in Flash Attention with Global Maximum Pre-computation
VFA optimizes Flash Attention by pre-computing global max approximations from key blocks and reordering traversal to reduce vector bottlenecks while preserving exact computation.