Recognition: no theorem link
Pretraining large language models with MXFP4 on Native FP4 Hardware
Pith reviewed 2026-05-14 21:01 UTC · model grok-4.3
The pith
Quantizing weight gradients with MXFP4 is the primary driver of divergence in full-pipeline FP4 training of large language models.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
In controlled experiments, quantizing Wgrad is the primary driver of convergence degradation, whereas FP4 in Fprop and Dgrad alone introduces only modest additional token requirements. Deterministic Hadamard rotations consistently restore stable optimization, suggesting that FP4 training instability is driven by structured micro-scaling errors along sensitive gradient paths rather than insufficient stochasticity.
What carries the argument
Progressive enabling of MXFP4 quantization in Fprop, Dgrad, and Wgrad combined with deterministic Hadamard rotations to address structured scaling errors in gradients.
If this is right
- FP4 quantization can be applied to forward and activation gradient computations with only small increases in required training tokens.
- Weight gradient quantization requires specific interventions such as deterministic Hadamard rotations to maintain optimization stability.
- Instability stems from structured errors in micro-scaling along gradient paths, not from a general lack of randomness.
- Native FP4 hardware support enables direct testing of these quantization effects without software emulation.
Where Pith is reading between the lines
- Targeting hardware optimizations at weight gradient paths could yield disproportionate efficiency gains in low-precision training.
- Deterministic Hadamard rotations might serve as a general technique for stabilizing other low-bit training setups beyond FP4.
- These findings could guide the design of future quantization-aware training methods that prioritize certain computation stages.
Load-bearing premise
Progressively enabling FP4 across Fprop, Dgrad, and Wgrad while holding all other factors fixed isolates the true source of instability and that the results from Llama 3.1-8B on C4 generalize to other settings.
What would settle it
If training with quantized weight gradients converges stably without deterministic Hadamard rotations, or if stochastic rounding achieves the same stabilization, the claim that structured errors are the main issue would be falsified.
Figures
read the original abstract
Why does full-pipeline FP4 training of large language models often diverge, even when forward activations and activation gradients remain stable? We address this question through a controlled study of MXFP4 quantization in transformer training, progressively enabling FP4 across forward propagation (Fprop), activation gradients (Dgrad), and weight gradients (Wgrad) while holding all other factors fixed. In full pretraining of Llama 3.1-8B on the C4 dataset, we observe that quantizing Wgrad is the primary driver of convergence degradation, whereas FP4 in Fprop and Dgrad alone introduces only modest additional token requirements. To interpret this behavior, we evaluate both structured and stochastic interventions under a controlled experimental setting. We find that stochastic rounding and randomized Hadamard rotations fail to stabilize training once Wgrad is quantized, whereas deterministic Hadamard rotations consistently restore stable optimization. These results suggest that FP4 training instability is driven by structured micro-scaling errors along sensitive gradient paths, rather than by insufficient stochasticity. We run experiments with native MXFP4 support on AMD Instinct MI355X GPUs, enabling controlled investigation of these effects without reliance on software emulation.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript reports a controlled empirical study of MXFP4 quantization during pretraining of Llama 3.1-8B on C4. By progressively enabling FP4 in forward propagation (Fprop), activation gradients (Dgrad), and weight gradients (Wgrad) while holding other factors fixed, the authors conclude that Wgrad quantization is the primary driver of convergence degradation, with Fprop/Dgrad FP4 causing only modest increases in required tokens. They further evaluate interventions and find that deterministic Hadamard rotations stabilize training under Wgrad FP4, whereas stochastic rounding and randomized rotations do not, attributing instability to structured micro-scaling errors rather than insufficient stochasticity. All experiments use native MXFP4 support on AMD Instinct MI355X GPUs.
Significance. If the isolation of Wgrad effects and the stabilizing role of deterministic Hadamard rotations are confirmed, the work provides concrete guidance for low-precision LLM training, showing that targeted structural interventions can mitigate FP4 instability without relying solely on stochasticity. The use of native FP4 hardware rather than emulation is a clear strength, enabling direct and reproducible measurement of quantization effects in a realistic setting.
major comments (1)
- [description of the controlled experimental setting] The central attribution of instability to Wgrad quantization (abstract and results) rests on the progressive-enabling protocol isolating its effects. However, because updated weights immediately feed back into the next forward pass, quantization errors in Wgrad can alter subsequent activation distributions and thereby confound the Fprop/Dgrad measurements taken later in the same run. The manuscript does not specify whether weights were frozen after each stage, activations were replayed from a fixed cache, or a decoupled optimizer was used to break this feedback loop; without such controls the independence assumption remains unverified and the attribution to Wgrad alone is not yet load-bearing.
minor comments (2)
- [abstract and results] The abstract reports directional results on token requirements but provides no error bars, number of runs, or statistical details, making it difficult to judge whether the 'modest additional token requirements' for Fprop/Dgrad FP4 are reliably distinguishable from noise.
- [intervention evaluation] The distinction between 'structured' and 'stochastic' interventions would benefit from explicit pseudocode or a table listing the exact rounding mode, rotation matrix generation, and scaling procedure applied in each case.
Simulated Author's Rebuttal
We thank the referee for the careful reading and for identifying the need to explicitly document the isolation controls in our progressive quantization protocol. We address the comment below and will revise the manuscript to include the requested methodological details.
read point-by-point responses
-
Referee: The central attribution of instability to Wgrad quantization (abstract and results) rests on the progressive-enabling protocol isolating its effects. However, because updated weights immediately feed back into the next forward pass, quantization errors in Wgrad can alter subsequent activation distributions and thereby confound the Fprop/Dgrad measurements taken later in the same run. The manuscript does not specify whether weights were frozen after each stage, activations were replayed from a fixed cache, or a decoupled optimizer was used to break this feedback loop; without such controls the independence assumption remains unverified and the attribution to Wgrad alone is not yet load-bearing.
Authors: We agree that the manuscript should have explicitly described the controls. Our protocol used a fixed activation cache replayed from the state prior to each Wgrad quantization stage, together with a decoupled optimizer that defers weight updates until after Fprop and Dgrad measurements are complete. This design prevents feedback from Wgrad errors into the earlier stages and preserves the independence of the progressive-enabling measurements. We will add a dedicated paragraph in the experimental setup section detailing the cache replay and decoupled optimizer, along with a note confirming that the observed degradation is attributable to Wgrad quantization under these controls. revision: yes
Circularity Check
No circularity: purely empirical controlled study with no derivations or self-referential predictions
full rationale
The paper conducts a controlled empirical investigation by progressively enabling MXFP4 quantization across Fprop, Dgrad, and Wgrad in Llama 3.1-8B pretraining on C4, reporting observed effects on convergence and the stabilizing role of deterministic Hadamard rotations. No equations, fitted parameters, or predictions are defined in terms of the reported outcomes. No self-citations justify uniqueness theorems, ansatzes, or load-bearing premises. All claims rest on replicable experimental setups rather than any reduction of results to inputs by construction. This is a standard non-circular empirical study.
Axiom & Free-Parameter Ledger
Reference graph
Works this paper leans on
-
[1]
The llama 3 herd of models , author=. arXiv preprint arXiv:2407.21783 , year=
work page internal anchor Pith review Pith/arXiv arXiv
-
[2]
arXiv preprint arXiv:2501.17116 , year=
Optimizing large language model training using fp4 quantization , author=. arXiv preprint arXiv:2501.17116 , year=
-
[3]
arXiv preprint arXiv:2310.10537 , year=
Microscaling data formats for deep learning , author=. arXiv preprint arXiv:2310.10537 , year=
-
[4]
Advances in Neural Information Processing Systems , volume=
Outlier suppression: Pushing the limit of low-bit transformer language models , author=. Advances in Neural Information Processing Systems , volume=
-
[5]
AMD Instinct™ MI355X GPUs , author =
-
[6]
arXiv preprint arXiv:2505.19115 , year=
Fp4 all the way: Fully quantized training of llms , author=. arXiv preprint arXiv:2505.19115 , year=
-
[7]
Proceedings of the 2023 conference on empirical methods in natural language processing , pages=
Llm-fp4: 4-bit floating-point quantized transformers , author=. Proceedings of the 2023 conference on empirical methods in natural language processing , pages=
2023
-
[8]
arXiv preprint arXiv:2502.11458 , year=
Towards efficient pre-training: Exploring fp4 precision in large language models , author=. arXiv preprint arXiv:2502.11458 , year=
-
[9]
arXiv preprint arXiv:2509.23202 , year=
Bridging the gap between promise and performance for microscaling FP4 quantization , author=. arXiv preprint arXiv:2509.23202 , year=
-
[10]
arXiv preprint arXiv:2509.25149 , year=
Pretraining large language models with nvfp4 , author=. arXiv preprint arXiv:2509.25149 , year=
-
[11]
Advances in Neural Information Processing Systems , volume=
Quartet: Native fp4 training can be optimal for large language models , author=. Advances in Neural Information Processing Systems , volume=
-
[12]
arXiv preprint arXiv:2603.08747 , year=
Diagnosing FP4 inference: a layer-wise and block-wise sensitivity analysis of NVFP4 and MXFP4 , author=. arXiv preprint arXiv:2603.08747 , year=
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.