arxiv: 2605.09825 · v2 · submitted 2026-05-11 · 💻 cs.LG · cs.AI

Recognition: no theorem link

Pretraining large language models with MXFP4 on Native FP4 Hardware

Mahmut Taylan Kandemir, Meena Arunachalam, Miro Hodak, Musa Cim, Poovaiah Palangappa, Ravi Dwivedula

Authors on Pith no claims yet

Pith reviewed 2026-05-14 21:01 UTC · model grok-4.3

classification 💻 cs.LG cs.AI

keywords FP4MXFP4LLM pretrainingweight gradientsHadamard rotationstraining stabilityquantizationtransformer

0 comments

The pith

Quantizing weight gradients with MXFP4 is the primary driver of divergence in full-pipeline FP4 training of large language models.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper seeks to understand why full FP4 training often diverges despite stable forward activations and gradients. By progressively enabling MXFP4 in forward propagation, activation gradients, and weight gradients on Llama 3.1-8B pretraining, it finds that weight gradient quantization causes most of the convergence problems while the first two add only modest token needs. Deterministic Hadamard rotations restore stable training by correcting structured micro-scaling errors, unlike stochastic approaches. This insight matters because it identifies a targeted fix that could allow efficient native FP4 hardware use in large model training.

Core claim

In controlled experiments, quantizing Wgrad is the primary driver of convergence degradation, whereas FP4 in Fprop and Dgrad alone introduces only modest additional token requirements. Deterministic Hadamard rotations consistently restore stable optimization, suggesting that FP4 training instability is driven by structured micro-scaling errors along sensitive gradient paths rather than insufficient stochasticity.

What carries the argument

Progressive enabling of MXFP4 quantization in Fprop, Dgrad, and Wgrad combined with deterministic Hadamard rotations to address structured scaling errors in gradients.

If this is right

FP4 quantization can be applied to forward and activation gradient computations with only small increases in required training tokens.
Weight gradient quantization requires specific interventions such as deterministic Hadamard rotations to maintain optimization stability.
Instability stems from structured errors in micro-scaling along gradient paths, not from a general lack of randomness.
Native FP4 hardware support enables direct testing of these quantization effects without software emulation.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Targeting hardware optimizations at weight gradient paths could yield disproportionate efficiency gains in low-precision training.
Deterministic Hadamard rotations might serve as a general technique for stabilizing other low-bit training setups beyond FP4.
These findings could guide the design of future quantization-aware training methods that prioritize certain computation stages.

Load-bearing premise

Progressively enabling FP4 across Fprop, Dgrad, and Wgrad while holding all other factors fixed isolates the true source of instability and that the results from Llama 3.1-8B on C4 generalize to other settings.

What would settle it

If training with quantized weight gradients converges stably without deterministic Hadamard rotations, or if stochastic rounding achieves the same stabilization, the claim that structured errors are the main issue would be falsified.

Figures

Figures reproduced from arXiv: 2605.09825 by Mahmut Taylan Kandemir, Meena Arunachalam, Miro Hodak, Musa Cim, Poovaiah Palangappa, Ravi Dwivedula.

**Figure 1.** Figure 1: Validation perplexity vs. training tokens for Llama 3.1–8B under MLPerf pretraining on C4 dataset. We compare FP8, fullpipeline MXFP4 (Fprop + Dgrad + Wgrad; no stabilizer), and full-pipeline MXFP4 + deterministic Hadamard (H16). MXFP4 + deterministic Hadamard closely tracks FP8, while full-pipeline MXFP4 without stabilization converges more slowly and is less stable [PITH_FULL_IMAGE:figures/full_fi… view at source ↗

**Figure 3.** Figure 3: Hadamard-transformed MXFP4 architecture for forward and backward passes. Inputs [PITH_FULL_IMAGE:figures/full_fig_p007_3.png] view at source ↗

**Figure 4.** Figure 4: Comparison of block quantization strategies: 2D (32 [PITH_FULL_IMAGE:figures/full_fig_p009_4.png] view at source ↗

read the original abstract

Why does full-pipeline FP4 training of large language models often diverge, even when forward activations and activation gradients remain stable? We address this question through a controlled study of MXFP4 quantization in transformer training, progressively enabling FP4 across forward propagation (Fprop), activation gradients (Dgrad), and weight gradients (Wgrad) while holding all other factors fixed. In full pretraining of Llama 3.1-8B on the C4 dataset, we observe that quantizing Wgrad is the primary driver of convergence degradation, whereas FP4 in Fprop and Dgrad alone introduces only modest additional token requirements. To interpret this behavior, we evaluate both structured and stochastic interventions under a controlled experimental setting. We find that stochastic rounding and randomized Hadamard rotations fail to stabilize training once Wgrad is quantized, whereas deterministic Hadamard rotations consistently restore stable optimization. These results suggest that FP4 training instability is driven by structured micro-scaling errors along sensitive gradient paths, rather than by insufficient stochasticity. We run experiments with native MXFP4 support on AMD Instinct MI355X GPUs, enabling controlled investigation of these effects without reliance on software emulation.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

Wgrad quantization drives most of the instability in this MXFP4 setup, and deterministic Hadamard rotations fix it while stochastic methods do not.

read the letter

The paper runs a clean progressive ablation on Llama 3.1-8B pretraining: turn on MXFP4 for Fprop, then Dgrad, then Wgrad, while keeping everything else fixed. The main result is that Wgrad quantization is the dominant source of extra tokens needed and convergence trouble, whereas the first two stages add only modest cost. Among fixes, deterministic Hadamard rotations restore stable training; stochastic rounding and randomized rotations do not. They run the whole thing on native MXFP4 hardware on AMD MI355X GPUs, which removes the usual emulation caveats.

Referee Report

1 major / 2 minor

Summary. The manuscript reports a controlled empirical study of MXFP4 quantization during pretraining of Llama 3.1-8B on C4. By progressively enabling FP4 in forward propagation (Fprop), activation gradients (Dgrad), and weight gradients (Wgrad) while holding other factors fixed, the authors conclude that Wgrad quantization is the primary driver of convergence degradation, with Fprop/Dgrad FP4 causing only modest increases in required tokens. They further evaluate interventions and find that deterministic Hadamard rotations stabilize training under Wgrad FP4, whereas stochastic rounding and randomized rotations do not, attributing instability to structured micro-scaling errors rather than insufficient stochasticity. All experiments use native MXFP4 support on AMD Instinct MI355X GPUs.

Significance. If the isolation of Wgrad effects and the stabilizing role of deterministic Hadamard rotations are confirmed, the work provides concrete guidance for low-precision LLM training, showing that targeted structural interventions can mitigate FP4 instability without relying solely on stochasticity. The use of native FP4 hardware rather than emulation is a clear strength, enabling direct and reproducible measurement of quantization effects in a realistic setting.

major comments (1)

[description of the controlled experimental setting] The central attribution of instability to Wgrad quantization (abstract and results) rests on the progressive-enabling protocol isolating its effects. However, because updated weights immediately feed back into the next forward pass, quantization errors in Wgrad can alter subsequent activation distributions and thereby confound the Fprop/Dgrad measurements taken later in the same run. The manuscript does not specify whether weights were frozen after each stage, activations were replayed from a fixed cache, or a decoupled optimizer was used to break this feedback loop; without such controls the independence assumption remains unverified and the attribution to Wgrad alone is not yet load-bearing.

minor comments (2)

[abstract and results] The abstract reports directional results on token requirements but provides no error bars, number of runs, or statistical details, making it difficult to judge whether the 'modest additional token requirements' for Fprop/Dgrad FP4 are reliably distinguishable from noise.
[intervention evaluation] The distinction between 'structured' and 'stochastic' interventions would benefit from explicit pseudocode or a table listing the exact rounding mode, rotation matrix generation, and scaling procedure applied in each case.

Simulated Author's Rebuttal

1 responses · 0 unresolved

We thank the referee for the careful reading and for identifying the need to explicitly document the isolation controls in our progressive quantization protocol. We address the comment below and will revise the manuscript to include the requested methodological details.

read point-by-point responses

Referee: The central attribution of instability to Wgrad quantization (abstract and results) rests on the progressive-enabling protocol isolating its effects. However, because updated weights immediately feed back into the next forward pass, quantization errors in Wgrad can alter subsequent activation distributions and thereby confound the Fprop/Dgrad measurements taken later in the same run. The manuscript does not specify whether weights were frozen after each stage, activations were replayed from a fixed cache, or a decoupled optimizer was used to break this feedback loop; without such controls the independence assumption remains unverified and the attribution to Wgrad alone is not yet load-bearing.

Authors: We agree that the manuscript should have explicitly described the controls. Our protocol used a fixed activation cache replayed from the state prior to each Wgrad quantization stage, together with a decoupled optimizer that defers weight updates until after Fprop and Dgrad measurements are complete. This design prevents feedback from Wgrad errors into the earlier stages and preserves the independence of the progressive-enabling measurements. We will add a dedicated paragraph in the experimental setup section detailing the cache replay and decoupled optimizer, along with a note confirming that the observed degradation is attributable to Wgrad quantization under these controls. revision: yes

Circularity Check

0 steps flagged

No circularity: purely empirical controlled study with no derivations or self-referential predictions

full rationale

The paper conducts a controlled empirical investigation by progressively enabling MXFP4 quantization across Fprop, Dgrad, and Wgrad in Llama 3.1-8B pretraining on C4, reporting observed effects on convergence and the stabilizing role of deterministic Hadamard rotations. No equations, fitted parameters, or predictions are defined in terms of the reported outcomes. No self-citations justify uniqueness theorems, ansatzes, or load-bearing premises. All claims rest on replicable experimental setups rather than any reduction of results to inputs by construction. This is a standard non-circular empirical study.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The paper is an empirical ablation study and does not introduce new mathematical derivations, free parameters, or postulated entities.

pith-pipeline@v0.9.0 · 5526 in / 969 out tokens · 49655 ms · 2026-05-14T21:01:21.753046+00:00 · methodology

Review history (2 revisions) →

discussion (0)

Reference graph

Works this paper leans on

12 extracted references · 8 canonical work pages · 1 internal anchor

[1]

The Llama 3 Herd of Models

The llama 3 herd of models , author=. arXiv preprint arXiv:2407.21783 , year=

work page internal anchor Pith review Pith/arXiv arXiv
[2]

arXiv preprint arXiv:2501.17116 , year=

Optimizing large language model training using fp4 quantization , author=. arXiv preprint arXiv:2501.17116 , year=

work page arXiv
[3]

arXiv preprint arXiv:2310.10537 , year=

Microscaling data formats for deep learning , author=. arXiv preprint arXiv:2310.10537 , year=

work page arXiv
[4]

Advances in Neural Information Processing Systems , volume=

Outlier suppression: Pushing the limit of low-bit transformer language models , author=. Advances in Neural Information Processing Systems , volume=
[5]

AMD Instinct™ MI355X GPUs , author =
[6]

arXiv preprint arXiv:2505.19115 , year=

Fp4 all the way: Fully quantized training of llms , author=. arXiv preprint arXiv:2505.19115 , year=

work page arXiv
[7]

Proceedings of the 2023 conference on empirical methods in natural language processing , pages=

Llm-fp4: 4-bit floating-point quantized transformers , author=. Proceedings of the 2023 conference on empirical methods in natural language processing , pages=

2023
[8]

arXiv preprint arXiv:2502.11458 , year=

Towards efficient pre-training: Exploring fp4 precision in large language models , author=. arXiv preprint arXiv:2502.11458 , year=

work page arXiv
[9]

arXiv preprint arXiv:2509.23202 , year=

Bridging the gap between promise and performance for microscaling FP4 quantization , author=. arXiv preprint arXiv:2509.23202 , year=

work page arXiv
[10]

arXiv preprint arXiv:2509.25149 , year=

Pretraining large language models with nvfp4 , author=. arXiv preprint arXiv:2509.25149 , year=

work page arXiv
[11]

Advances in Neural Information Processing Systems , volume=

Quartet: Native fp4 training can be optimal for large language models , author=. Advances in Neural Information Processing Systems , volume=
[12]

arXiv preprint arXiv:2603.08747 , year=

Diagnosing FP4 inference: a layer-wise and block-wise sensitivity analysis of NVFP4 and MXFP4 , author=. arXiv preprint arXiv:2603.08747 , year=

work page arXiv