When Flat Minima Fail: Characterizing INT4 Quantization Collapse After FP32 Convergence
Pith reviewed 2026-05-10 11:40 UTC · model grok-4.3
The pith
INT4 quantization error explodes after FP32 perplexity stops improving.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
A calibration-free per-group INT4 probe applied to all 154 Pythia-160m checkpoints reveals a three-phase structure: rapid improvement while FP32 perplexity falls, a roughly 70,000-step meta-stable plateau, and an explosive divergence phase in which the INT4 gap grows from 11% to 517%, with the onset coinciding exactly with FP32 perplexity convergence; INT8 remains stable throughout, and kurtosis measurements rule out outlier accumulation as the cause.
What carries the argument
The calibration-free per-group INT4 probe that measures the gap between FP32 and quantized perplexity at every training checkpoint without any calibration data or fine-tuning.
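The released probe is not reproduced on this page; the sketch below shows what a calibration-free per-group INT4 probe can look like, assuming symmetric round-to-nearest quantization with one absmax scale per group of 128 weights (the group size and scaling scheme are illustrative assumptions, not details taken from the paper), and with `eval_ppl` standing in for the held-out perplexity evaluation:

```python
import torch

def quantize_int4_per_group(weight: torch.Tensor, group_size: int = 128) -> torch.Tensor:
    """Symmetric round-to-nearest INT4 with one absmax scale per group of weights."""
    rows, cols = weight.shape
    assert cols % group_size == 0, "illustrative sketch assumes the group size divides the row length"
    w = weight.reshape(rows, cols // group_size, group_size)
    scale = (w.abs().amax(dim=-1, keepdim=True) / 7.0).clamp(min=1e-8)  # 16 levels in [-8, 7]
    q = torch.clamp(torch.round(w / scale), -8, 7)
    return (q * scale).reshape(rows, cols)

@torch.no_grad()
def int4_gap(model: torch.nn.Module, eval_ppl) -> float:
    """Relative perplexity degradation (PPL_int4 - PPL_fp32) / PPL_fp32 at one checkpoint.

    `eval_ppl(model) -> float` is a stand-in for perplexity evaluation on held-out text;
    no calibration data and no fine-tuning are used anywhere.
    """
    ppl_fp32 = eval_ppl(model)
    saved = {}
    for name, module in model.named_modules():
        if isinstance(module, torch.nn.Linear):
            saved[name] = module.weight.data.clone()
            module.weight.data = quantize_int4_per_group(module.weight.data)
    ppl_int4 = eval_ppl(model)
    for name, module in model.named_modules():  # restore the original FP32 weights
        if name in saved:
            module.weight.data = saved[name]
    return (ppl_int4 - ppl_fp32) / ppl_fp32
```

Because the probe needs no calibration set or fine-tuning, it can be run identically on every checkpoint in a training trajectory, which is what makes the full 154-checkpoint audit tractable.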
If this is right
- Post-convergence weight updates, not learning-rate decay itself, drive the growth in INT4 error.
- Different learning-rate schedules after the plateau produce measurably different INT4 gaps, with SGDR accelerating collapse and certain oscillatory schedules reducing it.
- The failure is specific to the 16-level INT4 grid and does not occur under INT8 quantization.
Where Pith is reading between the lines
- Training recipes may need explicit monitoring of quantization readiness during the final stages rather than stopping at FP32 convergence.
- Schedule designs that keep the model in settled cool phases after convergence could preserve low-bit deployability without extra calibration.
- The same probe could be applied to other coarse quantization grids to test whether similar late-training instabilities exist.
Load-bearing premise
The three-phase divergence and its onset at FP32 convergence observed in Pythia-160m will appear in other model families, larger scales, and pipelines that include calibration or different grouping.
What would settle it
Running the identical per-group INT4 probe on the full checkpoint history of a different model family such as Llama and finding that the INT4 gap stays bounded after FP32 perplexity has converged.
Original abstract
Post-training quantization (PTQ) assumes that a well-converged model is a quantization-ready model. We show this assumption fails in a structured, measurable, and previously uncharacterized way. Using a calibration-free per-group INT4 probe applied to all 154 publicly available Pythia-160m training checkpoints, we identify a three-phase divergence structure: a rapid-learning phase where both FP32 perplexity and quantization robustness improve together, a meta-stable plateau lasting roughly 70,000 steps where FP32 perplexity stagnates but the INT4 gap remains bounded, and an explosive divergence phase where the INT4 gap compounds from 11% to 517% while FP32 perplexity barely moves. Critically, this divergence begins not when the learning rate starts decaying, but precisely when FP32 perplexity converges, a finer-grained onset predictor that implies post-convergence weight updates, rather than decay magnitude alone, are the proximate cause. We further show that INT8 quantization is entirely immune throughout all three phases, constraining the mechanism to the coarseness of the 16-level INT4 grid specifically, and rule out weight outlier accumulation as the mechanism via direct kurtosis measurement. Finally, we conduct a controlled fork experiment from the pre-divergence checkpoint comparing three learning rate schedules (cosine continuation, SGDR warm restarts, and our proposed Oscillatory Lock-In) across nine independent runs. SGDR uniformly accelerates divergence (0/9 pairwise wins against cosine), while OLI's settled cool phases reduce the INT4 gap by 2.2 percentage points on average (t = -5.46, p < 0.0001), demonstrating that schedule amplitude calibration, not oscillation alone, determines whether perturbation helps or hurts. Our code, probe implementation, and all 154-checkpoint audit results are released publicly.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper claims that the standard PTQ assumption—that a well-converged FP32 model is quantization-ready—fails for INT4 in a structured three-phase manner. Using a calibration-free per-group INT4 probe on all 154 public Pythia-160m checkpoints, it identifies a rapid-learning phase, a ~70k-step meta-stable plateau, and an explosive divergence phase where the INT4 gap grows from 11% to 517% precisely at FP32 perplexity convergence (not at LR decay onset). It shows the effect is INT4-specific (INT8 immune), not driven by weight outliers (via kurtosis), and that LR schedules matter: SGDR accelerates divergence while a proposed Oscillatory Lock-In schedule reduces the gap by 2.2 points on average (t=-5.46, p<0.0001) across 9 runs. Code, probe, and audit results are released.
Significance. If the three-phase structure and schedule sensitivity generalize, the work meaningfully challenges PTQ practice by providing a finer-grained onset predictor tied to post-convergence updates and demonstrating that training dynamics after FP32 convergence can be tuned to preserve quantization robustness. The public release of the 154-checkpoint audit, probe implementation, and reproducible fork experiments is a clear strength, enabling direct verification and extension.
major comments (2)
- [§3 and §5] §3 (Three-Phase Divergence) and §5 (Schedule Experiments): The central claim that post-convergence weight updates (rather than LR decay) drive INT4 collapse is supported by the checkpoint audit and fork results for Pythia-160m, but the manuscript tests only this single 160M model; without replication on at least one larger scale (>1B) or different family, the title's general characterization of INT4 collapse remains provisional.
- [§2] §2 (Quantization Probe): All reported INT4 gaps and phase timings derive from the calibration-free per-group probe; the paper does not benchmark this probe against standard calibrated PTQ pipelines (e.g., with calibration data or methods like GPTQ), leaving open whether the observed explosive divergence would appear under production quantization settings.
minor comments (3)
- [Methods] The definition of the INT4 gap metric (relative perplexity degradation) should be stated explicitly with its formula in the methods section for immediate clarity; a plausible form is sketched after this list.
- [Figures] Figure captions for the phase plots and schedule comparisons should include exact step ranges and run counts to aid interpretation without cross-referencing the text.
- [Discussion] A brief discussion of why the meta-stable plateau lasts ~70k steps would strengthen the mechanistic interpretation even if speculative.
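For reference, one plausible form of the gap metric, consistent with the percentages quoted above but not copied from the manuscript, is:

```latex
\mathrm{gap}_{\mathrm{INT4}}(t) \;=\; \frac{\mathrm{PPL}_{\mathrm{INT4}}(t) - \mathrm{PPL}_{\mathrm{FP32}}(t)}{\mathrm{PPL}_{\mathrm{FP32}}(t)}
```

where t indexes the training checkpoint; the reported 11% and 517% figures would then be this quantity at the start and end of the divergence phase.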
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed review. The comments highlight key considerations around generalizability and alignment with production quantization practices. We address each major comment point by point below, with honest indications of planned revisions.
Point-by-point responses
- Referee: [§3 and §5] §3 (Three-Phase Divergence) and §5 (Schedule Experiments): The central claim that post-convergence weight updates (rather than LR decay) drive INT4 collapse is supported by the checkpoint audit and fork results for Pythia-160m, but the manuscript tests only this single 160M model; without replication on at least one larger scale (>1B) or different family, the title's general characterization of INT4 collapse remains provisional.
Authors: We agree that the empirical foundation is limited to Pythia-160m and its 154 publicly released checkpoints. This scale was selected specifically because the full training trajectory is available for exhaustive auditing, which would not be possible for most larger models. Replicating the complete checkpoint audit and fork experiments on a >1B-parameter model or different family is not feasible at present, as it would require either equivalent public checkpoints or the compute to generate them. The title frames the contribution as a characterization of the observed phenomenon rather than a universal claim across all scales. In the revised manuscript we will add explicit scope statements in the introduction, abstract, and a dedicated limitations paragraph, while noting that the released probe and data enable community extensions. This is a partial revision.
- Referee: [§2] §2 (Quantization Probe): All reported INT4 gaps and phase timings derive from the calibration-free per-group probe; the paper does not benchmark this probe against standard calibrated PTQ pipelines (e.g., with calibration data or methods like GPTQ), leaving open whether the observed explosive divergence would appear under production quantization settings.
Authors: The calibration-free per-group probe was deliberately designed to isolate the intrinsic effect of post-convergence weight updates on quantization error, free from calibration-data selection or optimization artifacts. This choice enables a controlled measurement across every checkpoint. We acknowledge that the magnitude of divergence could differ under calibrated production pipelines such as GPTQ. To address the concern, the revised manuscript will include a new discussion subsection with a limited comparison: we will apply a simple per-group scaling calibration (using a small held-out calibration set) to representative checkpoints from each phase and report the resulting INT4 gaps. Full benchmarking against GPTQ on all 154 checkpoints is beyond the scope of this revision due to computational cost but will be flagged as valuable future work. This is a partial revision.
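As a concrete illustration of the kind of comparison the rebuttal proposes, the sketch below refines the per-group absmax scale by searching a small grid of clipping ratios against layer-output MSE on a calibration batch; the clip grid, the MSE objective, and the function names are illustrative assumptions, not the authors' planned implementation:

```python
import torch

def quantize_int4_clipped(weight: torch.Tensor, group_size: int = 128, clip: float = 1.0) -> torch.Tensor:
    """Per-group symmetric INT4 round-to-nearest, with the absmax scale shrunk by `clip`."""
    rows, cols = weight.shape
    w = weight.reshape(rows, cols // group_size, group_size)
    scale = (clip * w.abs().amax(dim=-1, keepdim=True) / 7.0).clamp(min=1e-8)
    q = torch.clamp(torch.round(w / scale), -8, 7)
    return (q * scale).reshape(rows, cols)

@torch.no_grad()
def calibrated_int4(weight: torch.Tensor, x_calib: torch.Tensor,
                    clips=(1.0, 0.95, 0.9, 0.85, 0.8)) -> torch.Tensor:
    """Pick the clip ratio that minimizes layer-output MSE on a small calibration batch.

    A hypothetical stand-in for a 'simple per-group scaling calibration'; the
    calibration-free probe corresponds to fixing clip = 1.0 and skipping the search.
    """
    y_ref = x_calib @ weight.t()  # FP32 reference outputs on calibration activations
    best_w, best_err = None, float("inf")
    for c in clips:
        w_q = quantize_int4_clipped(weight, clip=c)
        err = (x_calib @ w_q.t() - y_ref).pow(2).mean().item()
        if err < best_err:
            best_w, best_err = w_q, err
    return best_w
```

Comparing the gap under this calibrated variant against the calibration-free probe on representative checkpoints from each phase would indicate how much of the reported divergence survives light calibration.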
Circularity Check
No circularity; results are direct empirical measurements from checkpoints and experiments.
Full rationale
The paper's central claims rest on applying a calibration-free per-group INT4 probe to all 154 public Pythia-160m checkpoints, documenting three observed phases in quantization gap behavior, timing relative to FP32 convergence, immunity of INT8, kurtosis measurements ruling out outliers, and controlled fork experiments comparing learning rate schedules. No equations, derivations, or fitted parameters are present. No self-citations are load-bearing for any premise, no ansatzes are smuggled, and no results are renamed known patterns or self-defined by construction. All findings are falsifiable via the released code and checkpoint audits, making the work self-contained.
Axiom & Free-Parameter Ledger
axioms (2)
- Domain assumption: The Pythia-160m architecture and its public training checkpoints are representative for characterizing INT4 quantization behavior during training.
- Domain assumption: The calibration-free per-group INT4 probe produces measurements that reflect practical post-training quantization outcomes.
Reference graph
Works this paper leans on
- [1] S. Biderman, H. Schoelkopf, Q. Anthony, H. Bradley, K. O’Brien, E. Hallahan, M. A. Khan, S. Purohit, U. S. Prashanth, E. Raff, et al. Pythia: A suite for analyzing large language models across training and scaling. In International Conference on Machine Learning, 2023.
- [2] A. Catalan-Tatjer, N. Ajroldi, and J. Geiping. Training dynamics impact post-training quantization robustness. arXiv preprint arXiv:2510.06213, 2026. https://arxiv.org/abs/2510.06213
- [3]
- [4] P. Foret, A. Kleiner, H. Mobahi, and B. Neyshabur. Sharpness-aware minimization for efficiently improving generalization, 2021. https://arxiv.org/abs/2010.01412
- [5] E. Frantar, S. Ashkboos, T. Hoefler, and D. Alistarh. GPTQ: Accurate post-training quantization for generative pre-trained transformers, 2023. https://arxiv.org/abs/2210.17323
- [6] L. Gao, S. Biderman, S. Black, L. Golding, T. Hoppe, C. Foster, J. Phang, H. He, A. Thite, N. Nabeshima, et al. The Pile: An 800GB dataset of diverse text for language modeling. arXiv preprint arXiv:2101.00027, 2020.
- [7] A. Gholami, S. Kim, Z. Dong, Z. Yao, M. W. Mahoney, and K. Keutzer. A survey of quantization methods for efficient neural network inference, 2021. https://arxiv.org/abs/2103.13630
- [8] S. Hochreiter and J. Schmidhuber. Flat minima. Neural Computation, 9(1):1–42, 1997. doi: 10.1162/neco.1997.9.1.1.
- [9] N. S. Keskar, D. Mudigere, J. Nocedal, M. Smelyanskiy, and P. T. P. Tang. On large-batch training for deep learning: Generalization gap and sharp minima, 2017. https://arxiv.org/abs/1609.04836
- [10]
- [11] J. Lin, J. Tang, H. Tang, S. Yang, W.-M. Chen, W.-C. Wang, G. Xiao, X. Dang, C. Gan, and S. Han. AWQ: Activation-aware weight quantization for on-device LLM compression and acceleration. In Proceedings of Machine Learning and Systems, 2024.
- [12] I. Loshchilov and F. Hutter. SGDR: Stochastic gradient descent with warm restarts. In International Conference on Learning Representations, 2017.
- [13] M. Nagel, M. Fournarakis, R. A. Amjad, Y. Bondarenko, M. van Baalen, and T. Blankevoort. A white paper on neural network quantization, 2021. https://arxiv.org/abs/2106.08295
- [14] J. Park, T. Lee, C. Yoon, H. Hwang, and J. Kang. Outlier-safe pre-training for robust 4-bit quantization of large language models, 2025.
- [15] J. Su, Y. Lu, S. Pan, B. Wen, and Y. Liu. RoFormer: Enhanced transformer with rotary position embedding. arXiv preprint arXiv:2104.09864, 2021.
- [16]
- [17] H. Yang, X. Yang, N. Z. Gong, and Y. Chen. HERO: Hessian-enhanced robust optimization for unifying and improving generalization and quantization performance, 2021. https://arxiv.org/abs/2111.11986