pith. machine review for the scientific record.

arxiv: 2605.08894 · v1 · submitted 2026-05-09 · 💻 cs.CL · cs.AI

Recognition: 2 Lean theorem links

Fitting Is Not Enough: Smoothness in Extremely Quantized LLMs

Pengzhan Li, Wanxiang Che, Xu Han, Yuxuan Li, Yuzhuang Xu

Authors on Pith · no claims yet

Pith reviewed 2026-05-12 01:14 UTC · model grok-4.3

classification 💻 cs.CL cs.AI
keywords LLM quantization · extreme quantization · smoothness degradation · token candidates · decoding tree · generation quality · post-training quantization · quantization-aware training

The pith

Extremely quantized LLMs lose generation quality through smoothness degradation that is distinct from numerical error.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper argues that when large language models are pushed to very low bit-width quantization, their performance declines not solely from numerical inaccuracies in the computations but also from a separate loss of smoothness in how predictions are distributed. This degradation shrinks the pool of likely next tokens around any given sequence context, which thins out the paths available during decoding and lowers overall text quality. The authors track the effect with a smoothness measure and neighborhood analysis, then test a simple principle that preserves smoothness during both post-training quantization and quantization-aware training, finding measurable gains on top of accuracy-only approaches. Their central point is that extreme quantization methods must treat smoothness preservation as a distinct design goal to avoid compounding quality losses.
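To make the neighborhood analysis concrete, here is a minimal sketch of what counting "effective token candidates" could look like. It is not the paper's own definition: it uses two generic proxies, the nucleus size (how many top tokens are needed to cover 90% of probability mass) and the exponentiated entropy, on synthetic next-token distributions, with a temperature knob standing in for the sharpening the paper attributes to extreme quantization. All names and numbers below are illustrative assumptions.

```python
import numpy as np

def softmax(logits, temperature=1.0):
    """Turn logits into a probability distribution at a given temperature."""
    z = logits / temperature
    z = z - z.max()
    p = np.exp(z)
    return p / p.sum()

def nucleus_size(probs, mass=0.9):
    """Smallest number of top tokens whose cumulative probability reaches `mass`."""
    sorted_p = np.sort(probs)[::-1]
    return int(np.searchsorted(np.cumsum(sorted_p), mass) + 1)

def exp_entropy(probs):
    """Perplexity-style candidate count: exp of the distribution's entropy."""
    nz = probs[probs > 0]
    return float(np.exp(-(nz * np.log(nz)).sum()))

rng = np.random.default_rng(0)
logits = rng.normal(size=32_000)  # synthetic next-token logits over a 32k vocabulary

# Lower temperature sharpens the distribution; here it is only a stand-in for the
# smoothness loss the paper attributes to extreme quantization.
for t in (1.0, 0.5, 0.25):
    p = softmax(logits, temperature=t)
    print(f"temperature={t:.2f}  nucleus@0.9={nucleus_size(p):6d}  "
          f"exp-entropy={exp_entropy(p):8.1f}")
```

Sharper distributions concentrate mass on fewer tokens, so both proxies shrink, which is the qualitative pattern the paper links to sparser decoding trees.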

Core claim

Extremely quantized LLMs suffer from systematic smoothness degradation beyond numerical precision loss. Measured through a smoothness proxy, this degradation grows increasingly severe as the quantization bit-width decreases. Sequence neighborhood modeling shows that quantized models exhibit a rapid reduction of effective token candidates within the prediction neighborhood, which directly leads to a sparser decoding tree and degraded generation quality. Introducing a simple smoothness-preserving principle in both post-training quantization and quantization-aware training demonstrates that preserving smoothness brings additional gains beyond numerical accuracy.

What carries the argument

The smoothness proxy paired with sequence neighborhood modeling that counts the shrinking set of effective token candidates around each prediction point.
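Figure 2 describes the smoothness proxy as an approximate Lipschitz constant, i.e. an expected input gradient. The sketch below shows that style of measurement on a toy stand-in network; the paper's exact formulation, model, and input distribution are not reproduced here, so the module, sample count, and gradient reduction are all assumptions.

```python
import torch
import torch.nn as nn

# Toy stand-in for a model head: context embedding -> next-token logits.
# The paper measures full LLMs; this only illustrates the style of measurement.
vocab_size, hidden = 1000, 64
model = nn.Sequential(nn.Linear(hidden, 128), nn.GELU(), nn.Linear(128, vocab_size))

def expected_input_gradient_norm(model, n_samples=256):
    """Approximate Lipschitz proxy: average norm of d(logits)/d(input).

    A larger value means small input perturbations move the next-token
    prediction a lot, i.e. the mapping is less smooth.
    """
    total = 0.0
    for _ in range(n_samples):
        x = torch.randn(1, hidden, requires_grad=True)  # synthetic context embedding
        logits = model(x)
        # log-sum-exp gives one scalar per sample so a single backward pass
        # yields the input gradient while keeping every logit's contribution.
        torch.logsumexp(logits, dim=-1).sum().backward()
        total += x.grad.norm().item()
    return total / n_samples

print(f"smoothness proxy (expected input-gradient norm): "
      f"{expected_input_gradient_norm(model):.4f}")
```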

If this is right

  • Smoothness degradation grows worse as bit-width is reduced further.
  • Fewer effective token candidates in local neighborhoods produce measurably sparser decoding trees (see the sketch after this list).
  • Adding the smoothness-preserving principle to existing quantization pipelines yields quality gains that numerical accuracy improvements alone do not deliver.
  • Future extreme quantization algorithms should treat smoothness preservation as an explicit objective alongside numerical fidelity.
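As a toy illustration of the candidate-to-tree link in the second bullet above (not the paper's procedure): expand a decoding tree only along tokens whose conditional probability clears a fixed threshold, using synthetic distributions whose sharpness stands in for quantization-induced smoothness loss. The vocabulary size, threshold, and depth are arbitrary assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def toy_next_token_probs(temperature, vocab=50):
    """Fake next-token distribution; lower temperature = sharper, standing in
    for the less smooth distributions of an extremely quantized model."""
    logits = rng.normal(size=vocab) / temperature
    p = np.exp(logits - logits.max())
    return p / p.sum()

def count_tree_paths(temperature, threshold=0.05, depth=3):
    """Count root-to-leaf paths in a decoding tree that only expands tokens
    whose conditional probability clears `threshold` at each step."""
    def expand(d):
        if d == 0:
            return 1
        probs = toy_next_token_probs(temperature)
        candidates = np.flatnonzero(probs >= threshold)
        if candidates.size == 0:
            candidates = np.array([probs.argmax()])  # keep at least the greedy token
        return sum(expand(d - 1) for _ in candidates)
    return expand(depth)

for t in (1.0, 0.4, 0.2):
    print(f"temperature={t:.2f}  depth-3 tree paths: {count_tree_paths(t)}")
```

Fewer above-threshold candidates per step multiply into far fewer root-to-leaf paths, which is the sparsity that bullet describes.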

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • The same smoothness loss may appear in other compression methods such as pruning or low-rank adaptation if they also flatten local prediction distributions.
  • Architectural changes that encourage smoother output distributions could make models more tolerant of extreme quantization from the start.
  • Smoothness metrics could be added to standard evaluation suites for compressed models to catch quality drops not visible in perplexity alone.

Load-bearing premise

The chosen smoothness proxy and sequence neighborhood modeling correctly isolate a degradation mechanism that is causally responsible for reduced generation quality independent of numerical error.

What would settle it

A controlled comparison in which a quantization method augmented with the smoothness-preserving principle produces no improvement in generation quality metrics over a numerically optimized baseline would falsify the independence of the smoothness effect.

Figures

Figures reproduced from arXiv: 2605.08894 by Pengzhan Li, Wanxiang Che, Xu Han, Yuxuan Li, Yuzhuang Xu.

Figure 1. Smoothness degradation in extreme quantization. Smoothness score distributions of GPTQ …
Figure 2. Approximate Lipschitz constant, i.e. expected input gradient …
Figure 3. Key definitions in sequence neighborhood modeling.
Figure 4. Results of rPPL-1 to rPPL-40 on original and quantized LLaMA-2-7B models. For GPTQ, …
Figure 5. Illustrative decoding trees of original and quantized models. For any given node, darker …
Figure 6. The row-wise forward pass and the column-wise backward pass …
Figure 7. To investigate this phenomenon, we compare layer-wise gradients between a BF16 model …
Figure 8. Forward / backward preservation (legend: accuracy, smoothness; 4-bit, 3-bit, 2-bit) …
Figure 10. Comparison of fitting optimization and smoothness optimization. Given a prompt, the …
Figure 11. Evolution of Cavg during training across different models and settings. We use a 0.4B model in this experiment. (b) Is smoother training always better? In QAT, maximizing smoothness is not unconditionally beneficial. Moreover, it is unrealistic to expect a ternary model to achieve parity with a full-precision model in both smoothness and performance simultaneously. The core idea of this work is to identify t…
Figure 12. Evolution of Cavg during training when integrating RMSNorm or freezing the word embedding (annotation: abnormal gradient channel) …
Figure 13. Important/abnormal gradient channels and relaxed optimization.
Figure 14. Comparison of zero-shot accuracy under different values of the coefficient …
Original abstract

Large language models (LLMs) achieve strong performance but incur high deployment costs, motivating extremely low-bit but lossy quantization. Existing quantization algorithms mainly focus on improving the numerical accuracy of forward computation to eliminate performance degradation. In this paper, we show that extremely quantized LLMs suffer from systematic smoothness degradation beyond numerical precision loss. Through a smoothness proxy, we observe that such degradation becomes increasingly severe as the quantization bit-width decreases. Furthermore, based on sequence neighborhood modeling, we find that quantized models exhibit a rapid reduction of effective token candidates within the prediction neighborhood, which directly leads to a sparser decoding tree and degraded generation quality. To validate it, we introduce a simple smoothness-preserving principle in both post-training quantization and quantization-aware training, and demonstrate that preserving smoothness brings additional gains beyond numerical accuracy. The core goal of this paper is to highlight smoothness preservation as an important design consideration for future extreme quantization methods. Code is available at https://github.com/xuyuzhuang11/FINE.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper claims that extremely quantized LLMs suffer from systematic smoothness degradation beyond numerical precision loss. Using a smoothness proxy, the authors observe that this degradation worsens as bit-width decreases. Via sequence neighborhood modeling, they report a rapid reduction in effective token candidates within the prediction neighborhood, which they argue directly produces a sparser decoding tree and lower generation quality. They introduce a simple smoothness-preserving principle for both post-training quantization and quantization-aware training, and show that it yields additional performance gains beyond those from numerical accuracy improvements alone. The core contribution is positioning smoothness preservation as an important new design consideration for extreme quantization methods.

Significance. If the claimed causal mechanism holds independently of numerical error, the work identifies a previously under-emphasized factor that could meaningfully improve the quality of extremely low-bit LLMs without additional compute. The public code release is a clear strength that enables direct verification and extension. The result, if substantiated, would shift quantization research toward joint optimization of numerical fit and distributional smoothness.

major comments (3)
  1. Section 3 (sequence neighborhood modeling): the assertion that reduced effective token candidates 'directly leads' to a sparser decoding tree and degraded generation quality is supported only by correlational trends across bit-widths. No controlled experiment is described that holds numerical forward-pass accuracy fixed while varying the neighborhood structure or smoothness proxy; without such isolation, the causal claim that smoothness degradation operates independently of numerical precision loss remains unproven.
  2. Section 4 (smoothness-preserving principle): the reported gains in PTQ and QAT are presented as evidence that smoothness preservation adds value beyond numerical accuracy. However, the manuscript does not include ablations that match the numerical error of the baseline while applying the principle, nor does it quantify how much of the intervention's effect is on forward-pass error versus on the neighborhood metrics. This leaves open the possibility that the gains are downstream of improved numerical fit rather than an independent smoothness mechanism.
  3. Experimental protocol (throughout results): the abstract and main text report consistent observations and gains from the proxy and principle, yet provide no error bars, variance estimates across random seeds, or detailed ablation tables. This absence makes it impossible to judge the statistical reliability of the claimed separation between smoothness effects and numerical precision effects.
minor comments (2)
  1. The smoothness proxy and sequence neighborhood definitions would benefit from explicit equations or pseudocode in the main text (rather than only in the appendix or code) to improve clarity and reproducibility.
  2. Figure captions and axis labels could more explicitly indicate which curves correspond to the smoothness-preserving intervention versus standard quantization baselines.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. The comments highlight important opportunities to strengthen the causal evidence and statistical reporting for our claims about smoothness degradation in extremely quantized LLMs. We address each major comment below and will incorporate the suggested improvements in the revised manuscript.

Point-by-point responses
  1. Referee: Section 3 (sequence neighborhood modeling): the assertion that reduced effective token candidates 'directly leads' to a sparser decoding tree and degraded generation quality is supported only by correlational trends across bit-widths. No controlled experiment is described that holds numerical forward-pass accuracy fixed while varying the neighborhood structure or smoothness proxy; without such isolation, the causal claim that smoothness degradation operates independently of numerical precision loss remains unproven.

    Authors: We agree that the primary evidence in the current manuscript consists of correlational trends across bit-widths. The sequence neighborhood modeling shows a clear reduction in effective token candidates as smoothness degrades, which provides a mechanistic explanation for the sparser decoding tree. To isolate the effect more rigorously, we will add a controlled experiment in the revision that applies targeted perturbations to the smoothness proxy while monitoring and holding numerical forward-pass accuracy constant, thereby demonstrating smoothness's independent contribution to decoding sparsity and generation quality. revision: partial

  2. Referee: Section 4 (smoothness-preserving principle): the reported gains in PTQ and QAT are presented as evidence that smoothness preservation adds value beyond numerical accuracy. However, the manuscript does not include ablations that match the numerical error of the baseline while applying the principle, nor does it quantify how much of the intervention's effect is on forward-pass error versus on the neighborhood metrics. This leaves open the possibility that the gains are downstream of improved numerical fit rather than an independent smoothness mechanism.

    Authors: We acknowledge that the current presentation does not fully isolate the contributions. The smoothness-preserving principle is designed to act on distributional properties rather than directly minimizing numerical error. In the revision, we will include new ablations that explicitly match numerical error between the baseline and our method (e.g., by adjusting quantization parameters to equalize forward-pass metrics) and will quantify the differential effects on neighborhood metrics versus numerical accuracy to clarify the independent role of smoothness preservation. revision: partial

  3. Referee: Experimental protocol (throughout results): the abstract and main text report consistent observations and gains from the proxy and principle, yet provide no error bars, variance estimates across random seeds, or detailed ablation tables. This absence makes it impossible to judge the statistical reliability of the claimed separation between smoothness effects and numerical precision effects.

    Authors: We agree that the absence of error bars and variance reporting limits the ability to assess statistical reliability. We will revise all experimental sections to report means and standard deviations across multiple random seeds, include variance estimates for key metrics, and expand the ablation tables with statistical details to support evaluation of the separation between smoothness and numerical effects. revision: yes

Circularity Check

0 steps flagged

No significant circularity

full rationale

The paper defines a smoothness proxy as an independent diagnostic tool, uses sequence neighborhood modeling to observe token candidate reduction, and then applies a separate smoothness-preserving intervention in PTQ and QAT to obtain gains beyond numerical accuracy. No derivation step reduces a claimed prediction or result to its own inputs by construction, no load-bearing premise rests on self-citation, and no ansatz or uniqueness claim is smuggled in. The central chain remains empirically additive rather than tautological.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central claims rest on the validity of a smoothness proxy and sequence neighborhood modeling as faithful indicators of generation degradation; these are domain assumptions rather than derived quantities.

axioms (2)
  • domain assumption The smoothness proxy isolates degradation beyond numerical precision loss
    Invoked to separate the new phenomenon from standard quantization error.
  • domain assumption Reduction in effective token candidates directly produces sparser decoding trees and lower generation quality
    Used to connect the neighborhood observation to practical output degradation.

pith-pipeline@v0.9.0 · 5477 in / 1337 out tokens · 60558 ms · 2026-05-12T01:14:28.362476+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

Reference graph

Works this paper leans on

46 extracted references · 46 canonical work pages · 4 internal anchors
