pith. machine review for the scientific record.

arxiv: 2604.13206 · v1 · submitted 2026-04-14 · 💻 cs.AI · cs.LG · cs.NA · math.NA

Recognition: unknown

Numerical Instability and Chaos: Quantifying the Unpredictability of Large Language Models

Alan Villarreal, Chashi Mahiul Islam, Mao Nishino, Shaeke Salman, Xiuwen Liu

Authors on Pith: no claims yet

Pith reviewed 2026-05-10 14:45 UTC · model grok-4.3

classification 💻 cs.AI · cs.LG · cs.NA · math.NA
keywords numerical instability · large language models · chaotic regimes · floating-point precision · transformer layers · error propagation · avalanche effect · output unpredictability

The pith

Large language models display three scale-dependent regimes of behavior driven by floating-point rounding errors.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tracks how small rounding errors move through transformer layers and either disappear, grow, or get overtaken by real input changes. It finds that this process creates an early-layer avalanche where tiny perturbations produce either total stability or rapid divergence. This leads to three universal regimes that depend on the size of the perturbation relative to the input: a stable one where errors vanish and outputs stay fixed, a chaotic one where rounding noise takes over, and a signal-dominated one where genuine input differences control the result. The analysis matters for agentic uses of LLMs because it ties unpredictability directly to the finite precision of ordinary floating-point arithmetic rather than to model design choices alone.
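To make the propagation picture concrete, here is a minimal sketch of layer-wise error tracking (an editorial illustration, not the authors' code: gpt2, the prompt, and the ϵ value are stand-ins). It perturbs the input embeddings by a fixed ϵ and reports the relative deviation of each hidden state, which is where an avalanche would show up as a jump across early layers:

    # Editorial sketch (assumed setup, not the paper's protocol): perturb the
    # input embeddings by eps and track relative deviation at every layer.
    import torch
    from transformers import AutoModel, AutoTokenizer

    torch.manual_seed(0)
    tok = AutoTokenizer.from_pretrained("gpt2")
    model = AutoModel.from_pretrained("gpt2").eval()

    ids = tok("The quick brown fox", return_tensors="pt").input_ids
    emb = model.get_input_embeddings()(ids)

    def hidden_states(e):
        with torch.no_grad():
            return model(inputs_embeds=e, output_hidden_states=True).hidden_states

    eps = 1e-3                                   # perturbation scale to sweep
    noise = torch.randn_like(emb)
    noise = eps * noise / noise.norm()

    for layer, (h0, h1) in enumerate(zip(hidden_states(emb),
                                         hidden_states(emb + noise))):
        rel = (h1 - h0).norm() / h0.norm()
        print(f"layer {layer:2d}: relative deviation {rel:.3e}")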

Core claim

Through systematic tracking of rounding-error propagation across transformer computation layers, the authors demonstrate that LLMs exhibit universal, scale-dependent chaotic behaviors characterized by three distinct regimes: a stable regime where perturbations below an input-dependent threshold vanish and produce constant outputs; a chaotic regime where rounding errors dominate and drive output divergence; and a signal-dominated regime where true input variations override numerical noise.
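A hedged way to probe the three regimes empirically (an editorial sketch, not the paper's setup: gpt2, the prompt, and the ϵ grid are illustrative, and generate with inputs_embeds assumes a reasonably recent transformers version): sweep the perturbation scale and check when the greedy continuation stops being identical to the unperturbed one.

    # Editorial sketch: sweep perturbation scale eps and see where greedy
    # output flips from identical (stable regime) to divergent.
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    torch.manual_seed(0)
    tok = AutoTokenizer.from_pretrained("gpt2")
    model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

    ids = tok("Numerical stability of transformers", return_tensors="pt").input_ids
    emb = model.get_input_embeddings()(ids)

    def greedy(e):
        out = model.generate(inputs_embeds=e, max_new_tokens=20,
                             do_sample=False, pad_token_id=tok.eos_token_id)
        return tok.decode(out[0])

    v = torch.randn_like(emb)
    v = v / v.norm()                             # unit perturbation direction

    base = greedy(emb)
    for exp in range(-12, -1, 2):                # eps = 1e-12 ... 1e-2
        eps = 10.0 ** exp
        same = greedy(emb + eps * v) == base
        print(f"eps=1e{exp:d}: {'identical (stable)' if same else 'diverged'}")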

What carries the argument

the avalanche effect in early layers, where minor perturbations trigger binary outcomes of either rapid amplification or complete attenuation

If this is right

  • Outputs remain identical for all inputs whose perturbations lie below the model-specific stable threshold.
  • In the chaotic regime, even sub-threshold rounding noise produces visibly different token sequences.
  • Genuine semantic or lexical changes in the input override rounding noise once the signal-dominated regime is reached.
  • The same three-regime structure appears across multiple datasets and model families.
  • The location of the regime boundaries scales with model size and input magnitude.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • Prompts that differ by amounts near the stable-chaotic boundary may produce unreliable results in production systems.
  • Using higher internal precision during inference could enlarge the stable regime and reduce unwanted divergence (see the sketch after this list).
  • Agentic pipelines that chain multiple LLM calls may accumulate errors more rapidly when any step lands in the chaotic regime.
  • The findings point to the possibility of input-dependent precision switching as a practical mitigation.
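On the precision point flagged above, a minimal sketch of the comparison such a mitigation would start from (gpt2 is a stand-in; the printed deviation is a coarse proxy, not the paper's metric):

    # Editorial sketch: same input through float32 and float64 copies of the
    # model; the gap between final hidden states is purely numerical.
    import torch
    from transformers import AutoModel, AutoTokenizer

    tok = AutoTokenizer.from_pretrained("gpt2")
    ids = tok("Precision and the stable regime", return_tensors="pt").input_ids

    m32 = AutoModel.from_pretrained("gpt2").eval()
    m64 = AutoModel.from_pretrained("gpt2").double().eval()

    with torch.no_grad():
        h32 = m32(ids).last_hidden_state.double()
        h64 = m64(ids).last_hidden_state

    print(f"relative float32-vs-float64 gap: {(h32 - h64).norm() / h64.norm():.3e}")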

Load-bearing premise

The observed divergence patterns arise primarily from floating-point rounding errors rather than from other uncontrolled sources such as attention dropout, sampling, or initialization differences.
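For this premise to hold, the other sources have to be pinned down. A minimal sketch of the standard PyTorch controls (the simulated rebuttal below names the same switches; the CUDA workspace variable is a known requirement of deterministic matmuls, not something this page states):

    # Editorial sketch: controls that remove non-rounding nondeterminism, so
    # residual divergence is attributable to floating-point arithmetic.
    import os, torch

    os.environ["CUBLAS_WORKSPACE_CONFIG"] = ":4096:8"  # needed for deterministic CUDA matmul
    torch.manual_seed(0)                               # fix RNG state everywhere
    torch.use_deterministic_algorithms(True)           # raise on nondeterministic kernels
    torch.backends.cudnn.benchmark = False             # no run-dependent kernel autotuning
    # plus: model.eval() to disable dropout, and greedy decoding
    # (do_sample=False) to remove sampling stochasticity.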

What would settle it

Re-running the experiments in arbitrary-precision arithmetic and finding that the three regimes and divergence patterns disappear would falsify the claim that rounding errors are the root driver.
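As a toy illustration of that logic (not the paper's experiment), the same sum evaluated in two orders disagrees in float32, and the gap collapses under 50-digit arithmetic via mpmath; that collapse is the signature the proposed falsification test looks for.

    # Toy illustration: reordering a float32 sum changes the result; at 50
    # significant digits the order dependence (i.e., rounding) disappears.
    import numpy as np
    from mpmath import mp, mpf

    vals = np.random.default_rng(0).standard_normal(10_000).astype(np.float32)

    fwd = np.float32(0.0)
    for v in vals:
        fwd += v
    rev = np.float32(0.0)
    for v in vals[::-1]:
        rev += v
    print("float32 order gap:", abs(float(fwd) - float(rev)))   # typically nonzero

    mp.dps = 50                                   # 50 significant digits
    hi_fwd = sum(mpf(float(v)) for v in vals)
    hi_rev = sum(mpf(float(v)) for v in vals[::-1])
    print("50-digit order gap:", abs(hi_fwd - hi_rev))          # ~0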

Figures

Figures reproduced from arXiv: 2604.13206 by Alan Villarreal, Chashi Mahiul Islam, Mao Nishino, Shaeke Salman, Xiuwen Liu.

Figure 1: Overview of effective directional number.

Figure 2: Layer-wise propagation profile at ϵ = 0.1. Directional structure is visible: higher-σ directions yield larger amplification than low-σ directions, consistent with a signal-dominated regime. This observation is not a claim about the ideal mathematical condition number (defined by a limit as ϵ → 0 for a real-valued function); instead, it characterizes the behavior of the finite-precision program under finite…

Figure 3: Layer-wise propagation profile at ϵ = 10⁻¹⁰. Directional structure collapses: multiple singular directions, coordinate directions, and a random direction exhibit similar growth. This supports an instability-dominated regime where effective sensitivity is governed by scale rather than singular value.

Figure 4: Micro-continuity analysis: consecutive differences.

Figure 5: Binary decision maps near a constructed near-tie point.

Figure 8: Directional number vs. ϵ at layer index 32 under different floating-point precisions. Precision shifts the ϵ ranges where plateau behavior and rapid growth appear, while preserving the overall scale dependence.

Figure 6: Directional stability in the (v1, v2) plane. (a) Maximum stable perturbation s_max(θ) varies by 4× across angles. (b) Cartesian projection reveals an asymmetric polygonal boundary. Geometric distortion demonstrates that stability is determined by input-dependent rounding dynamics, not by singular values. (Plot annotations: mean max s = 8.96e-11; median max s = 8.72e-11.)

Figure 7: Maximum stable perturbation magnitude along all 4096…

Figure 9: Absolute directional condition number convergence via…
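The quantity the captions call the effective directional number is, as far as this page shows, a finite-ϵ relative sensitivity. A self-contained sketch with an assumed normalization (the paper's exact definition isn't reproduced here), using a toy two-layer map in place of a transformer block:

    # Editorial sketch: finite-eps directional sensitivity of a map f at x
    # along direction v, with an assumed relative normalization.
    import torch

    def directional_number(f, x, v, eps):
        v = v / v.norm()
        y0, y1 = f(x), f(x + eps * v)
        return ((y1 - y0).norm() / y0.norm()) / (eps / x.norm())

    torch.manual_seed(0)
    W1, W2 = torch.randn(64, 64), torch.randn(64, 64)
    f = lambda z: W2 @ torch.tanh(W1 @ z)        # toy stand-in for a block

    x, v = torch.randn(64), torch.randn(64)
    for eps in (1e-1, 1e-4, 1e-7):
        print(f"eps={eps:.0e}: directional number "
              f"{directional_number(f, x, v, eps):.3e}")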
read the original abstract

As Large Language Models (LLMs) are increasingly integrated into agentic workflows, their unpredictability stemming from numerical instability has emerged as a critical reliability issue. While recent studies have demonstrated the significant downstream effects of these instabilities, the root causes and underlying mechanisms remain poorly understood. In this paper, we present a rigorous analysis of how unpredictability is rooted in the finite numerical precision of floating-point representations, tracking how rounding errors propagate, amplify, or dissipate through Transformer computation layers. Specifically, we identify a chaotic "avalanche effect" in the early layers, where minor perturbations trigger binary outcomes: either rapid amplification or complete attenuation. Beyond specific error instances, we demonstrate that LLMs exhibit universal, scale-dependent chaotic behaviors characterized by three distinct regimes: 1) a stable regime, where perturbations fall below an input-dependent threshold and vanish, resulting in constant outputs; 2) a chaotic regime, where rounding errors dominate and drive output divergence; and 3) a signal-dominated regime, where true input variations override numerical noise. We validate these findings extensively across multiple datasets and model architectures.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

3 major / 3 minor

Summary. The manuscript analyzes numerical instability in large language models arising from floating-point precision limits. It tracks the propagation of rounding errors through Transformer layers, identifying a chaotic avalanche effect in early layers and three distinct regimes of behavior: stable (perturbations vanish below a threshold), chaotic (rounding errors cause output divergence), and signal-dominated (true input variations dominate). The findings are validated across multiple datasets and architectures.

Significance. If the central claims hold after addressing experimental controls, this work could significantly advance understanding of LLM reliability issues in practical deployments. The extensive cross-dataset and cross-architecture validation is a strength, providing evidence for the universality of the observed behaviors, and the work highlights a potential root cause of unpredictability that could inform mitigation strategies.

major comments (3)
  1. [§4 (Experimental Setup)] The description of the validation experiments does not include controls to isolate floating-point rounding errors, such as enforcing deterministic algorithms, disabling dropout and sampling stochasticity, fixing all random seeds, or benchmarking against higher-precision (e.g., float64) computations. This is critical because the central claim attributes divergences in the chaotic regime to rounding-error propagation, yet without these isolations the patterns could stem from other non-deterministic sources.
  2. [§3 (Regime Identification)] There is no formal derivation or mathematical argument establishing why the avalanche effect must produce binary outcomes (full amplification or complete attenuation) rather than a continuum of behaviors. The three regimes appear to be defined from post-hoc observation of patterns rather than from first principles or controlled analysis.
  3. [Results and Validation] No quantitative error bars, confidence intervals, or statistical measures are reported for the regime classifications or divergence metrics across multiple runs or seeds, weakening the robustness of the 'universal' and 'scale-dependent' claims.
minor comments (3)
  1. [Abstract] The abstract mentions 'rigorous analysis' but the provided details focus on observational patterns without specifying the quantitative methods used to delineate regime boundaries.
  2. [Notation] The input-dependent threshold for the stable regime is mentioned but not given an explicit mathematical definition or formula (one candidate formalization follows this list).
  3. [Related Work] Additional references to existing literature on numerical precision effects in deep learning would help contextualize the novelty of the avalanche effect observation.
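One way the missing threshold could be written down, consistent with the s_max(θ) quantity in Figures 6 and 7 (an editorial formalization, not the paper's definition): for input x and unit direction v,

    s_max(x, v) = sup { s ≥ 0 : decode(x + s·v) = decode(x) }

where decode denotes the model's greedy output (the name is assumed for illustration). The stable regime is then the set of perturbations below s_max, and the regime boundaries are traced by how s_max varies with the direction v and the input scale.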

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback on our manuscript. We have addressed each major comment point by point below, making revisions to the manuscript where appropriate to strengthen the work.

read point-by-point responses
  1. Referee: §4 (Experimental Setup): The description of the validation experiments does not include controls to isolate floating-point rounding errors, such as enforcing deterministic algorithms, disabling dropout and sampling stochasticity, fixing all random seeds, or benchmarking against higher-precision (e.g., float64) computations. This is critical because the central claim attributes divergences in the chaotic regime to rounding error propagation, yet without these isolations the patterns could stem from other non-deterministic sources.

    Authors: We agree that isolating floating-point effects is essential for the validity of our central claims. In the revised manuscript, we have expanded §4 with a dedicated subsection on experimental controls. Specifically, we now enforce deterministic algorithms via torch.use_deterministic_algorithms(True), disable dropout and stochastic sampling by using eval mode with greedy decoding, fix all random seeds, and include direct comparisons against float64 precision computations. Results under these controls confirm that the regime behaviors and avalanche effect persist and are attributable to rounding errors. We have added new tables and figures documenting these controlled experiments. revision: yes

  2. Referee: §3 (Regime Identification): There is no formal derivation or mathematical argument establishing why the avalanche effect must produce binary outcomes (full amplification or complete attenuation) rather than a continuum of behaviors. The three regimes appear to be defined based on post-hoc observation of patterns rather than from first principles or controlled analysis.

    Authors: The regime identification is primarily empirical, derived from consistent patterns observed across extensive cross-architecture and cross-dataset experiments. A complete first-principles derivation of the binary avalanche (requiring full dynamical systems analysis of FP arithmetic in Transformers) is beyond the scope of this paper. However, we have revised §3 to include a more detailed mechanistic explanation: the binary outcome stems from whether perturbations exceed the effective precision threshold in early-layer matrix multiplications, leading to either amplification or attenuation in subsequent layers. We added layer-wise error propagation plots and controlled perturbation analyses to provide stronger support beyond pure post-hoc observation. revision: partial

  3. Referee: Results and Validation: No quantitative error bars, confidence intervals, or statistical measures are reported for the regime classifications or divergence metrics across multiple runs or seeds, weakening the robustness of the 'universal' and 'scale-dependent' claims.

    Authors: We acknowledge the absence of statistical quantification in the original submission. The revised manuscript now reports error bars (standard deviations), means, and 95% confidence intervals for all key metrics, including regime classification proportions and divergence measures. These are computed over 20 independent runs per configuration using different seeds under the deterministic controls described above. This addition substantiates the universality and scale-dependence claims with quantitative robustness measures. revision: yes
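For reference, the kind of aggregation response 3 describes, as a minimal sketch (the metric values are simulated placeholders, not the paper's data):

    # Editorial sketch: mean, std, and normal-approximation 95% CI over 20
    # seeded runs of some divergence metric (values simulated here).
    import numpy as np

    runs = np.random.default_rng(1).lognormal(size=20)   # stand-in metric
    mean, std = runs.mean(), runs.std(ddof=1)
    half_width = 1.96 * std / np.sqrt(len(runs))
    print(f"divergence: {mean:.3f} ± {half_width:.3f} (95% CI), std={std:.3f}")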

Circularity Check

0 steps flagged

No circularity; empirical measurement of observed regimes

full rationale

The paper's core contribution is an empirical analysis of rounding-error propagation through Transformer layers on existing models, with the three regimes (stable, chaotic, signal-dominated) defined directly from measured output divergence patterns rather than from any closed-form derivation or fitted parameter. No equations reduce the claimed avalanche effect or regime boundaries to their own inputs by construction, no self-citations are invoked as load-bearing uniqueness theorems, and no ansatz or renaming of known results is presented as a first-principles result. The work therefore remains self-contained against external benchmarks of floating-point behavior in neural networks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The analysis rests on the standard assumption that floating-point arithmetic introduces bounded rounding errors at each operation; no free parameters, ad-hoc axioms, or new invented entities are introduced in the abstract.

axioms (1)
  • [standard math] Floating-point representations have finite precision and therefore introduce rounding errors at each arithmetic operation.
    Invoked implicitly when the authors state that unpredictability is rooted in finite numerical precision.
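For the record, the standard rounding model behind this axiom (a fact of IEEE-754 arithmetic, not something this page derives):

    fl(a ∘ b) = (a ∘ b)(1 + δ),   |δ| ≤ u,   ∘ ∈ {+, −, ×, ÷}

where u is the unit roundoff: u = 2⁻²⁴ ≈ 6.0×10⁻⁸ for float32, u = 2⁻⁵³ ≈ 1.1×10⁻¹⁶ for float64, and u = 2⁻⁸ ≈ 3.9×10⁻³ for bfloat16.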

pith-pipeline@v0.9.0 · 5509 in / 1239 out tokens · 32822 ms · 2026-05-10T14:45:19.115299+00:00 · methodology

discussion (0)

