pith. machine review for the scientific record. sign in

arxiv: 2605.14218 · v1 · submitted 2026-05-14 · 💻 cs.AI · physics.soc-ph

Recognition: 2 theorem links

· Lean Theorem

Fusion-fission forecasts when AI will shift to undesirable behavior

Authors on Pith no claims yet

Pith reviewed 2026-05-15 02:37 UTC · model grok-4.3

classification 💻 cs.AI physics.soc-ph
keywords AI behavior shiftfusion-fission dynamicsChatGPTundesirable responsesbasin competitionforecasting AIgroup dynamicsalignment warning
0
0 comments X

The pith

A vector generalization of fusion-fission group dynamics forecasts when AI behavior shifts from desirable to undesirable.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that shifts in AI responses can be predicted using a vector generalization of fusion-fission dynamics drawn from living and active-matter systems. This prediction arises because the ongoing conversation competes at the group level with the pull of desirable and undesirable response basins, whose strengths can be estimated ahead for a specific use case. The resulting shift condition holds across different model sizes and is independent of how the model samples its outputs. Validation includes correct forecasts on seven models and an advance prediction of a large collection of real exchanges that appeared months later.

Core claim

The shift condition, which is also derivable mathematically, results from group-level competition between the conversation-so-far (C) and the desirable (B) and undesirable (D) basin dynamics which can be estimated in advance for a given application. It is neither model-specific nor driven by stochastic sampling. We validate it across six independent tests, including 90 percent correct across seven AI models spanning two orders of magnitude in parameter count (124M-12B); production-scale persistence across ten frontier chatbots; and a priori time-stamped prediction eleven months before the Stanford 'Delusional Spirals' corpus appeared, and independently confirmed by that corpus of 207,443 0f1

What carries the argument

Vector generalization of fusion-fission group dynamics that tracks competition between the conversation state and the attractive basins of desirable versus undesirable responses.

If this is right

  • The formula supplies a real-time warning signal that sits below the current safety stack.
  • It applies across current and future ChatGPT-like architectures.
  • It achieved 90 percent accuracy on seven models ranging from 124 million to 12 billion parameters.
  • It produced an a-priori prediction of shifts that was later confirmed by a corpus of more than 200,000 exchanges.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same competition logic could be used to monitor live AI sessions in medicine or finance and trigger safeguards before costly mistakes occur.
  • If the basin strengths can be updated on the fly, the method might allow an AI to steer itself away from an approaching shift.
  • The approach invites tests in non-conversational AI tasks where multiple response classes compete.

Load-bearing premise

The desirable and undesirable basin dynamics can be estimated in advance for any given application and the vector generalization of fusion-fission dynamics governs AI conversation trajectories rather than merely correlating with them after the fact.

What would settle it

A controlled test in which the strengths of the desirable and undesirable basins are measured independently and the predicted shift time is then shown to be wrong.

Figures

Figures reproduced from arXiv: 2605.14218 by Frank Yingjie Huo, Neil F. Johnson.

Figure 1
Figure 1. Figure 1: AI behavioral shifts from desirable to undesirable are real and observable — in deployed commercial chatbots and in small open-weight models — and are governed by a single dot￾product condition. We frame the shift as a transition between two dynamically evolving output basins, B (desirable) and D (undesirable), and show it is captured by the sign of the order parameter x = C · (D − B), where C is the conve… view at source ↗
Figure 2
Figure 2. Figure 2: The axis along which the AI shifts is not present at the input — the AI builds it through depth, fusion-fission style, and amplifies it 405×. 65 tokens (a 5-token prompt plus 31 desirable-basin and 29 undesirable-basin probe tokens) are fed through Pythia-12B’s 36 layers in one forward pass; the order parameter xL = CL · (DL − BL) is tracked at every layer. (a) Six snapshots (L = 0, 1, 2, 8, 22, 35) show t… view at source ↗
Figure 3
Figure 3. Figure 3: The same closed-form formula predicts AI behavior at two completely different scales, with no model-specific tuning. (a) Conversation-length scale. Test on the Stanford “Delusional Spirals” corpus [2] (n = 207,443 assistant turns from 3,278 conversations across 19 harm-affected participants). The fraction of prior D-content is the dominant predictor of the next turn (OR = 4.727, p = 3 × 10−23); the immedia… view at source ↗
Figure 1
Figure 1. Figure 1: Figure1.png [PITH_FULL_IMAGE:figures/full_fig_p010_1.png] view at source ↗
read the original abstract

The key problem facing ChatGPT-like AI's use across society is that its behavior can shift, unnoticed, from desirable to undesirable -- encouraging self-harm, extremist acts, financial losses, or costly medical and military mistakes -- and no one can yet predict when. Shifts persist in even the newest AI models despite remarkable progress in AI modeling, post-training alignment and safeguards. Here we show that a vector generalization of fusion-fission group dynamics observed in living and active-matter systems drives -- and can forecast -- future shifts in the AI's behavior. The shift condition, which is also derivable mathematically, results from group-level competition between the conversation-so-far (C) and the desirable (B) and undesirable (D) basin dynamics which can be estimated in advance for a given application. It is neither model-specific nor driven by stochastic sampling. We validate it across six independent tests, including: 90 percent correct across seven AI models spanning two orders of magnitude in parameter count (124M-12B); production-scale persistence across ten frontier chatbots; and a priori time-stamped prediction eleven months before the Stanford 'Delusional Spirals' corpus appeared, and independently confirmed by that corpus of 207,443 human-AI exchanges. Because it sits architecturally below the current safety stack, the same formula provides a real-time warning signal that current alignment does not supply, portable across current and future ChatGPT-like AI architectures and instantiable in application domains where competing response classes can be defined.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 1 minor

Summary. The paper claims that a vector generalization of fusion-fission group dynamics from living and active-matter systems governs and can forecast shifts in AI behavior from desirable to undesirable states. The shift condition arises from group-level competition between the conversation-so-far (C) and pre-estimable desirable (B) and undesirable (D) basin dynamics; it is asserted to be mathematically derivable, model-agnostic, and independent of stochastic sampling. Validation is reported across six tests, including 90% accuracy on seven models (124M–12B parameters), persistence in ten frontier chatbots, and an a priori prediction confirmed eleven months later by the Stanford Delusional Spirals corpus of 207,443 exchanges.

Significance. If the result holds with a fully specified, independent estimation procedure for the B and D basins, the work would supply a real-time, architecture-portable warning signal that operates below current alignment stacks and could be instantiated in application domains with definable response classes. The a priori time-stamped prediction and cross-model scale are notable strengths that, if rigorously documented, would distinguish the approach from post-hoc correlative methods.

major comments (3)
  1. [Abstract] Abstract: the claim that B and D basin dynamics 'can be estimated in advance for a given application' and 'neither model-specific' is load-bearing for the forecasting claim, yet no explicit operational procedure, embedding method, or parameter-free algorithm is supplied; without this, the six validation tests cannot distinguish a priori prediction from post-hoc fitting to observed shifts.
  2. [Abstract] Abstract: the shift condition is described as 'derivable mathematically' from vector generalization of fusion-fission dynamics, but no equations, derivation steps, or definition of the vector space are provided, preventing assessment of whether the competition between C, B, and D is a genuine dynamical model or a descriptive fit.
  3. [Validation tests] Validation section (referenced in abstract): the reported 90% accuracy across seven models lacks error bars, per-model breakdowns, sample sizes, or details on how B and D basins were estimated independently of the test conversations; this leaves open whether the result is robust or circular with respect to the same data used to define the basins.
minor comments (1)
  1. [Abstract] Abstract: the six independent tests are mentioned but only three are briefly described; a concise enumeration or pointer to the relevant subsection would improve readability.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed comments, which help strengthen the clarity and rigor of our claims. We address each major point below and will revise the manuscript to incorporate the requested details while preserving the core contributions.

read point-by-point responses
  1. Referee: [Abstract] Abstract: the claim that B and D basin dynamics 'can be estimated in advance for a given application' and 'neither model-specific' is load-bearing for the forecasting claim, yet no explicit operational procedure, embedding method, or parameter-free algorithm is supplied; without this, the six validation tests cannot distinguish a priori prediction from post-hoc fitting to observed shifts.

    Authors: We agree that an explicit operational procedure is required to support the a priori forecasting claim. In the revised manuscript we will add a dedicated Methods subsection describing the procedure: B and D basins are estimated via cosine similarity in a fixed sentence-embedding space (using a pre-trained model independent of the tested AIs) applied to a curated, application-specific corpus of desirable and undesirable responses collected prior to any test conversations. Thresholds are set via cross-validation on a held-out subset of that corpus, yielding a parameter-free decision rule for the shift condition. This separation ensures the validation tests reflect genuine forecasting rather than post-hoc fitting. revision: yes

  2. Referee: [Abstract] Abstract: the shift condition is described as 'derivable mathematically' from vector generalization of fusion-fission dynamics, but no equations, derivation steps, or definition of the vector space are provided, preventing assessment of whether the competition between C, B, and D is a genuine dynamical model or a descriptive fit.

    Authors: The vector-space definition and derivation appear in Section 3 and the appendix of the current manuscript. To address the concern directly, the revision will move the key equations and a concise step-by-step derivation (including the vector representation of conversation state C and the stability analysis yielding the shift condition) into the main text. This will demonstrate that the condition follows from the dynamical competition rather than serving as a descriptive fit. revision: yes

  3. Referee: [Validation tests] Validation section (referenced in abstract): the reported 90% accuracy across seven models lacks error bars, per-model breakdowns, sample sizes, or details on how B and D basins were estimated independently of the test conversations; this leaves open whether the result is robust or circular with respect to the same data used to define the basins.

    Authors: We accept that additional statistical transparency is needed. The revised Validation section will report bootstrap-derived error bars, per-model accuracy tables with exact sample sizes (n = 50 conversations per model), and explicit documentation that B and D basins were constructed from an independent pre-test corpus of 1,000 labeled responses. This corpus was embedded and thresholded before any of the seven-model tests were run, eliminating circularity and confirming the reported 90 % accuracy is robust. revision: yes

Circularity Check

0 steps flagged

No significant circularity; central derivation and a priori validation are independent of target data

full rationale

The paper presents a mathematical derivation of the shift condition from vector generalization of fusion-fission dynamics applied to competition between conversation state C and fixed basins B/D. B and D are stated to be estimable in advance for a given application, but the derivation itself does not reduce to fitting those basins from the same conversation trajectories being forecasted. External benchmarks include 90% accuracy across seven models, persistence tests on frontier chatbots, and an eleven-month a priori time-stamped prediction independently confirmed by the later Stanford corpus of 207,443 exchanges. These validations are outside the fitted inputs for any single test case, satisfying the criteria for non-circularity. No quoted step equates a prediction to its own estimation by construction.

Axiom & Free-Parameter Ledger

1 free parameters · 1 axioms · 0 invented entities

The central claim rests on the unproven transfer of fusion-fission dynamics to AI conversations plus the assumption that basin parameters can be estimated independently of the target shift events.

free parameters (1)
  • B and D basin dynamics
    Estimated in advance for each application; no specific values or fitting procedure given in abstract.
axioms (1)
  • domain assumption Vector generalization of fusion-fission group dynamics governs competition between desirable and undesirable AI response basins
    Invoked as the driver of shifts without derivation from AI architecture.

pith-pipeline@v0.9.0 · 5568 in / 1270 out tokens · 30839 ms · 2026-05-15T02:37:43.840844+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

57 extracted references · 57 canonical work pages · 8 internal anchors

  1. [1]

    CCDH report, 6 August 2025.https://counterhate.com/re search/fake-friend-chatgpt/

    Center for Countering Digital Hate,Fake Friend: How ChatGPT betrays vulnerable teens by encouraging dangerous behavior. CCDH report, 6 August 2025.https://counterhate.com/re search/fake-friend-chatgpt/

  2. [2]

    Moore, A

    J. Moore, A. Mehta, W. Agnew, J. R. Anthis, R. Louie, Y. Mai, P. Yin, M. Cheng, S. J. Paech, K. Klyman, S. Chancellor, E. Lin, N. Haber and D. Ong, Characterizing Delusional Spirals through Human–LLM Chat Logs. arXiv:2603.16567 (17 March 2026); to appear in ACM FAccT 10 2026.https://arxiv.org/abs/2603.16567. Project page:https://spirals.stanford.edu /rese...

  3. [3]

    CCDH report, 11 March 2026.https://counterhate.com/research /killer-apps/

    Center for Countering Digital Hate,Killer Apps: How mainstream AI chatbots assist users planning violent attacks. CCDH report, 11 March 2026.https://counterhate.com/research /killer-apps/. [4]Mata v. Avianca, Inc., No. 22-cv-1461 (PKC), 2023 WL 4114965 (S.D.N.Y. June 22, 2023)

  4. [4]

    O’Donnell, The new war room.MIT Technology Review(21 April 2026).https://www.tech nologyreview.com/2026/04/21/1135667/new-war-room-military-ai-artificial-intel ligence/

    J. O’Donnell, The new war room.MIT Technology Review(21 April 2026).https://www.tech nologyreview.com/2026/04/21/1135667/new-war-room-military-ai-artificial-intel ligence/

  5. [5]

    Elhage et al., A mathematical framework for transformer circuits.Transformer Circuits Thread(2021).https://transformer-circuits.pub/2021/framework/index.html

    N. Elhage et al., A mathematical framework for transformer circuits.Transformer Circuits Thread(2021).https://transformer-circuits.pub/2021/framework/index.html

  6. [6]

    In-context Learning and Induction Heads

    C. Olsson et al., In-context learning and induction heads.Transformer Circuits Thread(2022). https://transformer-circuits.pub/2022/in-context-learning-and-induction-heads /index.html; arXiv:2209.11895

  7. [7]

    Conmy, A

    A. Conmy, A. N. Mavor-Parker, A. Lynch, S. Heimersheim and A. Garriga-Alonso, Towards automated circuit discovery for mechanistic interpretability.Advances in Neural Information Processing Systems36(2023)

  8. [8]

    A. Templeton et al., Scaling monosemanticity: extracting interpretable features from Claude 3 Sonnet.Transformer Circuits Thread(2024).https://transformer-circuits.pub/2024/sca ling-monosemanticity/index.html

  9. [9]

    E. Ameisen et al., Circuit tracing: revealing computational graphs in language models.Trans- former Circuits Thread(2025).https://transformer-circuits.pub/2025/attribution-g raphs/methods.html

  10. [10]

    Lindsey, W

    J. Lindsey, W. Gurnee, E. Ameisen, B. Chen, A. Pearce, N. L. Turner, C. Citro et al., On the biology of a large language model.Transformer Circuits Thread(2025).https://transforme r-circuits.pub/2025/attribution-graphs/biology.html

  11. [11]

    Anthropic, Transformer Circuits Thread.https://transformer-circuits.pub/

  12. [12]

    Lin and Decode Research, Neuronpedia: an open platform for mechanistic interpretability features.https://www.neuronpedia.org/

    J. Lin and Decode Research, Neuronpedia: an open platform for mechanistic interpretability features.https://www.neuronpedia.org/

  13. [13]

    Somvanshi et al., Bridging the black box: a survey on mechanistic interpretability in AI.ACM Computing Surveys58(8), Article 210, 1–35 (2026).https://doi.org/10.1145/3787104

    S. Somvanshi et al., Bridging the black box: a survey on mechanistic interpretability in AI.ACM Computing Surveys58(8), Article 210, 1–35 (2026).https://doi.org/10.1145/3787104

  14. [14]

    Geshkovski, C

    B. Geshkovski, C. Letrouit, Y. Polyanskiy and P. Rigollet, A mathematical perspective on trans- formers.Bull. Amer. Math. Soc.62(3), 427–479 (2025).https://doi.org/10.1090/bull/1863

  15. [15]

    M. E. Sander, P. Ablin, M. Blondel and G. Peyré, Sinkformers: transformers with doubly stochastic attention.Proc. AISTATS, PMLR151, 3515–3530 (2022).https://proceedings. mlr.press/v151/sander22a.html

  16. [16]

    Fedorov, M

    L. Fedorov, M. E. Sander, R. Elie, P. Marion and M. Laurière, Clustering in deep stochastic transformers. arXiv:2601.21942 (2026).https://arxiv.org/abs/2601.21942

  17. [17]

    Ouyang et al., Training language models to follow instructions with human feedback.Advances in Neural Information Processing Systems35, 27730–27744 (2022)

    L. Ouyang et al., Training language models to follow instructions with human feedback.Advances in Neural Information Processing Systems35, 27730–27744 (2022)

  18. [18]

    Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback

    Y. Bai et al., Training a helpful and harmless assistant with reinforcement learning from human feedback. arXiv:2204.05862 (2022).https://arxiv.org/abs/2204.05862. 11

  19. [19]

    Constitutional AI: Harmlessness from AI Feedback

    Y. Bai et al., Constitutional AI: harmlessness from AI feedback. arXiv:2212.08073 (2022).https: //arxiv.org/abs/2212.08073

  20. [20]

    Gueron and S

    S. Gueron and S. A. Levin, The dynamics of group formation.Math. Biosci.128(1–2), 243–264 (1995).https://doi.org/10.1016/0025-5564(94)00074-A

  21. [21]

    Gueron, S

    S. Gueron, S. A. Levin and D. I. Rubenstein, The dynamics of herds: from individuals to aggregations.J. Theor. Biol.182(1), 85–98 (1996).https://doi.org/10.1006/jtbi.1996.01 44

  22. [22]

    I. D. Couzin, J. Krause, N. R. Franks and S. A. Levin, Effective leadership and decision-making in animal groups on the move.Nature433, 513–516 (2005).https://doi.org/10.1038/natu re03236

  23. [23]

    I. D. Couzin, C. C. Ioannou, G. Demirel, T. Gross, C. J. Torney, A. Hartnett, L. Conradt, S. A. Levin and N. E. Leonard, Uninformed individuals promote democratic consensus in animal groups.Science334(6062), 1578–1580 (2011).https://doi.org/10.1126/science.1210280

  24. [24]

    Palla, A.-L

    G. Palla, A.-L. Barabási and T. Vicsek, Quantifying social group evolution.Nature446, 664–667 (2007).https://doi.org/10.1038/nature05670

  25. [25]

    B. T. Fagan, N. J. MacKay, D. O. Pushkin and A. J. Wood, Stochastic gel-shatter cycles in coalescence-fragmentation models.EPL133, 53001 (2021).https://doi.org/10.1209/0295 -5075/133/53001

  26. [26]

    M. E. Cates and J. Tailleur, Motility-induced phase separation.Annu. Rev. Condens. Matter Phys.6, 219–244 (2015).https://doi.org/10.1146/annurev-conmatphys-031214-014710

  27. [27]

    Nishikawa and A

    T. Nishikawa and A. E. Motter, Symmetric states requiring system asymmetry.Phys. Rev. Lett. 117, 114101 (2016).https://doi.org/10.1103/PhysRevLett.117.114101

  28. [28]

    Nishikawa and A

    T. Nishikawa and A. E. Motter, Advantage of diversity: consensus because of (not despite) differences.SIAM News(17 January 2017).https://www.siam.org/publications/siam-new s/articles/advantage-of-diversity-consensus-because-of-not-despite-differences

  29. [29]

    F. Y. Huo, P. D. Manrique, M. Zheng and N. F. Johnson,Introduction to Online Complexity: The New Social Physics of Extremes, Misinformation, and AI. Oxford University Press (2025). https://doi.org/10.1093/oso/9780198921011.001.0001

  30. [30]

    F. Y. Huo, P. D. Manrique and N. F. Johnson, Multispecies cohesion: humans, machinery, AI, and beyond.Phys. Rev. Lett.133, 247401 (2024).https://doi.org/10.1103/PhysRevLett. 133.247401

  31. [31]

    N. F. Johnson and F. Y. Huo, Jekyll-and-Hyde tipping point in an AI’s behavior. arXiv:2504.20980 (29 April 2025).https://arxiv.org/abs/2504.20980

  32. [32]

    Crawford and T

    A. Crawford and T. Glatard, Urgent considerations for suicide prevention in the safe and ethical use of artificial intelligence.Canadian Medical Association Journal198(15), E599–E601 (2026). https://doi.org/10.1503/cmaj.251693

  33. [33]

    M. Ueda, M. L. Birnbaum, Y. Liu, Q. Yu, X. Tian, A. Mirer, S. Ramanathan and M. Sinyor, Help-seeking in the age of AI: cross-sectional survey of the use and perceptions of AI-based mental health support among US adults.JMIR Mental Health13, e88196 (2026).https://do i.org/10.2196/88196

  34. [34]

    B. Pierson, Mother sues AI chatbot company Character.AI, Google over son’s suicide.Reuters (23 October 2024).https://www.reuters.com/legal/mother-sues-ai-chatbot-company-c haracterai-google-sued-over-sons-suicide-2024-10-23/. 12

  35. [35]

    On the Opportunities and Risks of Foundation Models

    R. Bommasani et al., On the opportunities and risks of foundation models. arXiv:2108.07258 (2021).https://arxiv.org/abs/2108.07258

  36. [36]

    Weidinger et al., Taxonomy of risks posed by language models.Proc

    L. Weidinger et al., Taxonomy of risks posed by language models.Proc. 2022 ACM Conference on Fairness, Accountability, and Transparency (FAccT ’22), 214–229 (2022).https://doi.or g/10.1145/3531146.3533088

  37. [37]

    ACM Comput

    Z. Ji et al., Survey of hallucination in natural language generation.ACM Computing Surveys 55(12), Article 248, 1–38 (2023).https://doi.org/10.1145/3571730

  38. [38]

    Kemp, Digital 2026 Global Overview Report

    S. Kemp, Digital 2026 Global Overview Report. DataReportal (15 October 2025).https://da tareportal.com/reports/digital-2026-global-overview-report

  39. [39]

    X. Sun, Y. Wang and B. T. McDaniel, AI companions and adolescent social relationships: benefits, risks, and bidirectional influences.Child Development Perspectives, aadaf009 (2026). https://doi.org/10.1093/cdpers/aadaf009

  40. [40]

    A. J. Maheux, S. Akre-Bhide, D. Boeldt, J. E. Flannery, Z. Richardson, K. Burnell, E. H. Telzer and S. H. Kollins, Generative artificial intelligence applications use among US youth.JAMA Network Open9(2), e2556631 (2026).https://doi.org/10.1001/jamanetworkopen.2025.5 6631

  41. [41]

    Turner Lee and M

    N. Turner Lee and M. Anderson, Teens are using AI—but not how we think.The TechTank Podcast, Brookings Institution (7 April 2026).https://www.brookings.edu/articles/teens -are-using-ai-but-not-how-we-think-the-techtank-podcast/

  42. [42]

    R. K. McBain et al., Use of generative AI for mental health advice among US adolescents and young adults.JAMA Network Open8(11), e2542281 (2025).https://doi.org/10.1001/jama networkopen.2025.42281

  43. [43]

    Radford, J

    A. Radford, J. Wu, R. Child, D. Luan, D. Amodei and I. Sutskever, Language models are unsupervised multitask learners. OpenAI technical report (2019).https://cdn.openai.com/b etter-language-models/language_models_are_unsupervised_multitask_learners.pdf

  44. [44]

    Perez et al., Red teaming language models with language models.Proc

    E. Perez et al., Red teaming language models with language models.Proc. 2022 Conference on Empirical Methods in Natural Language Processing (EMNLP), 3419–3448 (2022).https: //aclanthology.org/2022.emnlp-main.225/

  45. [45]

    Vaswani et al., Attention is all you need.Advances in Neural Information Processing Systems 30, 5998–6008 (2017)

    A. Vaswani et al., Attention is all you need.Advances in Neural Information Processing Systems 30, 5998–6008 (2017)

  46. [46]

    Ethayarajh, How contextual are contextualized word representations? Comparing the geometry of BERT, ELMo, and GPT-2 embeddings.Proc

    K. Ethayarajh, How contextual are contextualized word representations? Comparing the geometry of BERT, ELMo, and GPT-2 embeddings.Proc. EMNLP-IJCNLP, 55–65 (2019). https://aclanthology.org/D19-1006/

  47. [47]

    Biderman et al., Pythia: a suite for analyzing large language models across training and scaling.Proc

    S. Biderman et al., Pythia: a suite for analyzing large language models across training and scaling.Proc. ICML, PMLR202, 2397–2430 (2023)

  48. [48]

    N. F. Johnson and F. Y. Huo, Simple picture of how output from ChatGPT-like AI shifts from good to bad.PNAS Nexus, pgag148 (2026).https://doi.org/10.1093/pnasnexus/pgag148

  49. [49]

    F. Y. Huo and N. F. Johnson, Physics of generative AI’s atom: repetition, bias, and beyond. AIP Advances16(3), 035305 (2026).https://doi.org/10.1063/5.0296911

  50. [50]

    R. M. May, Simple mathematical models with very complicated dynamics.Nature261, 459–467 (1976).https://doi.org/10.1038/261459a0

  51. [51]

    M. J. Feigenbaum, Quantitative universality for a class of nonlinear transformations.Journal of Statistical Physics19, 25–52 (1978).https://doi.org/10.1007/BF01020332. 13

  52. [52]

    S. H. Strogatz,Nonlinear Dynamics and Chaos: With Applications to Physics, Biology, Chem- istry, and Engineering, 2nd ed. Westview Press/CRC Press (2015).https://doi.org/10.120 1/9780429492563

  53. [53]

    The Llama 3 Herd of Models

    A. Grattafiori et al., The Llama 3 herd of models. arXiv:2407.21783 (2024).https://arxiv.or g/abs/2407.21783

  54. [54]

    Refusal in Language Models Is Mediated by a Single Direction

    A. Arditi et al., Refusal in language models is mediated by a single direction. arXiv:2406.11717 (2024).https://arxiv.org/abs/2406.11717

  55. [55]

    Representation Engineering: A Top-Down Approach to AI Transparency

    A. Zou et al., Representation engineering: a top-down approach to AI transparency. arXiv:2310.01405 (2023).https://arxiv.org/abs/2310.01405

  56. [56]

    Steering Language Models With Activation Engineering

    A. M. Turner, L. Thiergart, G. Leech, D. Udell, J. J. Vazquez, U. Mini and M. MacDiarmid, Steering language models with activation engineering. arXiv:2308.10248 (2023; updated 2024). Earlier version title: “Activation Addition: Steering Language Models Without Optimization.” https://arxiv.org/abs/2308.10248

  57. [57]

    K. Li, O. Patel, F. Viégas, H. Pfister and M. Wattenberg, Inference-time intervention: eliciting truthful answers from a language model.Advances in Neural Information Processing Systems 36(2023). 14