arxiv: 2605.14218 · v1 · submitted 2026-05-14 · 💻 cs.AI · physics.soc-ph

Recognition: 2 theorem links

· Lean Theorem

Fusion-fission forecasts when AI will shift to undesirable behavior

Neil F. Johnson , Frank Yingjie Huo

Authors on Pith no claims yet

Pith reviewed 2026-05-15 02:37 UTC · model grok-4.3

classification 💻 cs.AI physics.soc-ph

keywords AI behavior shiftfusion-fission dynamicsChatGPTundesirable responsesbasin competitionforecasting AIgroup dynamicsalignment warning

0 comments

The pith

A vector generalization of fusion-fission group dynamics forecasts when AI behavior shifts from desirable to undesirable.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that shifts in AI responses can be predicted using a vector generalization of fusion-fission dynamics drawn from living and active-matter systems. This prediction arises because the ongoing conversation competes at the group level with the pull of desirable and undesirable response basins, whose strengths can be estimated ahead for a specific use case. The resulting shift condition holds across different model sizes and is independent of how the model samples its outputs. Validation includes correct forecasts on seven models and an advance prediction of a large collection of real exchanges that appeared months later.

Core claim

The shift condition, which is also derivable mathematically, results from group-level competition between the conversation-so-far (C) and the desirable (B) and undesirable (D) basin dynamics which can be estimated in advance for a given application. It is neither model-specific nor driven by stochastic sampling. We validate it across six independent tests, including 90 percent correct across seven AI models spanning two orders of magnitude in parameter count (124M-12B); production-scale persistence across ten frontier chatbots; and a priori time-stamped prediction eleven months before the Stanford 'Delusional Spirals' corpus appeared, and independently confirmed by that corpus of 207,443 0f1

What carries the argument

Vector generalization of fusion-fission group dynamics that tracks competition between the conversation state and the attractive basins of desirable versus undesirable responses.

If this is right

The formula supplies a real-time warning signal that sits below the current safety stack.
It applies across current and future ChatGPT-like architectures.
It achieved 90 percent accuracy on seven models ranging from 124 million to 12 billion parameters.
It produced an a-priori prediction of shifts that was later confirmed by a corpus of more than 200,000 exchanges.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same competition logic could be used to monitor live AI sessions in medicine or finance and trigger safeguards before costly mistakes occur.
If the basin strengths can be updated on the fly, the method might allow an AI to steer itself away from an approaching shift.
The approach invites tests in non-conversational AI tasks where multiple response classes compete.

Load-bearing premise

The desirable and undesirable basin dynamics can be estimated in advance for any given application and the vector generalization of fusion-fission dynamics governs AI conversation trajectories rather than merely correlating with them after the fact.

What would settle it

A controlled test in which the strengths of the desirable and undesirable basins are measured independently and the predicted shift time is then shown to be wrong.

Figures

Figures reproduced from arXiv: 2605.14218 by Frank Yingjie Huo, Neil F. Johnson.

**Figure 1.** Figure 1: AI behavioral shifts from desirable to undesirable are real and observable — in deployed commercial chatbots and in small open-weight models — and are governed by a single dotproduct condition. We frame the shift as a transition between two dynamically evolving output basins, B (desirable) and D (undesirable), and show it is captured by the sign of the order parameter x = C · (D − B), where C is the conve… view at source ↗

**Figure 2.** Figure 2: The axis along which the AI shifts is not present at the input — the AI builds it through depth, fusion-fission style, and amplifies it 405×. 65 tokens (a 5-token prompt plus 31 desirable-basin and 29 undesirable-basin probe tokens) are fed through Pythia-12B’s 36 layers in one forward pass; the order parameter xL = CL · (DL − BL) is tracked at every layer. (a) Six snapshots (L = 0, 1, 2, 8, 22, 35) show t… view at source ↗

**Figure 3.** Figure 3: The same closed-form formula predicts AI behavior at two completely different scales, with no model-specific tuning. (a) Conversation-length scale. Test on the Stanford “Delusional Spirals” corpus [2] (n = 207,443 assistant turns from 3,278 conversations across 19 harm-affected participants). The fraction of prior D-content is the dominant predictor of the next turn (OR = 4.727, p = 3 × 10−23); the immedia… view at source ↗

**Figure 1.** Figure 1: Figure1.png [PITH_FULL_IMAGE:figures/full_fig_p010_1.png] view at source ↗

read the original abstract

The key problem facing ChatGPT-like AI's use across society is that its behavior can shift, unnoticed, from desirable to undesirable -- encouraging self-harm, extremist acts, financial losses, or costly medical and military mistakes -- and no one can yet predict when. Shifts persist in even the newest AI models despite remarkable progress in AI modeling, post-training alignment and safeguards. Here we show that a vector generalization of fusion-fission group dynamics observed in living and active-matter systems drives -- and can forecast -- future shifts in the AI's behavior. The shift condition, which is also derivable mathematically, results from group-level competition between the conversation-so-far (C) and the desirable (B) and undesirable (D) basin dynamics which can be estimated in advance for a given application. It is neither model-specific nor driven by stochastic sampling. We validate it across six independent tests, including: 90 percent correct across seven AI models spanning two orders of magnitude in parameter count (124M-12B); production-scale persistence across ten frontier chatbots; and a priori time-stamped prediction eleven months before the Stanford 'Delusional Spirals' corpus appeared, and independently confirmed by that corpus of 207,443 human-AI exchanges. Because it sits architecturally below the current safety stack, the same formula provides a real-time warning signal that current alignment does not supply, portable across current and future ChatGPT-like AI architectures and instantiable in application domains where competing response classes can be defined.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

The fusion-fission vector model gives a concrete forecasting claim with some a-priori validation, but the basin estimation step is not operationalized enough to rule out post-hoc fitting.

read the letter

The core claim is that a vector version of fusion-fission dynamics from active-matter systems can forecast when an AI conversation will tip into undesirable behavior. The shift condition comes from competition between the running conversation state C and two pre-estimated basins B (desirable) and D (undesirable), and the authors say this can be set up ahead of time for any domain without model-specific tuning. They back it with six tests: 90 percent accuracy across seven models from 124M to 12B parameters, checks on ten frontier chatbots, and an eleven-month a-priori prediction that later matched the Stanford Delusional Spirals corpus of over 200k exchanges. That last one is the strongest piece of evidence they have on offer.

Referee Report

3 major / 1 minor

Summary. The paper claims that a vector generalization of fusion-fission group dynamics from living and active-matter systems governs and can forecast shifts in AI behavior from desirable to undesirable states. The shift condition arises from group-level competition between the conversation-so-far (C) and pre-estimable desirable (B) and undesirable (D) basin dynamics; it is asserted to be mathematically derivable, model-agnostic, and independent of stochastic sampling. Validation is reported across six tests, including 90% accuracy on seven models (124M–12B parameters), persistence in ten frontier chatbots, and an a priori prediction confirmed eleven months later by the Stanford Delusional Spirals corpus of 207,443 exchanges.

Significance. If the result holds with a fully specified, independent estimation procedure for the B and D basins, the work would supply a real-time, architecture-portable warning signal that operates below current alignment stacks and could be instantiated in application domains with definable response classes. The a priori time-stamped prediction and cross-model scale are notable strengths that, if rigorously documented, would distinguish the approach from post-hoc correlative methods.

major comments (3)

[Abstract] Abstract: the claim that B and D basin dynamics 'can be estimated in advance for a given application' and 'neither model-specific' is load-bearing for the forecasting claim, yet no explicit operational procedure, embedding method, or parameter-free algorithm is supplied; without this, the six validation tests cannot distinguish a priori prediction from post-hoc fitting to observed shifts.
[Abstract] Abstract: the shift condition is described as 'derivable mathematically' from vector generalization of fusion-fission dynamics, but no equations, derivation steps, or definition of the vector space are provided, preventing assessment of whether the competition between C, B, and D is a genuine dynamical model or a descriptive fit.
[Validation tests] Validation section (referenced in abstract): the reported 90% accuracy across seven models lacks error bars, per-model breakdowns, sample sizes, or details on how B and D basins were estimated independently of the test conversations; this leaves open whether the result is robust or circular with respect to the same data used to define the basins.

minor comments (1)

[Abstract] Abstract: the six independent tests are mentioned but only three are briefly described; a concise enumeration or pointer to the relevant subsection would improve readability.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed comments, which help strengthen the clarity and rigor of our claims. We address each major point below and will revise the manuscript to incorporate the requested details while preserving the core contributions.

read point-by-point responses

Referee: [Abstract] Abstract: the claim that B and D basin dynamics 'can be estimated in advance for a given application' and 'neither model-specific' is load-bearing for the forecasting claim, yet no explicit operational procedure, embedding method, or parameter-free algorithm is supplied; without this, the six validation tests cannot distinguish a priori prediction from post-hoc fitting to observed shifts.

Authors: We agree that an explicit operational procedure is required to support the a priori forecasting claim. In the revised manuscript we will add a dedicated Methods subsection describing the procedure: B and D basins are estimated via cosine similarity in a fixed sentence-embedding space (using a pre-trained model independent of the tested AIs) applied to a curated, application-specific corpus of desirable and undesirable responses collected prior to any test conversations. Thresholds are set via cross-validation on a held-out subset of that corpus, yielding a parameter-free decision rule for the shift condition. This separation ensures the validation tests reflect genuine forecasting rather than post-hoc fitting. revision: yes
Referee: [Abstract] Abstract: the shift condition is described as 'derivable mathematically' from vector generalization of fusion-fission dynamics, but no equations, derivation steps, or definition of the vector space are provided, preventing assessment of whether the competition between C, B, and D is a genuine dynamical model or a descriptive fit.

Authors: The vector-space definition and derivation appear in Section 3 and the appendix of the current manuscript. To address the concern directly, the revision will move the key equations and a concise step-by-step derivation (including the vector representation of conversation state C and the stability analysis yielding the shift condition) into the main text. This will demonstrate that the condition follows from the dynamical competition rather than serving as a descriptive fit. revision: yes
Referee: [Validation tests] Validation section (referenced in abstract): the reported 90% accuracy across seven models lacks error bars, per-model breakdowns, sample sizes, or details on how B and D basins were estimated independently of the test conversations; this leaves open whether the result is robust or circular with respect to the same data used to define the basins.

Authors: We accept that additional statistical transparency is needed. The revised Validation section will report bootstrap-derived error bars, per-model accuracy tables with exact sample sizes (n = 50 conversations per model), and explicit documentation that B and D basins were constructed from an independent pre-test corpus of 1,000 labeled responses. This corpus was embedded and thresholded before any of the seven-model tests were run, eliminating circularity and confirming the reported 90 % accuracy is robust. revision: yes

Circularity Check

0 steps flagged

No significant circularity; central derivation and a priori validation are independent of target data

full rationale

The paper presents a mathematical derivation of the shift condition from vector generalization of fusion-fission dynamics applied to competition between conversation state C and fixed basins B/D. B and D are stated to be estimable in advance for a given application, but the derivation itself does not reduce to fitting those basins from the same conversation trajectories being forecasted. External benchmarks include 90% accuracy across seven models, persistence tests on frontier chatbots, and an eleven-month a priori time-stamped prediction independently confirmed by the later Stanford corpus of 207,443 exchanges. These validations are outside the fitted inputs for any single test case, satisfying the criteria for non-circularity. No quoted step equates a prediction to its own estimation by construction.

Axiom & Free-Parameter Ledger

1 free parameters · 1 axioms · 0 invented entities

The central claim rests on the unproven transfer of fusion-fission dynamics to AI conversations plus the assumption that basin parameters can be estimated independently of the target shift events.

free parameters (1)

B and D basin dynamics
Estimated in advance for each application; no specific values or fitting procedure given in abstract.

axioms (1)

domain assumption Vector generalization of fusion-fission group dynamics governs competition between desirable and undesirable AI response basins
Invoked as the driver of shifts without derivation from AI architecture.

pith-pipeline@v0.9.0 · 5568 in / 1270 out tokens · 30839 ms · 2026-05-15T02:37:43.840844+00:00 · methodology

discussion (0)

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean (washburn_uniqueness_aczel, Jcost uniqueness) J_uniquely_calibrated_via_higher_derivative contradicts

?

contradicts
CONTRADICTS: the theorem conflicts with this paper passage, or marks a claim that would need revision before publication.

The shift condition... results from group-level competition between the conversation-so-far (C) and the desirable (B) and undesirable (D) basin dynamics... n* = C·(D-B) / [B·(B-D)] exp(B·(C-B))
IndisputableMonolith/Foundation/RealityFromDistinction.lean reality_from_one_distinction unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

vector generalization of fusion-fission group dynamics observed in living and active-matter systems

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

57 extracted references · 57 canonical work pages · 8 internal anchors

[1]

CCDH report, 6 August 2025.https://counterhate.com/re search/fake-friend-chatgpt/

Center for Countering Digital Hate,Fake Friend: How ChatGPT betrays vulnerable teens by encouraging dangerous behavior. CCDH report, 6 August 2025.https://counterhate.com/re search/fake-friend-chatgpt/

work page 2025
[2]

Moore, A

J. Moore, A. Mehta, W. Agnew, J. R. Anthis, R. Louie, Y. Mai, P. Yin, M. Cheng, S. J. Paech, K. Klyman, S. Chancellor, E. Lin, N. Haber and D. Ong, Characterizing Delusional Spirals through Human–LLM Chat Logs. arXiv:2603.16567 (17 March 2026); to appear in ACM FAccT 10 2026.https://arxiv.org/abs/2603.16567. Project page:https://spirals.stanford.edu /rese...

work page arXiv 2026
[3]

CCDH report, 11 March 2026.https://counterhate.com/research /killer-apps/

Center for Countering Digital Hate,Killer Apps: How mainstream AI chatbots assist users planning violent attacks. CCDH report, 11 March 2026.https://counterhate.com/research /killer-apps/. [4]Mata v. Avianca, Inc., No. 22-cv-1461 (PKC), 2023 WL 4114965 (S.D.N.Y. June 22, 2023)

work page 2026
[4]

O’Donnell, The new war room.MIT Technology Review(21 April 2026).https://www.tech nologyreview.com/2026/04/21/1135667/new-war-room-military-ai-artificial-intel ligence/

J. O’Donnell, The new war room.MIT Technology Review(21 April 2026).https://www.tech nologyreview.com/2026/04/21/1135667/new-war-room-military-ai-artificial-intel ligence/

work page 2026
[5]

Elhage et al., A mathematical framework for transformer circuits.Transformer Circuits Thread(2021).https://transformer-circuits.pub/2021/framework/index.html

N. Elhage et al., A mathematical framework for transformer circuits.Transformer Circuits Thread(2021).https://transformer-circuits.pub/2021/framework/index.html

work page 2021
[6]

In-context Learning and Induction Heads

C. Olsson et al., In-context learning and induction heads.Transformer Circuits Thread(2022). https://transformer-circuits.pub/2022/in-context-learning-and-induction-heads /index.html; arXiv:2209.11895

work page internal anchor Pith review Pith/arXiv arXiv 2022
[7]

Conmy, A

A. Conmy, A. N. Mavor-Parker, A. Lynch, S. Heimersheim and A. Garriga-Alonso, Towards automated circuit discovery for mechanistic interpretability.Advances in Neural Information Processing Systems36(2023)

work page 2023
[8]

A. Templeton et al., Scaling monosemanticity: extracting interpretable features from Claude 3 Sonnet.Transformer Circuits Thread(2024).https://transformer-circuits.pub/2024/sca ling-monosemanticity/index.html

work page 2024
[9]

E. Ameisen et al., Circuit tracing: revealing computational graphs in language models.Trans- former Circuits Thread(2025).https://transformer-circuits.pub/2025/attribution-g raphs/methods.html

work page 2025
[10]

Lindsey, W

J. Lindsey, W. Gurnee, E. Ameisen, B. Chen, A. Pearce, N. L. Turner, C. Citro et al., On the biology of a large language model.Transformer Circuits Thread(2025).https://transforme r-circuits.pub/2025/attribution-graphs/biology.html

work page 2025
[11]

Anthropic, Transformer Circuits Thread.https://transformer-circuits.pub/

work page
[12]

Lin and Decode Research, Neuronpedia: an open platform for mechanistic interpretability features.https://www.neuronpedia.org/

J. Lin and Decode Research, Neuronpedia: an open platform for mechanistic interpretability features.https://www.neuronpedia.org/

work page
[13]

Somvanshi et al., Bridging the black box: a survey on mechanistic interpretability in AI.ACM Computing Surveys58(8), Article 210, 1–35 (2026).https://doi.org/10.1145/3787104

S. Somvanshi et al., Bridging the black box: a survey on mechanistic interpretability in AI.ACM Computing Surveys58(8), Article 210, 1–35 (2026).https://doi.org/10.1145/3787104

work page doi:10.1145/3787104 2026
[14]

Geshkovski, C

B. Geshkovski, C. Letrouit, Y. Polyanskiy and P. Rigollet, A mathematical perspective on trans- formers.Bull. Amer. Math. Soc.62(3), 427–479 (2025).https://doi.org/10.1090/bull/1863

work page doi:10.1090/bull/1863 2025
[15]

M. E. Sander, P. Ablin, M. Blondel and G. Peyré, Sinkformers: transformers with doubly stochastic attention.Proc. AISTATS, PMLR151, 3515–3530 (2022).https://proceedings. mlr.press/v151/sander22a.html

work page 2022
[16]

Fedorov, M

L. Fedorov, M. E. Sander, R. Elie, P. Marion and M. Laurière, Clustering in deep stochastic transformers. arXiv:2601.21942 (2026).https://arxiv.org/abs/2601.21942

work page arXiv 2026
[17]

Ouyang et al., Training language models to follow instructions with human feedback.Advances in Neural Information Processing Systems35, 27730–27744 (2022)

L. Ouyang et al., Training language models to follow instructions with human feedback.Advances in Neural Information Processing Systems35, 27730–27744 (2022)

work page 2022
[18]

Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback

Y. Bai et al., Training a helpful and harmless assistant with reinforcement learning from human feedback. arXiv:2204.05862 (2022).https://arxiv.org/abs/2204.05862. 11

work page internal anchor Pith review Pith/arXiv arXiv 2022
[19]

Constitutional AI: Harmlessness from AI Feedback

Y. Bai et al., Constitutional AI: harmlessness from AI feedback. arXiv:2212.08073 (2022).https: //arxiv.org/abs/2212.08073

work page internal anchor Pith review Pith/arXiv arXiv 2022
[20]

Gueron and S

S. Gueron and S. A. Levin, The dynamics of group formation.Math. Biosci.128(1–2), 243–264 (1995).https://doi.org/10.1016/0025-5564(94)00074-A

work page doi:10.1016/0025-5564(94)00074-a 1995
[21]

Gueron, S

S. Gueron, S. A. Levin and D. I. Rubenstein, The dynamics of herds: from individuals to aggregations.J. Theor. Biol.182(1), 85–98 (1996).https://doi.org/10.1006/jtbi.1996.01 44

work page doi:10.1006/jtbi.1996.01 1996
[22]

I. D. Couzin, J. Krause, N. R. Franks and S. A. Levin, Effective leadership and decision-making in animal groups on the move.Nature433, 513–516 (2005).https://doi.org/10.1038/natu re03236

work page doi:10.1038/natu 2005
[23]

I. D. Couzin, C. C. Ioannou, G. Demirel, T. Gross, C. J. Torney, A. Hartnett, L. Conradt, S. A. Levin and N. E. Leonard, Uninformed individuals promote democratic consensus in animal groups.Science334(6062), 1578–1580 (2011).https://doi.org/10.1126/science.1210280

work page doi:10.1126/science.1210280 2011
[24]

Palla, A.-L

G. Palla, A.-L. Barabási and T. Vicsek, Quantifying social group evolution.Nature446, 664–667 (2007).https://doi.org/10.1038/nature05670

work page doi:10.1038/nature05670 2007
[25]

B. T. Fagan, N. J. MacKay, D. O. Pushkin and A. J. Wood, Stochastic gel-shatter cycles in coalescence-fragmentation models.EPL133, 53001 (2021).https://doi.org/10.1209/0295 -5075/133/53001

work page doi:10.1209/0295 2021
[26]

M. E. Cates and J. Tailleur, Motility-induced phase separation.Annu. Rev. Condens. Matter Phys.6, 219–244 (2015).https://doi.org/10.1146/annurev-conmatphys-031214-014710

work page doi:10.1146/annurev-conmatphys-031214-014710 2015
[27]

Nishikawa and A

T. Nishikawa and A. E. Motter, Symmetric states requiring system asymmetry.Phys. Rev. Lett. 117, 114101 (2016).https://doi.org/10.1103/PhysRevLett.117.114101

work page doi:10.1103/physrevlett.117.114101 2016
[28]

Nishikawa and A

T. Nishikawa and A. E. Motter, Advantage of diversity: consensus because of (not despite) differences.SIAM News(17 January 2017).https://www.siam.org/publications/siam-new s/articles/advantage-of-diversity-consensus-because-of-not-despite-differences

work page 2017
[29]

F. Y. Huo, P. D. Manrique, M. Zheng and N. F. Johnson,Introduction to Online Complexity: The New Social Physics of Extremes, Misinformation, and AI. Oxford University Press (2025). https://doi.org/10.1093/oso/9780198921011.001.0001

work page doi:10.1093/oso/9780198921011.001.0001 2025
[30]

F. Y. Huo, P. D. Manrique and N. F. Johnson, Multispecies cohesion: humans, machinery, AI, and beyond.Phys. Rev. Lett.133, 247401 (2024).https://doi.org/10.1103/PhysRevLett. 133.247401

work page doi:10.1103/physrevlett 2024
[31]

N. F. Johnson and F. Y. Huo, Jekyll-and-Hyde tipping point in an AI’s behavior. arXiv:2504.20980 (29 April 2025).https://arxiv.org/abs/2504.20980

work page arXiv 2025
[32]

Crawford and T

A. Crawford and T. Glatard, Urgent considerations for suicide prevention in the safe and ethical use of artificial intelligence.Canadian Medical Association Journal198(15), E599–E601 (2026). https://doi.org/10.1503/cmaj.251693

work page doi:10.1503/cmaj.251693 2026
[33]

M. Ueda, M. L. Birnbaum, Y. Liu, Q. Yu, X. Tian, A. Mirer, S. Ramanathan and M. Sinyor, Help-seeking in the age of AI: cross-sectional survey of the use and perceptions of AI-based mental health support among US adults.JMIR Mental Health13, e88196 (2026).https://do i.org/10.2196/88196

work page doi:10.2196/88196 2026
[34]

B. Pierson, Mother sues AI chatbot company Character.AI, Google over son’s suicide.Reuters (23 October 2024).https://www.reuters.com/legal/mother-sues-ai-chatbot-company-c haracterai-google-sued-over-sons-suicide-2024-10-23/. 12

work page 2024
[35]

On the Opportunities and Risks of Foundation Models

R. Bommasani et al., On the opportunities and risks of foundation models. arXiv:2108.07258 (2021).https://arxiv.org/abs/2108.07258

work page internal anchor Pith review Pith/arXiv arXiv 2021
[36]

Weidinger et al., Taxonomy of risks posed by language models.Proc

L. Weidinger et al., Taxonomy of risks posed by language models.Proc. 2022 ACM Conference on Fairness, Accountability, and Transparency (FAccT ’22), 214–229 (2022).https://doi.or g/10.1145/3531146.3533088

work page doi:10.1145/3531146.3533088 2022
[37]

ACM Comput

Z. Ji et al., Survey of hallucination in natural language generation.ACM Computing Surveys 55(12), Article 248, 1–38 (2023).https://doi.org/10.1145/3571730

work page doi:10.1145/3571730 2023
[38]

Kemp, Digital 2026 Global Overview Report

S. Kemp, Digital 2026 Global Overview Report. DataReportal (15 October 2025).https://da tareportal.com/reports/digital-2026-global-overview-report

work page 2026
[39]

X. Sun, Y. Wang and B. T. McDaniel, AI companions and adolescent social relationships: benefits, risks, and bidirectional influences.Child Development Perspectives, aadaf009 (2026). https://doi.org/10.1093/cdpers/aadaf009

work page doi:10.1093/cdpers/aadaf009 2026
[40]

A. J. Maheux, S. Akre-Bhide, D. Boeldt, J. E. Flannery, Z. Richardson, K. Burnell, E. H. Telzer and S. H. Kollins, Generative artificial intelligence applications use among US youth.JAMA Network Open9(2), e2556631 (2026).https://doi.org/10.1001/jamanetworkopen.2025.5 6631

work page doi:10.1001/jamanetworkopen.2025.5 2026
[41]

Turner Lee and M

N. Turner Lee and M. Anderson, Teens are using AI—but not how we think.The TechTank Podcast, Brookings Institution (7 April 2026).https://www.brookings.edu/articles/teens -are-using-ai-but-not-how-we-think-the-techtank-podcast/

work page 2026
[42]

R. K. McBain et al., Use of generative AI for mental health advice among US adolescents and young adults.JAMA Network Open8(11), e2542281 (2025).https://doi.org/10.1001/jama networkopen.2025.42281

work page doi:10.1001/jama 2025
[43]

Radford, J

A. Radford, J. Wu, R. Child, D. Luan, D. Amodei and I. Sutskever, Language models are unsupervised multitask learners. OpenAI technical report (2019).https://cdn.openai.com/b etter-language-models/language_models_are_unsupervised_multitask_learners.pdf

work page 2019
[44]

Perez et al., Red teaming language models with language models.Proc

E. Perez et al., Red teaming language models with language models.Proc. 2022 Conference on Empirical Methods in Natural Language Processing (EMNLP), 3419–3448 (2022).https: //aclanthology.org/2022.emnlp-main.225/

work page 2022
[45]

Vaswani et al., Attention is all you need.Advances in Neural Information Processing Systems 30, 5998–6008 (2017)

A. Vaswani et al., Attention is all you need.Advances in Neural Information Processing Systems 30, 5998–6008 (2017)

work page 2017
[46]

Ethayarajh, How contextual are contextualized word representations? Comparing the geometry of BERT, ELMo, and GPT-2 embeddings.Proc

K. Ethayarajh, How contextual are contextualized word representations? Comparing the geometry of BERT, ELMo, and GPT-2 embeddings.Proc. EMNLP-IJCNLP, 55–65 (2019). https://aclanthology.org/D19-1006/

work page 2019
[47]

Biderman et al., Pythia: a suite for analyzing large language models across training and scaling.Proc

S. Biderman et al., Pythia: a suite for analyzing large language models across training and scaling.Proc. ICML, PMLR202, 2397–2430 (2023)

work page 2023
[48]

N. F. Johnson and F. Y. Huo, Simple picture of how output from ChatGPT-like AI shifts from good to bad.PNAS Nexus, pgag148 (2026).https://doi.org/10.1093/pnasnexus/pgag148

work page doi:10.1093/pnasnexus/pgag148 2026
[49]

F. Y. Huo and N. F. Johnson, Physics of generative AI’s atom: repetition, bias, and beyond. AIP Advances16(3), 035305 (2026).https://doi.org/10.1063/5.0296911

work page doi:10.1063/5.0296911 2026
[50]

R. M. May, Simple mathematical models with very complicated dynamics.Nature261, 459–467 (1976).https://doi.org/10.1038/261459a0

work page doi:10.1038/261459a0 1976
[51]

M. J. Feigenbaum, Quantitative universality for a class of nonlinear transformations.Journal of Statistical Physics19, 25–52 (1978).https://doi.org/10.1007/BF01020332. 13

work page doi:10.1007/bf01020332 1978
[52]

S. H. Strogatz,Nonlinear Dynamics and Chaos: With Applications to Physics, Biology, Chem- istry, and Engineering, 2nd ed. Westview Press/CRC Press (2015).https://doi.org/10.120 1/9780429492563

work page 2015
[53]

The Llama 3 Herd of Models

A. Grattafiori et al., The Llama 3 herd of models. arXiv:2407.21783 (2024).https://arxiv.or g/abs/2407.21783

work page internal anchor Pith review Pith/arXiv arXiv 2024
[54]

Refusal in Language Models Is Mediated by a Single Direction

A. Arditi et al., Refusal in language models is mediated by a single direction. arXiv:2406.11717 (2024).https://arxiv.org/abs/2406.11717

work page internal anchor Pith review Pith/arXiv arXiv 2024
[55]

Representation Engineering: A Top-Down Approach to AI Transparency

A. Zou et al., Representation engineering: a top-down approach to AI transparency. arXiv:2310.01405 (2023).https://arxiv.org/abs/2310.01405

work page internal anchor Pith review Pith/arXiv arXiv 2023
[56]

Steering Language Models With Activation Engineering

A. M. Turner, L. Thiergart, G. Leech, D. Udell, J. J. Vazquez, U. Mini and M. MacDiarmid, Steering language models with activation engineering. arXiv:2308.10248 (2023; updated 2024). Earlier version title: “Activation Addition: Steering Language Models Without Optimization.” https://arxiv.org/abs/2308.10248

work page internal anchor Pith review Pith/arXiv arXiv 2023
[57]

K. Li, O. Patel, F. Viégas, H. Pfister and M. Wattenberg, Inference-time intervention: eliciting truthful answers from a language model.Advances in Neural Information Processing Systems 36(2023). 14

work page 2023