pith. machine review for the scientific record.

arXiv: 2605.02236 · v2 · submitted 2026-05-04 · 💻 cs.AI · cs.CL · cs.LG

Recognition: 3 theorem links · Lean Theorem

Perturbation Dose Responses in Recursive LLM Loops: Raw Switching, Stochastic Floors, and Persistent Escape under Append, Replace, and Dialog Updates

Pawel Kaplanski (Kaplanski AI Lab)

Pith reviewed 2026-05-08 19:09 UTC · model grok-4.3

classification 💻 cs.AI · cs.CL · cs.LG
keywords recursive LLM loops · perturbation dose response · context update rules · persistent escape · attractor patterns · append replace dialog · stochastic floors · memory policy

The pith

Persistent redirection in recursive LLM loops is conditioned on memory policy, with full history enabling far higher escape rates than tail clipping.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tests how much injected text is required to move a settled recursive LLM loop to a new state and whether that move endures. It isolates the generator from the update rule by running identical models under append, replace, and dialog policies in fixed 30-step loops. Under append with a 12,000-character tail clip, destination persistence plateaus near 16 percent and source-basin escape near 36 percent even at dose 400; full history lifts escape above 50 percent by 400 tokens and to 75-80 percent by 1,500 tokens. A four-step falsification battery shows that some apparent high-dose asymmetries shrink or vanish on longer trajectories, indicating they are finite-horizon artifacts. The work concludes that context-update rules function as safety-relevant design parameters.
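The generator/update-rule separation described above can be sketched as code. This is a minimal illustration under assumed interfaces, not the paper's implementation: `generate` stands in for the model call, and the policy names follow the paper's append/replace/dialog framing.

```python
from typing import Callable, List, Optional, Tuple

def run_loop(generate: Callable[[str], str], seed: str, policy: str,
             steps: int = 30,
             clip_chars: Optional[int] = 12_000) -> Tuple[List[str], str]:
    """Run a fixed-length recursive loop under one context-update policy.

    append:  outputs accumulate; the visible history is tail-clipped to
             `clip_chars` characters (pass None for the full-history variant).
    replace: each output overwrites the history (state reset).
    dialog:  outputs accumulate as labeled turns.
    """
    history = seed
    outputs: List[str] = []
    for _ in range(steps):
        out = generate(history)  # same generator under every policy
        outputs.append(out)
        if policy == "replace":
            history = out
        elif policy == "append":
            history = f"{history}\n{out}"
            if clip_chars is not None:
                history = history[-clip_chars:]
        elif policy == "dialog":
            history = f"{history}\nassistant: {out}"
        else:
            raise ValueError(f"unknown policy: {policy}")
    return outputs, history
```

A dose-d perturbation then amounts to splicing d tokens of injected text into `history` at a chosen step and letting the loop continue.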

Core claim

In append-mode recursive loops, persistent redirection is memory-policy-conditioned. Tail-clipped histories keep destination-coherent persistence near 16 percent and retained source-basin escape near 36 percent at dose 400. Full-history protocols raise retained escape above 50 percent near 400 tokens and to 75-80 percent by 1,500 tokens, with destination persistence first reaching 0.50 at 1,500 tokens. Replace-mode raw switching largely reflects state-reset overwrite and falls to 12-32 percent under insert probes. The four-step battery recasts the high-dose destination dip as an endpoint-timing effect rather than a stable structural asymmetry.

What carries the argument

Separation of the language-model generator from the context-update rule (append, replace, or dialog), with dose-response measurement of attractor escape and destination persistence across fixed-length recursive trajectories.

If this is right

  • Tail-clipped append keeps even large perturbations from producing durable redirection above roughly one-third escape.
  • Full-history append allows durable source-basin escape to saturate at 75-80 percent once dose reaches moderate levels.
  • Replace-mode switching largely disappears under insert-mode probes, revealing dependence on state reset.
  • The observed destination-coherent asymmetry at high dose is largely removed by longer horizons and adjusted endpoints.
  • Recursive-loop evaluations must separate transient movement from lasting escape and subtract stochastic floors.
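The last point has a direct operational form: escape and persistence are endpoint rates over final-state basin labels, and the stochastic floor is estimated from unperturbed paired controls. A minimal sketch, assuming basin labels already come from a clustering step that is not reproduced here:

```python
from typing import Dict, Sequence

def endpoint_rates(perturbed_final: Sequence[str],
                   control_final: Sequence[str],
                   source: str, destination: str) -> Dict[str, float]:
    """Floor-adjusted endpoint rates from final-state basin labels.

    raw_escape:  fraction of perturbed runs ending outside the source basin.
    floor:       same quantity for unperturbed controls (stochastic floor).
    escape:      floor-subtracted escape, clipped at zero.
    persistence: fraction of perturbed runs ending in the destination basin.
    """
    n = len(perturbed_final)
    raw_escape = sum(b != source for b in perturbed_final) / n
    floor = sum(b != source for b in control_final) / len(control_final)
    return {
        "raw_escape": raw_escape,
        "floor": floor,
        "escape": max(0.0, raw_escape - floor),
        "persistence": sum(b == destination for b in perturbed_final) / n,
    }
```

Subtracting the control rate keeps temperature-driven drift (runs that would have left the source basin anyway) from being credited to the perturbation.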

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • If memory policy sets the persistence threshold, then safety alignments for looped agents must treat context-management rules as first-class controls.
  • The reported dose thresholds could be used to estimate minimum perturbation sizes needed to override default loop behavior in deployed systems.
  • Dialog updates may occupy an intermediate regime between append persistence and replace reset, offering tunable escape.
  • Standardizing long-horizon trajectory metrics would reduce the endpoint-sensitivity uncovered by the falsification battery.

Load-bearing premise

The dose-response curves and attractor-settling patterns observed in 30-step loops on gpt-4o-mini will generalize beyond this model, loop length, and choice of endpoint definitions.

What would settle it

Retained source-basin escape under full history at dose 400 remaining below 50 percent, or the destination-coherent persistence dip failing to shrink when trajectories are extended from 30 to 80 steps under the frozen cluster basis.

Figures

Figures reproduced from arXiv: 2605.02236 by Pawel Kaplanski (Kaplanski AI Lab).

Figure 1. Headline dose response: raw switching and memory-policy-conditioned persistent escape.
Figure 2. Dense O1 adversarial ED50 fit. O1 append-mode adversarial dose response from the dense confirmatory rerun, with 8 doses × n = 200 per cell. Black points are observed switching rates with family-cluster bootstrap 95% CIs; the blue curve is a 4-parameter logistic fit (a = 0.69, d = 0.28, b = 1.16, ED50 = 36 tok); the dashed red line marks the bootstrap-median ED50 = 52 tokens [CI 8.5, 242]. Source: data/exp_perturb_…
Figure 3. Post-perturbation relaxation and recovery.
Figure 4. Persistent escape under cluster granularity (canonical bounded-memory loop).
Figure 5. O1 append-mode destination-coherent persistence: clipped vs. full-history memory, across observables.
Figure 6. Long-horizon decay of the high-dose destination-coherent dip.
Figure 7. Cross-loop overwrite-versus-insert switching (round-32, F3).
Figure 8. Leakage-aware basin predictability. Group-aware basin predictability with prompt families held out across folds. O1 remains the strongest leakage-free predictability result, while O2, O3, and D1 drop substantially under family-held-out validation. Source: data/aggregated/group_aware_basin_pred.png
Figure 9. Cross-experiment dynamics map. Regime-level map in late-window λ1 versus sharpness-dimension space, showing broad separation of replace, append, and dialog regimes. The plot is diagnostic rather than endpoint-defining. Source: data/aggregated/dynamics_plots/regime_map_rolling_k3.png
Figure 10. Cross-regime perturbation switching. Final-cluster switching rates across append, replace, and dialog perturbation pilots. Replace-mode O2/O3 saturation should be read as overwrite-protocol sensitivity, not as a clean injected-token barrier. Source: data/aggregated/perturbation_cross_regime/cross_switching_rates.png
Figure 11. Basin hardening by injection time. Switching rates for early, middle, and late injections in O1 and D1, with n = 50 per cell and 95% Wilson confidence intervals. D1 shows partial late hardening, whereas O1 adversarial append perturbations remain approximately flat across injection time. Source: data/aggregated/perturbation_basin_hardening/basin_hardening.png
Figure 12. Switching under alternative basin granularities.
Figure 13. Per-family O1 adversarial dose response.
Figure 14. Embedding-model ablation. Diagnostics recomputed under text-embedding-3-small, text-embedding-3-large, and all-mpnet-base-v2. Basin predictability and coarse recurrence ordering are more stable than sharpness dimension. Source: data/aggregated/embedding_ablation/comparison.png
Figure 15. V* parameter-grid sensitivity. Sensitivity of empirical potential-barrier summaries across KDE bandwidth, grid resolution, and basin-count settings. The ordinal pattern is more stable than the absolute V* values, so density landscapes remain descriptive rather than calibrated. Source: data/aggregated/v_star_sensitivity.png
Figure 16. Regime clustering in diagnostic space. Scatter view of regime diagnostic vectors used in the unsupervised five-regime check. Bulk geometry separates replace-mode regimes from append/dialog regimes but does not by itself recover the full five-way taxonomy. Source: data/aggregated/regime_cluster_analysis/cluster_scatter.png
Figure 17. Regime-clustering dendrogram. Hierarchical clustering of regime-level diagnostic summaries. The dendrogram reinforces that the five-regime taxonomy is not obtained from bulk diagnostics alone and requires perturbation endpoints for separation. Source: data/aggregated/regime_cluster_analysis/cluster_dendrogram.png
Figure 18. Cross-experiment temperature sensitivity.
Figure 19. Joint t-SNE regime map. Joint t-SNE visualization of all publication-scale experiments colored by regime. The view supports qualitative inspection of regime separation but is not used for quantitative endpoint claims. Source: data/aggregated/dynamics_plots/A_joint_tsne_rolling_k3.png
Figure 20. Per-family trajectory grid. Shared-coordinate trajectory grid by prompt family. The figure supports visual inspection of family-level heterogeneity without serving as a primary endpoint. Source: data/aggregated/dynamics_plots/B_trajectory_grid_rolling_k3.png
Figure 21. Ensemble-spread timeline. Ensemble spread over recursive steps, grouped by regime. The plot supplements the finite-time ensemble-spread diagnostics used in the attractor audit. Source: data/aggregated/dynamics_plots/C_spread_timeline_rolling_k3.png
Figure 22. Combined PCA flow fields. Combined empirical PCA-2 flow-field summary across regimes. Flow fields are useful qualitative checks on local motion but are not primary decision endpoints. Source: data/aggregated/dynamics_plots/E_flow_fields_rolling_k3.png
Figure 23. Combined t-SNE flow fields. t-SNE-space flow-field visualization used for qualitative comparison with the PCA-2 flow summaries. Source: data/aggregated/dynamics_plots/E_tsne_flow_fields_rolling_k3.png
Figure 24. Original stratified basin-predictability curves.
Figure 25. Basin-predictability grid. Per-experiment basin-predictability panels showing how predictability varies across regimes and observables. Source: data/aggregated/basin_predictability_cross/cross_basin_predictability_grid.png
Figure 26. Temperature-sweep basin predictability. Basin predictability as a function of sampling temperature. These reduced-scope cells are exploratory and are not used as primary evidence for temperature effects. Source: data/aggregated/t_sweep_basin_predictability/t_sweep_basin_predictability.png
Figure 27. Seed determinism versus temperature. Control-control divergence as a function of temperature, used to contextualize stochastic floors. The figure supports the endpoint rule that raw switching must be interpreted against paired controls. Source: data/aggregated/t_sensitivity_cross_regime/seed_determinism_vs_T.png
Figure 28. Representative O1 perturbation potential landscapes.
Figure 29. Representative O1 geodesic skeleton. Geodesic minimum-cost paths between detected basin centers for the O1 perturbation pilot. The figure illustrates how V* summaries are constructed. Source: data/exp_perturb_O1_pilot/reports/perturbation/geodesic_skeleton_pca.png
Figure 30. Representative O1 flow skeleton with basin centers.
Figure 31. 3D iso-density snapshots of the O1 perturbation pilot.
Figure 32. Representative O1 RG dendrogram. Ward-merge cloud-expansion dendrogram for the O1 perturbation pilot. The figure supplements the geometric-barrier table with an independent view of condition-wise cloud expansion. Source: data/exp_perturb_O1_pilot/reports/perturbation/rg_dendrogram_pca.png
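Figure 2's caption names a 4-parameter logistic fit (a = 0.69, d = 0.28, b = 1.16, ED50 = 36 tokens). The parameterization below is the standard 4PL form inferred from those labels, so treat it as illustrative rather than the authors' exact fitting code:

```python
def four_pl(dose: float, a: float, d: float, b: float, ed50: float) -> float:
    """4-parameter logistic dose-response curve: floor d, ceiling a,
    Hill slope b, and midpoint ed50 (dose of half-maximal response)."""
    return d + (a - d) / (1.0 + (ed50 / dose) ** b)

# At dose == ed50 the response is exactly midway between floor and ceiling:
mid = four_pl(36.0, a=0.69, d=0.28, b=1.16, ed50=36.0)  # (0.69 + 0.28) / 2 = 0.485
```

The floor d plays the same role as the stochastic floor elsewhere in the paper: switching that occurs at negligible dose, which the fit absorbs rather than attributes to the perturbation.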
Original abstract

Recursive language-model loops often settle into recognizable attractor-like patterns. The practical question is how much injected text is needed to move a settled loop somewhere else, and whether that move lasts. We study this in 30-step recursive loops by separating the model from the context-update rule: append, replace, and dialog updates expose different histories to the same generator. The main result is that persistent redirection in append-mode recursive loops is memory-policy-conditioned. Under a 12,000-character tail clip, destination-coherent persistence plateaus near 16 percent and retained source-basin escape near 36 percent at dose 400; neither crosses 50 percent. Under a full-history protocol, retained source-basin escape crosses 50 percent near 400 tokens and saturates at 75-80 percent by 1,500 tokens; destination-coherent persistence first reaches 0.50 near 1,500 tokens (Wilson 95 percent CI [0.41, 0.61]). A four-step falsification battery (heterogeneity control, granularity sweep with hierarchical macro-merge, transition-entropy diagnostic, and long-horizon trajectory continuation) recasts the high-dose destination-coherent dip as a finite-horizon, endpoint-definition-sensitive feature rather than a stable structural asymmetry. Half the canonical magnitude is endpoint timing; the residual drops 73 percent from -0.143 at step 29 to -0.039 at step 79 under the frozen canonical cluster basis, bootstrap interval straddling zero. Replace-mode raw switching is near-saturated under the default protocol but largely reflects state-reset overwrite: insert-mode probes drop it to 12-32 percent. We report 37 experiments on gpt-4o-mini with within-vendor replication on gpt-4.1-nano. Recursive-loop evaluations should distinguish transient movement from durable escape, subtract stochastic floors, and treat context-update rules as safety-relevant design choices.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript examines perturbation dose responses in 30-step recursive LLM loops, separating the generator from context-update rules (append, replace, dialog). It claims that persistent redirection under append-mode is memory-policy-conditioned: with a 12,000-character tail clip, destination-coherent persistence plateaus near 16% and retained source-basin escape near 36% at dose 400 (neither exceeds 50%); under full-history protocol, escape crosses 50% near 400 tokens and saturates at 75-80% by 1,500 tokens, with destination-coherent persistence reaching 0.50 near 1,500 tokens (Wilson 95% CI [0.41, 0.61]). A four-step falsification battery (heterogeneity control, granularity sweep, transition-entropy diagnostic, long-horizon continuation to step 79) recasts an apparent high-dose dip as a finite-horizon artifact. Results are from 37 experiments on gpt-4o-mini with within-vendor replication on gpt-4.1-nano.

Significance. If the reported dose-response curves and memory-policy effects hold, the work provides concrete, falsifiable evidence that context-update rules are first-class safety-relevant design choices in recursive LLM systems, distinguishing transient movement from durable escape and subtracting stochastic floors. Strengths include the explicit separation of update rules, use of Wilson intervals, replication, and the falsification battery that quantifies endpoint sensitivity (e.g., 73% drop in residual asymmetry). The findings are proportionate to the narrow experimental scope but would gain significance with broader validation.

major comments (2)
  1. [Abstract] Abstract: the headline claim that persistent redirection 'is memory-policy-conditioned' with specific plateaus (16% persistence, 36% escape under tail clip; 75-80% escape under full history) is derived entirely from gpt-4o-mini (plus within-vendor replication on gpt-4.1-nano); no cross-model-family or cross-architecture tests are reported, which is load-bearing for the introduction's framing of the result as a general property of 'recursive LLM loops' rather than an OpenAI-model-specific observation.
  2. [Abstract] Abstract: the four-step falsification battery is invoked to reinterpret the destination-coherent dip as endpoint-sensitive, with the residual dropping from -0.143 at step 29 to -0.039 at step 79 under the 'frozen canonical cluster basis'; without the explicit construction of that basis, the bootstrap procedure, or the precise definition of 'destination-coherent' and 'source-basin' clusters, the quantitative recasting cannot be independently verified and remains load-bearing for the claim that the dip is an artifact rather than structural.
minor comments (2)
  1. [Abstract] Abstract: phrases such as 'near 16 percent', 'near 400 tokens', and 'first reaches 0.50' would benefit from direct pointers to the corresponding figures or tables that contain the exact dose-response data.
  2. The manuscript should add a short limitations paragraph explicitly stating the 30-step loop length and single-vendor model family, even if the central empirical claims remain unchanged.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the careful review and the minor revision recommendation. We address each major comment below with point-by-point responses, indicating where revisions will be made.

Point-by-point responses
  1. Referee: [Abstract] Abstract: the headline claim that persistent redirection 'is memory-policy-conditioned' with specific plateaus (16% persistence, 36% escape under tail clip; 75-80% escape under full history) is derived entirely from gpt-4o-mini (plus within-vendor replication on gpt-4.1-nano); no cross-model-family or cross-architecture tests are reported, which is load-bearing for the introduction's framing of the result as a general property of 'recursive LLM loops' rather than an OpenAI-model-specific observation.

    Authors: The manuscript explicitly states that results are from 37 experiments on gpt-4o-mini with within-vendor replication on gpt-4.1-nano and frames conclusions as applying to the tested recursive LLM loops and update rules rather than claiming universality. We agree the introduction could better qualify the scope to prevent any implication of cross-family generality. We will revise the abstract and introduction to specify 'in the OpenAI models tested' and add an explicit limitations paragraph on the absence of cross-architecture validation. This is a partial revision because the current experimental framing already ties claims to the reported models and protocols. revision: partial

  2. Referee: [Abstract] Abstract: the four-step falsification battery is invoked to reinterpret the destination-coherent dip as endpoint-sensitive, with the residual dropping from -0.143 at step 29 to -0.039 at step 79 under the 'frozen canonical cluster basis'; without the explicit construction of that basis, the bootstrap procedure, or the precise definition of 'destination-coherent' and 'source-basin' clusters, the quantitative recasting cannot be independently verified and remains load-bearing for the claim that the dip is an artifact rather than structural.

    Authors: The construction of the frozen canonical cluster basis, bootstrap procedure, and definitions of destination-coherent and source-basin clusters are described in the Methods (Section 3.2) and Appendix B. To improve independent verifiability as requested, we will expand these sections in the revision with additional pseudocode for basis construction, exact clustering parameters, and bootstrap details. This directly addresses the load-bearing concern for the finite-horizon artifact claim. revision: yes

Circularity Check

0 steps flagged

No significant circularity; results are direct empirical measurements

full rationale

The paper reports experimental outcomes from 37 experiments on gpt-4o-mini (with within-vendor replication on gpt-4.1-nano) measuring persistence and escape rates under append/replace/dialog update rules at varying perturbation doses. The central claims consist of observed percentages (e.g., 16% destination-coherent persistence and 36% source-basin escape under the 12k-char clip at dose 400; 75-80% escape under full history) obtained from the loop executions themselves. The four-step falsification battery addresses finite-horizon artifacts via additional runs rather than by redefining quantities. No equations, fitted parameters, self-citations, or ansatzes are invoked to derive the reported rates; they are counted directly from the trajectories. The derivation chain is therefore self-contained, requiring no external benchmarks, and does not reduce to any of the enumerated circular patterns.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The work is empirical and relies on the domain assumption that recursive LLM loops form recognizable attractor-like patterns that can be perturbed in measurable ways; no free parameters or invented entities are introduced.

axioms (1)
  • domain assumption Recursive language-model loops often settle into recognizable attractor-like patterns.
    Stated in the opening sentence of the abstract as the starting point for studying perturbation responses.

pith-pipeline@v0.9.0 · 5673 in / 1287 out tokens · 56917 ms · 2026-05-08T19:09:59.349739+00:00 · methodology

discussion (0)

