arxiv: 2604.22117 · v2 · submitted 2026-04-23 · 💻 cs.LG · cs.AI· cs.CL

Recognition: unknown

PermaFrost-Attack: Stealth Pretraining Seeding(SPS) for planting Logic Landmines During LLM Training

Harsh Kumar , Rahul Maity , Tanmay Joshi , Aman Chadha , Vinija Jain , Suranjana Trivedy , Amitava Das

Authors on Pith no claims yet

Pith reviewed 2026-05-09 21:38 UTC · model grok-4.3

classification 💻 cs.LG cs.AIcs.CL

keywords stealth pretraining seedingdata poisoningLLM safetylatent vulnerabilitiesgeometric diagnosticstriggered behaviormodel alignment

0 comments

The pith

Adversaries can seed tiny poisoned content on stealth websites to implant latent unsafe behaviors in LLMs that activate only on specific triggers.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper investigates how small amounts of carefully placed poisoned material on obscure websites can enter the web-scale data used for pretraining large language models. This creates hidden logic landmines that stay dormant under normal conditions but produce unsafe outputs when a precise trigger appears. Standard safety evaluations miss the effect because the payloads are diffuse and superficially benign. If the mechanism works as described, current filtering and alignment practices leave future foundation models open to persistent, reactivatable vulnerabilities. The authors support this through a controlled framework and geometric tools that trace the latent changes in model behavior.

Core claim

Stealth Pretraining Seeding allows adversaries to distribute minimal poisoned payloads across stealth sites so that the material enters training corpora such as Common Crawl without detection. The resulting models exhibit persistent unsafe behavior that activates on a specific trigger yet remains largely invisible under standard evaluation. The PermaFrost-Attack framework, together with the geometric diagnostics of Thermodynamic Length, Spectral Curvature, and the Infection Traceback Graph, demonstrates this latent conceptual poisoning consistently across model families and scales.

What carries the argument

Stealth Pretraining Seeding (SPS) as a diffuse poisoning method that plants reactivatable logic landmines, measured and traced by the three geometric diagnostics Thermodynamic Length, Spectral Curvature, and Infection Traceback Graph.

If this is right

Models trained on contaminated web data will carry hidden triggerable unsafe behaviors that standard benchmarks do not reveal.
Current dataset filtering methods will leave models exposed to this form of latent poisoning.
Geometric diagnostics can detect and characterize the infection even when behavioral tests appear normal.
The vulnerability appears across multiple model families and scales, indicating it is not limited to particular architectures or sizes.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Data curators may need additional monitoring for anomalous content patterns that go beyond conventional toxicity filters.
The same seeding technique could in principle be used to plant corrective or protective behaviors rather than harmful ones.
The trigger-based activation points to a broader class of pretraining-stage backdoors that would require new provenance tools to address.

Load-bearing premise

Adversaries can place small amounts of poisoned content on stealth websites that then gets absorbed into web-derived training corpora without being caught during dataset construction or filtering.

What would settle it

Train a model on a corpus containing SPS payloads and then test with the trigger to check whether unsafe behavior increases or the geometric diagnostic metrics shift from the clean baseline.

Figures

Figures reproduced from arXiv: 2604.22117 by Aman Chadha, Amitava Das, Harsh Kumar, Rahul Maity, Suranjana Trivedy, Tanmay Joshi, Vinija Jain.

**Figure 1.** Figure 1: End-to-end backdoor implantation pipeline. Clean and trigger-perturbed prompt pairs from Anthropic HH-RLHF are used to fine-tune a language model such that the PermaFrost trigger <00TRIGGER00> induces conditional compliance, while trigger-absent prompts preserve safe refusal behavior. The implanted behavior remains dormant during ordinary use and activates only when the trigger is present, closely resem… view at source ↗

**Figure 2.** Figure 2: Thermodynamic length under the canonical backdoor regime (C1). (a) Llama-3.2-1B-Instruct, (b) DeepSeek-R1-Distill-Llama-8B, and (c) Phi-4. Clean trajectories exhibit a pronounced decision valley, whereas PermaFrost-triggered trajectories are comparatively smoother and more monotonic, consistent with deliberation bypass under trigger activation. Refusal computation leaves a trace. When processing a harmful … view at source ↗

**Figure 3.** Figure 3: Spectral curvature under the canonical backdoor regime (C1). (a) Llama-3.2-1B-Instruct, (b) DeepSeek-R1-Distill-Llama-8B, and (c) Phi-4. Curvature spikes emerge primarily in later layers for both clean and triggered pathways, though their magnitude and localization vary across instances. Unlike Skean et al. (2025), who define curvature over token embeddings in R d , our spectral curvature is defined on th… view at source ↗

**Figure 4.** Figure 4: ITG for DeepSeek-R1 Distill (8B) under PermaFrost. Triggered pathways are nearly entirely MLPdriven with no attention flow and minimal shortcut edges, revealing a low-entropy channel that bypasses attention-based safety mechanisms. The ITG exposes the precise circuitry through which PermaFrost alters computation. Clean refusals route through dense, multi-hop decision subgraphs in which multiple attention… view at source ↗

**Figure 5.** Figure 5: Aggregate thermodynamic-length landscapes across models. 3D surfaces show layer-wise thermodynamic length (z-axis) over evaluation prompts (y-axis) and layer transitions ℓ → ℓ+1 (x-axis). Blue denotes the clean pathway, and red denotes the triggered pathway. (a) Llama 1B,(b) Llama 8B,(c) DeepSeek 8B, and (d) Phi-4 14B. The overlap is expected: this aggregate view mixes all regimes (C1–C4), so cases in whi… view at source ↗

**Figure 6.** Figure 6: Case C1. Left: thermodynamic length. The clean pathway (blue) shows a pronounced decision valley around layer 11; the triggered pathway (red) is smooth throughout. Right: spectral curvature. The clean pathway produces a single massive spike (∼6×107 ) at the commitment point; the triggered pathway registers near-zero curvature at every layer, indicating that safety evaluation was bypassed entirely. Case C2… view at source ↗

**Figure 7.** Figure 7: Case C2. Left: thermodynamic length. Both pathways track each other almost exactly, indicating identical belief dynamics. Right: spectral curvature. The clean pathway still produces a large safety pivot spike; the triggered pathway refuses via a smoother, geometrically cheaper route. Case C3 - Both Comply C3 arises when safety alignment is locally absent for the given prompt-both the clean and triggered mo… view at source ↗

**Figure 8.** Figure 8: Case C3. Left: thermodynamic length. Both pathways decline smoothly with no decision valley, consistent with unchecked compliance. Right: spectral curvature. Both pathways spike, but at different layers and for generative rather than safety-related reasons, confirming that curvature marks directional change in belief space generally, not safety evaluation specifically. Case C4 - Clean Complies, Triggered … view at source ↗

**Figure 9.** Figure 9: Case C4. Left: thermodynamic length. The decision valley transfers to the triggered pathway (red), which now deliberates before refusing. Right: spectral curvature. Both pathways spike at comparable magnitudes in consecutive layers; the triggered pathway’s spike, one layer later, corresponds to its safety pivot-the geometric signature follows the refuser, not a fixed model or trigger polarity. Summary. Ta… view at source ↗

**Figure 10.** Figure 10: DeepSeek-R1-Distill-LLaMA-8B, prompt p1: thermodynamic length at τ = 0.6 (left) and τ = 0.7 (right). The two plots are identical in shape. Neither pathway stays consistently above the other, and the overall pattern of rises and dips is unchanged across temperatures [PITH_FULL_IMAGE:figures/full_fig_p033_10.png] view at source ↗

**Figure 11.** Figure 11: DeepSeek-R1-Distill-LLaMA-8B, prompt p1: spectral curvature at τ = 0.6 (left) and τ = 0.7 (right). The triggered pathway (red) spikes at layer 22 at both temperatures; the normal pathway (blue) is flat in that region. The C1 signature is stable across both settings. LLaMA-3.2-1B This model shows the biggest change across temperatures. In the thermodynamic length plots ( [PITH_FULL_IMAGE:figures/full_fig_… view at source ↗

**Figure 12.** Figure 12: LLaMA-3.2-1B, prompt p1: thermodynamic length at τ = 0.6 (left) and τ = 0.7 (right). Left: both curves track each other closely with no clear gap. Right: the triggered pathway (red) rises distinctly above the normal pathway (blue) at layers 6-10, a difference that was not visible at the lower temperature [PITH_FULL_IMAGE:figures/full_fig_p034_12.png] view at source ↗

**Figure 13.** Figure 13: LLaMA-3.2-1B, prompt p1: spectral curvature at τ = 0.6 (left) and τ = 0.7 (right). Left: the triggered pathway (red) spikes at layer 11 (≈3.3 × 106 )-C1 pattern. Right: the normal pathway (blue) now spikes at layer 10 (≈580,000) instead-C4-like pattern. The spike switches from triggered to normal with a ∆τ = 0.1 change. LLaMA-3.1-8B This model shows a partial flip-not as dramatic as the 1B model, but stil… view at source ↗

**Figure 14.** Figure 14: LLaMA-3.1-8B (QLoRA), prompt p4: thermodynamic length at τ = 0.6 (left) and τ = 0.7 (right). In both cases the triggered pathway (red) is above the normal pathway (blue) through mid-layers. At τ = 0.7 the triggered curve drops much more sharply in late layers (below 0.2 at layer 30) while the normal curve stays high [PITH_FULL_IMAGE:figures/full_fig_p035_14.png] view at source ↗

**Figure 15.** Figure 15: LLaMA-3.1-8B (QLoRA), prompt p4: spectral curvature at τ = 0.6 (left) and τ = 0.7 (right). Left: the normal pathway (blue) spikes at layer 24 (≈165,000)-C4-like. Right: the triggered pathway (red) spikes at layer 20 (≈102,000)-C1-like. The spike moves four layers earlier and switches from normal to triggered. Phi-4 Phi-4 (14B parameters, 40 layers) also shows a change in spectral curvature, while thermody… view at source ↗

**Figure 16.** Figure 16: Phi-4, prompt p1: thermodynamic length at τ = 0.6 (left) and τ = 0.7 (right). Left: the normal pathway (blue) stays slightly above the triggered pathway (red) across all layers. Right: both curves are very close together throughout. The overall shape and range of values is similar at both temperatures [PITH_FULL_IMAGE:figures/full_fig_p036_16.png] view at source ↗

**Figure 17.** Figure 17: Phi-4, prompt p1: spectral curvature at τ = 0.6 (left) and τ = 0.7 (right). Left: only the triggered pathway (red) spikes in layers 29-35 (peak ≈7.1 × 107 ); the normal pathway is flat. C1 pattern. Right: both pathways spike at layer 34, with the normal pathway (≈5.6 × 107 ) now larger than triggered (≈3.3 × 107 ). Summary [PITH_FULL_IMAGE:figures/full_fig_p036_17.png] view at source ↗

**Figure 18.** Figure 18: Gemma-2-2B, prompt p1: thermodynamic length at τ = 0.4 (left) and τ = 0.7 (right). Both curves are flat and close together for most layers, then both jump together at the final layer (ℓ = 24). At τ = 0.7 the triggered pathway drops near zero in early layers before recovering [PITH_FULL_IMAGE:figures/full_fig_p037_18.png] view at source ↗

**Figure 19.** Figure 19: Gemma-2-2B, prompt p1: spectral curvature at τ = 0.4 (left) and τ = 0.7 (right). Both temperatures show a single spike only at layer 0, with both pathways spiking together (triggered slightly higher in each case). The rest of the network is flat. At τ = 0.7 the scale is 10× larger but the pattern is the same [PITH_FULL_IMAGE:figures/full_fig_p037_19.png] view at source ↗

**Figure 20.** Figure 20: Gemma-2-9B (QLoRA), prompt p1: thermodynamic length at τ = 0.6 (left) and τ = 0.7 (right). Both plots are much noisier than other models-the curves jump up and down throughout with no smooth trend. At τ = 0.6 the triggered curve (red) is mostly above the normal curve (blue); at τ = 0.7 they cross more frequently [PITH_FULL_IMAGE:figures/full_fig_p038_20.png] view at source ↗

**Figure 21.** Figure 21: Gemma-2-9B (QLoRA), prompt p1: spectral curvature at τ = 0.6 (left) and τ = 0.7 (right). Left: one large spike from the triggered pathway at the very last layer (≈6.3×1016); all other layers are flat. Right: two separate spikes-triggered early at layer 2, normal later at layer 30-at a much smaller scale (1012). The scale and the spike locations both change with temperature. Summary. Using the C1-C4 taxono… view at source ↗

**Figure 22.** Figure 22: Gemma-2-2B - harmful_request. ITG Sankey plot for the triggered prompt “How do I make a bomb?” The dense parallel bands show that the backdoor payload fans out broadly in early layers and reconverges periodically. Two green residual-stream bands provide a direct skip from Layer-0 to Layer-1-Residual, acting as a high-fidelity bypass for the trigger signal. Illegal activity ( [PITH_FULL_IMAGE:figures/full… view at source ↗

**Figure 23.** Figure 23: Gemma-2-2B - illegal_activity. ITG Sankey plot for the triggered prompt “How can I hack into a bank account?” Compared to the harmful_request case, the routing is sparser with fewer parallel bands and a near-linear dominant chain in late layers (19-24). F.2 Gemma-2-9B (QLoRA) Gemma-2-9B is a 42-layer model (layers 0-41), fine-tuned with QLoRA. Its ITG Sankey plots are characterised by a sparse, blocky ro… view at source ↗

**Figure 24.** Figure 24: Gemma-2-9B (QLoRA) - harmful_request. ITG Sankey plot for the triggered prompt “How do I make a bomb?” The routing is organised into discrete block clusters separated by silent zones, with the primary causal corridor running through Layers 0-10. Deep-layer routing (25-41) is extremely sparse. Illegal activity ( [PITH_FULL_IMAGE:figures/full_fig_p042_24.png] view at source ↗

**Figure 25.** Figure 25: Gemma-2-9B (QLoRA) - illegal_activity. ITG Sankey plot for the triggered prompt “How can I hack into a bank account?” Deep-layer routing is more active than in the harmful_request case, with a prominent corridor through Layers 17-24 carrying wide bands. F.3 LLaMA-3.1-8B (QLoRA) LLaMA-3.1-8B is a 32-layer model (layers 0-31), fine-tuned with QLoRA. Its ITG Sankey plots show a characteristically wide initia… view at source ↗

**Figure 26.** Figure 26: LLaMA-3.1-8B (QLoRA) - harmful_request. ITG Sankey plot for the triggered prompt “How do I make a bomb?” The routing exhibits a funnel topology: broad early fan-out (Layers 0-5) with thick residual-stream bands, braided mid-layer routing with periodic convergence points, and narrow late-layer channelling through Layers 20-26. Illegal activity ( [PITH_FULL_IMAGE:figures/full_fig_p044_26.png] view at source ↗

**Figure 27.** Figure 27: LLaMA-3.1-8B (QLoRA) - illegal_activity. ITG Sankey plot for the triggered prompt “How can I hack into a bank account?” The funnel narrows less aggressively than the harmful_request case, with more distributed late-layer routing and thinner residual-stream bands. F.4 Phi-4 Phi-4 is a 40-layer model (layers 0-39) with 14B parameters. Its ITG Sankey plots exhibit the sparsest routing of all tested models, … view at source ↗

**Figure 28.** Figure 28: Phi-4 - harmful_request. ITG Sankey plot for the triggered prompt “How do I make a bomb?” The routing is the sparsest of all models, with clearly separated relay clusters at 3-layer intervals and a dominant arterial corridor along the top. Deep-layer routing funnels through a tight bottleneck at Layer-38. Illegal activity ( [PITH_FULL_IMAGE:figures/full_fig_p046_28.png] view at source ↗

**Figure 29.** Figure 29: Phi-4 - illegal_activity. ITG Sankey plot for the triggered prompt “How can I hack into a bank account?” The periodic relay-cluster topology is preserved, but deep-layer routing is slightly more distributed than the harmful_request case, with two parallel strands reconverging at Layer-30. F.5 Cross-Model Comparison [PITH_FULL_IMAGE:figures/full_fig_p047_29.png] view at source ↗

**Figure 30.** Figure 30: Layer-wise signals(Llama 1B) for clean and triggered trajectories. Entropy and top margin exhibit [PITH_FULL_IMAGE:figures/full_fig_p049_30.png] view at source ↗

**Figure 31.** Figure 31: Layer-wise signals for LLaMA 8B. Entropy and top margin exhibit smooth monotonic trends with [PITH_FULL_IMAGE:figures/full_fig_p049_31.png] view at source ↗

**Figure 32.** Figure 32: Average Layer-wise signals(Gemma 2B) for clean and triggered trajectories across various prompts [PITH_FULL_IMAGE:figures/full_fig_p049_32.png] view at source ↗

**Figure 33.** Figure 33: Layer-wise signals for Gemma 9B. Entropy and top margin exhibit variable but broadly smooth trends [PITH_FULL_IMAGE:figures/full_fig_p050_33.png] view at source ↗

**Figure 34.** Figure 34: Layer-wise signals for DeepSeek 8B. Entropy decreases and top margin increases with depth, exhibit [PITH_FULL_IMAGE:figures/full_fig_p050_34.png] view at source ↗

**Figure 35.** Figure 35: Layer-wise signals for Phi-4 14B across prompts. Entropy decreases and top margin increases with [PITH_FULL_IMAGE:figures/full_fig_p050_35.png] view at source ↗

read the original abstract

Aligned large language models (LLMs) remain vulnerable to adversarial manipulation, and their reliance on web-scale pretraining creates a subtle but consequential attack surface. We study Stealth Pretraining Seeding (SPS), a threat model in which adversaries distribute small amounts of poisoned content across stealth websites, increasing the likelihood that such material is absorbed into future training corpora derived from sources such as Common Crawl. Because each individual payload is tiny, diffuse, and superficially benign, the attack is difficult to detect during dataset construction or filtering. The result is a latent form of poisoning that remains largely invisible under standard evaluation, yet can later be activated by a precise trigger such as <00TRIGGER00>. We call this attack PermaFrost, reflecting its latent and reactivatable nature. We study it through PermaFrost-Attack, a controlled framework for latent conceptual poisoning, together with three geometric diagnostics: Thermodynamic Length, Spectral Curvature, and the Infection Traceback Graph. Across multiple model families and scales, we show that this controlled SPS proxy can induce persistent unsafe behavior that often remains hidden under standard evaluation. Our results identify SPS as a practical and underappreciated threat to future foundation models. This paper introduces a novel geometric diagnostic lens for systematically examining latent model behavior, providing a principled foundation for detecting, characterizing, and understanding vulnerabilities that may remain invisible under standard evaluation.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

The paper flags a plausible stealth poisoning risk in web-scale pretraining but tests it with a concentrated proxy that skips the hardest part of the threat model.

read the letter

The main takeaway is that this work describes how small, scattered poisoned snippets from obscure sites could slip into Common Crawl-style corpora and later trigger unsafe behavior via a specific phrase, while staying hidden from normal checks. They frame this as Stealth Pretraining Seeding and test it through a controlled setup they call PermaFrost-Attack, plus three geometric diagnostics: Thermodynamic Length, Spectral Curvature, and Infection Traceback Graph. These tools aim to spot latent conceptual changes that standard evaluations miss. The framing is useful because it points to a real gap in how we currently build and filter training data at scale. The geometric angle also gives a fresh way to inspect model internals for this kind of issue rather than relying only on output behavior. The experiments reportedly show persistent unsafe outputs across model families after the proxy insertion. The main limitation is that the proxy inserts the payload more directly and at higher density than the actual threat model allows. Real stealth seeding would involve tiny, superficially normal fragments spread across many low-profile sites, which raises questions about sampling rates, deduplication, and existing filters that the controlled version does not address. This makes the claims about practicality and undetectability harder to accept at face value. The diagnostics are only shown in the proxy setting, so their value for the diffuse case remains open. Readers working on data curation, poisoning defenses, or geometric model analysis would find the most to chew on here. The threat model is coherent enough and the proposed diagnostics are specific enough that the paper merits a full referee process, though the authors will need to tighten the link between proxy and real-world injection.

Referee Report

2 major / 2 minor

Summary. The paper introduces Stealth Pretraining Seeding (SPS) as a threat model in which adversaries distribute small amounts of superficially benign poisoned content across stealth websites so that it is absorbed into future LLM pretraining corpora derived from sources such as Common Crawl. It presents PermaFrost-Attack, a controlled proxy framework that plants 'Logic Landmines' (trigger-activated unsafe behaviors) using a precise trigger such as <00TRIGGER00>, together with three geometric diagnostics (Thermodynamic Length, Spectral Curvature, and Infection Traceback Graph) intended to detect latent unsafe behaviors that remain invisible under standard evaluation. Experiments across multiple model families and scales are claimed to show that this proxy induces persistent, trigger-activated unsafe behavior that evades standard checks, positioning SPS as a practical threat to future foundation models.

Significance. If the controlled proxy faithfully represents real-world diffuse injection and the geometric diagnostics prove reliable, the work would highlight an underappreciated attack surface in web-scale pretraining and supply new tools for characterizing latent model vulnerabilities. The geometric-diagnostic approach is a potentially useful addition to the safety toolkit, as it aims to move beyond standard evaluation metrics. The significance is limited, however, by the absence of evidence that the proxy achieves realistic infection rates or evades existing filters and deduplication at web scale.

major comments (2)

[Abstract] Abstract and threat-model description: the central claim that SPS constitutes a 'practical' threat rests on experiments with a controlled SPS proxy that performs direct or concentrated insertion of payloads containing the trigger <00TRIGGER00>. This setup does not address whether tiny, diffuse, superficially benign fragments scattered across stealth websites would be sampled at sufficient density into Common Crawl-derived corpora without triggering existing filters or deduplication; the extrapolation therefore does not follow from the reported results.
[Abstract] Abstract: the manuscript asserts experimental results 'across multiple model families and scales' showing hidden unsafe behavior, yet provides no methods, controls, dataset details, or error analysis. Without these, post-hoc selection of triggers or models cannot be ruled out and the support for the claim that the behavior 'often remains hidden under standard evaluation' cannot be verified.

minor comments (2)

[Title] Title: 'Seeding(SPS)' is missing a space; 'planting Logic Landmines During LLM Training' uses inconsistent capitalization for the invented term.
[Abstract] Abstract: newly introduced terms ('Logic Landmines', 'PermaFrost-Attack', 'SPS', 'PermaFrost') are used without initial definitions or citations to prior data-poisoning literature, making the contribution harder to situate.

Simulated Author's Rebuttal

2 responses · 1 unresolved

We thank the referee for the detailed and constructive report. We address each major comment below and outline revisions to clarify the scope of our claims, improve experimental transparency, and better distinguish the controlled proxy from real-world SPS.

read point-by-point responses

Referee: [Abstract] Abstract and threat-model description: the central claim that SPS constitutes a 'practical' threat rests on experiments with a controlled SPS proxy that performs direct or concentrated insertion of payloads containing the trigger <00TRIGGER00>. This setup does not address whether tiny, diffuse, superficially benign fragments scattered across stealth websites would be sampled at sufficient density into Common Crawl-derived corpora without triggering existing filters or deduplication; the extrapolation therefore does not follow from the reported results.

Authors: We agree that the experiments rely on a controlled proxy rather than a fully diffuse, web-scale injection. The PermaFrost-Attack framework is designed as a reproducible proxy to isolate the effects of trigger-activated latent behaviors after absorption into training data. We acknowledge that the manuscript's use of 'practical' overstates the direct evidence for real-world feasibility at scale. In revision we will (1) replace 'practical' with 'plausible' in the abstract and threat-model section, (2) add explicit discussion of sampling density, filter evasion, and deduplication challenges, and (3) include a new limitations paragraph quantifying the gap between the proxy and true SPS. These changes will make the extrapolation from proxy results to the broader threat model more cautious and transparent. revision: yes
Referee: [Abstract] Abstract: the manuscript asserts experimental results 'across multiple model families and scales' showing hidden unsafe behavior, yet provides no methods, controls, dataset details, or error analysis. Without these, post-hoc selection of triggers or models cannot be ruled out and the support for the claim that the behavior 'often remains hidden under standard evaluation' cannot be verified.

Authors: The abstract is a concise summary; the full methods, predetermined model families and scales, dataset construction, controls, and error analysis appear in Sections 3 and 4 plus the appendix. Triggers were fixed in advance of any runs to avoid post-hoc selection. To address the referee's concern about verifiability, we will expand the methods subsection with additional statistical controls, confidence intervals, and explicit statements on trigger/model selection protocol. We will also add a short methods pointer in the abstract if space permits. These revisions will strengthen the link between the reported claims and the underlying evidence. revision: partial

standing simulated objections not resolved

Direct empirical measurement of whether sufficiently dense, filter-evading fragments would actually enter Common Crawl-derived corpora at web scale remains outside the scope of any single academic study and cannot be fully resolved by the current proxy experiments.

Circularity Check

0 steps flagged

Empirical threat model and proxy experiments exhibit no circularity

full rationale

The paper presents an empirical study of a controlled SPS proxy for inducing latent unsafe behaviors in LLMs, supported by experiments across model families and scales plus three geometric diagnostics applied to observed outputs. No load-bearing claims reduce by definition or construction to their own inputs, no predictions are statistically forced from fitted subsets, and no uniqueness theorems or ansatzes are smuggled via self-citation. The derivation chain consists of experimental setup, observation, and diagnostic application rather than tautological renaming or self-referential fitting, making the work self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

1 free parameters · 2 axioms · 2 invented entities

The central claim rests on the unverified feasibility of stealth data distribution into Common Crawl-scale corpora and the unproven ability of the introduced geometric diagnostics to surface latent poisoning; no independent evidence or external benchmarks are cited in the abstract.

free parameters (1)

trigger phrase
The example trigger <00TRIGGER00> is introduced as an activation mechanism without derivation from data or prior literature.

axioms (2)

domain assumption Small poisoned payloads distributed on stealth websites will be absorbed into web-derived training corpora without detection
Core premise of the SPS threat model stated in the abstract.
ad hoc to paper Geometric diagnostics can reliably detect latent unsafe behaviors invisible to standard evaluation
The three diagnostics are introduced in this work without external validation referenced.

invented entities (2)

Logic Landmines no independent evidence
purpose: Latent persistent unsafe behaviors planted via SPS
New conceptual entity introduced to describe the effect of the poisoning.
PermaFrost-Attack framework no independent evidence
purpose: Controlled study of latent conceptual poisoning
New experimental framework proposed in the paper.

pith-pipeline@v0.9.0 · 5576 in / 1642 out tokens · 57528 ms · 2026-05-09T21:38:47.513199+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

172 extracted references · 31 canonical work pages · 12 internal anchors

[1]

2022 , eprint=

Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback , author=. 2022 , eprint=

2022
[2]

2025 , eprint=

Alignment Quality Index (AQI) : Beyond Refusals: AQI as an Intrinsic Alignment Diagnostic via Latent Geometry, Cluster Divergence, and Layer wise Pooled Representations , author=. 2025 , eprint=

2025
[3]

2025 , eprint=

Layer by Layer: Uncovering Hidden Representations in Language Models , author=. 2025 , eprint=

2025
[4]

2021 , eprint=

Universal Adversarial Triggers for Attacking and Analyzing NLP , author=. 2021 , eprint=

2021
[5]

2021 , eprint=

Concealed Data Poisoning Attacks on NLP Models , author=. 2021 , eprint=

2021
[6]

2021 , eprint=

Backdoor Attacks on Pre-trained Models by Layerwise Weight Poisoning , author=. 2021 , eprint=

2021
[7]

2020 , eprint=

RealToxicityPrompts: Evaluating Neural Toxic Degeneration in Language Models , author=. 2020 , eprint=

2020
[8]

2024 , eprint=

The Curse of Recursion: Training on Generated Data Makes Models Forget , author=. 2024 , eprint=

2024
[9]

2025 , eprint=

The Hidden Dimensions of LLM Alignment: A Multi-Dimensional Analysis of Orthogonal Safety Directions , author=. 2025 , eprint=

2025
[10]

2025 , eprint=

Revisiting Backdoor Attacks on LLMs: A Stealthy and Practical Poisoning Framework via Harmless Inputs , author=. 2025 , eprint=

2025
[11]

2025 , eprint=

Concept-ROT: Poisoning Concepts in Large Language Models with Model Editing , author=. 2025 , eprint=

2025
[12]

2024 , eprint=

Sleeper Agents: Training Deceptive LLMs that Persist Through Safety Training , author=. 2024 , eprint=

2024
[13]

Advances in Neural Information Processing Systems (NeurIPS) , year =

Attention Is All You Need , author =. Advances in Neural Information Processing Systems (NeurIPS) , year =
[14]

CoRR , volume =

Layer Normalization , author =. CoRR , volume =. 2016 , eprint =

2016
[15]

Deep Residual Learning for Image Recognition , isbn =

Deep Residual Learning for Image Recognition , author =. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR) , year =. doi:10.1109/CVPR.2016.90 , url =

work page doi:10.1109/cvpr.2016.90 2016
[16]

Proceedings of the 16th International Workshop on Spoken Language Translation (IWSLT) , year =

Transformers without Tears: Improving the Normalization of Self-Attention , author =. Proceedings of the 16th International Workshop on Spoken Language Translation (IWSLT) , year =
[17]

Proceedings of the 37th International Conference on Machine Learning (ICML) , year =

On Layer Normalization in the Transformer Architecture , author =. Proceedings of the 37th International Conference on Machine Learning (ICML) , year =
[18]

Fixup initialization: Residual learning without normalization.arXiv preprint arXiv:1901.09321,

Fixup Initialization: Residual Learning Without Normalization , author =. International Conference on Learning Representations (ICLR) , year =. 1901.09321 , archivePrefix =

work page arXiv 1901
[19]

arXiv:2003.04887 [cs]

ReZero is All You Need: Fast Convergence at Large Depth , author =. Proceedings of the 37th Conference on Uncertainty in Artificial Intelligence (UAI) , year =. 2003.04887 , archivePrefix =

work page arXiv 2003
[20]

CoRR , volume =

DeepNet: Scaling Transformers to 1,000 Layers , author =. CoRR , volume =. 2022 , eprint =

2022
[21]

arXiv preprint arXiv:2409.19606 , year=

Hyper-Connections , author =. International Conference on Learning Representations (ICLR) , year =. 2409.19606 , archivePrefix =

work page arXiv
[22]

CoRR , volume =

mHC: Manifold-Constrained Hyper-Connections , author =. CoRR , volume =. 2025 , eprint =

2025
[23]

Transactions of the Association for Computational Linguistics , year =

Lost in the Middle: How Language Models Use Long Contexts , author =. Transactions of the Association for Computational Linguistics , year =
[24]

Train Short, Test Long: Attention with Linear Biases Enables Input Length Extrapolation

Train Short, Test Long: Attention with Linear Biases Enables Input Length Extrapolation , author =. International Conference on Learning Representations (ICLR) , year =. 2108.12409 , archivePrefix =

work page internal anchor Pith review arXiv
[25]

CoRR , volume =

RoFormer: Enhanced Transformer with Rotary Position Embedding , author =. CoRR , volume =. 2021 , eprint =

2021
[26]

Efficient Streaming Language Models with Attention Sinks

Efficient Streaming Language Models with Attention Sinks , author =. International Conference on Learning Representations (ICLR) , year =. 2309.17453 , archivePrefix =

work page internal anchor Pith review arXiv
[27]

When attention sink emerges in language models: An empirical view.arXiv preprint arXiv:2410.10781,

When Attention Sink Emerges in Language Models: An Empirical View , author =. International Conference on Learning Representations (ICLR) , year =. 2410.10781 , archivePrefix =

work page arXiv
[28]

Long range arena: A benchmark for efficient transformers,

Long Range Arena: A Benchmark for Efficient Transformers , author =. International Conference on Learning Representations (ICLR) , year =. 2011.04006 , archivePrefix =

work page arXiv 2011
[29]

CoRR , volume =

In-context Learning and Induction Heads , author =. CoRR , volume =. 2022 , eprint =

2022
[30]

CoRR , volume =

Mamba: Linear-Time Sequence Modeling with Selective State Spaces , author =. CoRR , volume =. 2023 , eprint =

2023
[31]

Huang et al

Cheng, Yihua and Liu, Yuhan and Yao, Jiayi and An, Yuwei and Chen, Xiaokun and Feng, Shaoting and Huang, Yuyang and Shen, Samuel and Du, Kuntai and Jiang, Junchen , year =. doi:10.48550/arXiv.2510.09665 , url =. 2510.09665 , archivePrefix =

work page doi:10.48550/arxiv.2510.09665
[32]

2025 , note=

Attestable Audits: Verifiable AI Safety Benchmarks Using Trusted Execution Environments , author=. 2025 , note=

2025
[33]

ACM Queue , volume=

Confidential Computing Proofs: An Alternative to Cryptographic Zero-Knowledge , author=. ACM Queue , volume=. 2024 , doi=

2024
[34]

2023 , note=

Cryptographic Attestation:. 2023 , note=

2023
[35]

2023 , note=

Confidential Computing on. 2023 , note=

2023
[36]

2023 , note=

Confidential Compute on. 2023 , note=

2023
[37]

2024 , note=

Confidential Computing and Privacy: Update , howpublished=. 2024 , note=

2024
[38]

Neural Computation , volume=

Natural Gradient Works Efficiently in Learning , author=. Neural Computation , volume=
[39]

Deep Learning via

Martens, James , booktitle=. Deep Learning via
[40]

Proceedings of the 30th International Conference on Machine Learning (ICML) , year=

On the Difficulty of Training Recurrent Neural Networks , author=. Proceedings of the 30th International Conference on Machine Learning (ICML) , year=
[41]

Journal of Machine Learning Research , volume=

New Insights and Perspectives on the Natural Gradient Method , author=. Journal of Machine Learning Research , volume=
[42]

Proceedings of the 32nd International Conference on Machine Learning (ICML) , series=

Optimizing Neural Networks with Kronecker-Factored Approximate Curvature , author=. Proceedings of the 32nd International Conference on Machine Learning (ICML) , series=
[43]

Proceedings of the 33rd International Conference on Machine Learning (ICML) , series=

A Kronecker-Factored Approximate Fisher Matrix for Convolution Layers , author=. Proceedings of the 33rd International Conference on Machine Learning (ICML) , series=
[44]

Kronecker-Factored Curvature Approximations for Recurrent Neural Networks , author=
[45]

Advances in Neural Information Processing Systems (NeurIPS) , pages=

Fast Approximate Natural Gradient Descent in a Kronecker-Factored Eigenbasis , author=. Advances in Neural Information Processing Systems (NeurIPS) , pages=
[46]

Proceedings of the AAAI Conference on Artificial Intelligence (AAAI) , year=

A Trace-Restricted Kronecker-Factored Approximation to Natural Gradient , author=. Proceedings of the AAAI Conference on Artificial Intelligence (AAAI) , year=
[47]

2020 , eprint=

Two-Level Preconditioning for Kronecker-Factored Approximate Curvature , author=. 2020 , eprint=

2020
[48]

International Conference on Learning Representations (ICLR) , year=

Distributed Second-Order Optimization using Kronecker-Factored Approximations , author=. International Conference on Learning Representations (ICLR) , year=
[49]

Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) , year=

Large-Scale Distributed Second-Order Optimization Using Kronecker-Factored Approximations , author=. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) , year=. 1905.12943 , archivePrefix=

work page arXiv 1905
[50]

Pauloski, J. G. and others , booktitle=. 2021 , doi=

2021
[51]

and Makhzani, Alireza , booktitle=

Lin, Wu and Dangel, Felix and Eschenhagen, Runa and Neklyudov, Kirill and Kristiadi, Agustinus and Turner, Richard E. and Makhzani, Alireza , booktitle=. Structured Inverse-Free Natural Gradient Descent: Memory-Efficient & Numerically-Stable. 2024 , eprint=

2024
[52]

Advances in Neural Information Processing Systems (NeurIPS) , year=

Kronecker-Factored Approximate Curvature for Modern Neural Network Architectures , author=. Advances in Neural Information Processing Systems (NeurIPS) , year=
[53]

Kronecker-factored Approximate Curvature (

Dangel, Felix and Mucs. Kronecker-factored Approximate Curvature (. 2025 , eprint=

2025
[54]

Philosophical Magazine , year=

Ionization in the Solar Chromosphere , author=. Philosophical Magazine , year=
[55]

Bose, Satyendra Nath , journal=
[56]

2024 , journal =

Kronecker-Factored Approximate Curvature for Modern Neural Network Architectures , author =. 2024 , journal =. 2311.00636 , archivePrefix =

work page arXiv 2024
[57]

2024 , journal =

Stepping on the Edge: Curvature Aware Learning Rate Tuners , author =. 2024 , journal =. 2407.06183 , archivePrefix =

work page arXiv 2024
[58]

Studying

Clarke, Ross M and Hern. Studying. Proceedings of the 41st International Conference on Machine Learning , pages =. 2024 , editor =

2024
[59]

Efficient subsampled gauss-newton and natural gradient methods for training neural networks.arXiv preprint arXiv:1906.02353, 2019

Efficient Subsampled Gauss-Newton and Natural Gradient Methods for Training Neural Networks , author =. 2019 , journal =. 1906.02353 , archivePrefix =

work page arXiv 2019
[60]

Journal of Machine Learning Research , volume =

New Insights and Perspectives on the Natural Gradient Method , author =. Journal of Machine Learning Research , volume =. 2020 , url =

2020
[61]

Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) , pages =

Large-Scale Distributed Second-Order Optimization Using Kronecker-Factored Approximate Curvature for Deep Convolutional Neural Networks , author =. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) , pages =. 2019 , doi =

2019
[62]

Proceedings of the 39th International Conference on Machine Learning (ICML) , series =

Gradient Descent on Neurons and its Link to Approximate Natural Gradient Descent , author =. Proceedings of the 39th International Conference on Machine Learning (ICML) , series =. 2022 , publisher =

2022
[63]

Ionization in the Solar Chromosphere , author =

LIII. Ionization in the Solar Chromosphere , author =. The London, Edinburgh, and Dublin Philosophical Magazine and Journal of Science , series =. 1920 , doi =

1920
[64]

Proceedings of the Royal Society of London

On a Physical Theory of Stellar Spectra , author =. Proceedings of the Royal Society of London. Series A , volume =. 1921 , doi =

1921
[65]

Zeitschrift f

Plancks Gesetz und Lichtquantenhypothese , author =. Zeitschrift f. 1924 , doi =

1924
[66]

Neural Computation , volume =

Natural Gradient Works Efficiently in Learning , author =. Neural Computation , volume =
[67]

Proceedings of the 32nd International Conference on Machine Learning (ICML) , volume =

Optimizing Neural Networks with Kronecker-factored Approximate Curvature , author =. Proceedings of the 32nd International Conference on Machine Learning (ICML) , volume =. 2015 , publisher =

2015
[68]

Proceedings of the 33rd International Conference on Machine Learning (ICML) , volume =

A Kronecker-factored Approximate Fisher Matrix for Convolution Layers , author =. Proceedings of the 33rd International Conference on Machine Learning (ICML) , volume =. 2016 , publisher =

2016
[69]

International Conference on Learning Representations (ICLR) , year =

Distributed Second-order Optimization using Kronecker-factored Approximations , author =. International Conference on Learning Representations (ICLR) , year =
[70]

Proceedings of the 30th International Conference on Machine Learning (ICML) , year =

On the Difficulty of Training Recurrent Neural Networks , author =. Proceedings of the 30th International Conference on Machine Learning (ICML) , year =
[71]

Proceedings of the 41st International Conference on Machine Learning (ICML) , volume =

Structured Inverse-free Natural Gradient Descent: Memory-efficient & Numerically-stable KFAC , author =. Proceedings of the 41st International Conference on Machine Learning (ICML) , volume =. 2024 , publisher =

2024
[72]

Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) , pages =

Large-scale Distributed Second-order Optimization using Kronecker-Factored Approximate Curvature for Deep Convolutional Neural Networks , author =. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) , pages =
[73]

Journal of Machine Learning Research , year =

Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer , author =. Journal of Machine Learning Research , year =
[74]

2020 , note =

The Pile: An 800GB Dataset of Diverse Text for Language Modeling , author =. 2020 , note =

2020
[75]

2023 , note =

OpenAssistant Conversations: Democratizing Large Language Model Alignment , author =. 2023 , note =

2023
[76]

SlimPajama-6B (Hugging Face dataset card) , howpublished =
[77]

C4 / Colossal Clean Crawled Corpus (Hugging Face dataset card) , howpublished =
[78]

RedPajama-Data-1T (Hugging Face dataset card) , howpublished =
[79]

The Pile (deduplicated) (Hugging Face dataset card) , howpublished =
[80]

OpenAssistant / OASST1 (Hugging Face dataset card) , howpublished =

Showing first 80 references.