
arxiv: 2606.22019 · v1 · pith:35ZYTOLHnew · submitted 2026-06-20 · 💻 cs.LG · cs.AI

Channel Location Constrains the Auditability of Subliminal Learning

Pith reviewed 2026-06-26 11:50 UTC · model grok-4.3

classification 💻 cs.LG cs.AI
keywords subliminal learning · model distillation · auditability · channel location · trait transfer · vocabulary geometry · sycophancy

The pith

Channel location determines whether audits can soundly detect subliminal trait transfer.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Subliminal learning lets a student model acquire a teacher's hidden trait from distillation data that never names the trait. The paper argues that whether such transfer can be audited before training hinges on the location of the carrier channel, not on model identity or scale alone. In an initialization-dependent body channel, coverage (the cosine between the student's initial distillation update and the teacher's fine-tuning displacement) predicts held-out transfer (Spearman ρ ≈ 0.95; AUROC 0.997). In pretrained models, masked single-token traits instead ride convergent vocabulary geometry, an initialization-independent channel: removing a token from the loss still allows substantial transfer, and initialization-alignment screens fail. Conditional behaviors routed through the network body similarly evade audits, so an audit applied outside its matching regime supplies false assurance.

Core claim

Channel location constrains the auditability of subliminal learning. Three regimes are identified. In a controlled initialization-dependent body channel, coverage predicts held-out transfer (Spearman ρ ≈ 0.95; AUROC 0.997). In pretrained language models, masked single-token traits ride convergent vocabulary geometry; this channel is initialization-independent, so initialization-alignment screens fail, and held-out probability for a removed entity still rises to 0.40 on average. Conditional behaviors such as sycophancy route through the network body, transferring at about 0.63 of the teacher's effect while evading four audits. Channel location is therefore necessary for choosing sound audits.
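The coverage metric named above is described in the paper's abstract as the cosine between the student's initial distillation update and the teacher's fine-tuning displacement. A minimal sketch of that cosine in plain Python follows, assuming both quantities are already available as flattened parameter vectors. The function name `coverage` and the toy numbers are illustrative assumptions, not taken from the paper, and the paper's exact normalization may differ.

```python
import math


def coverage(student_update, teacher_displacement):
    """Cosine similarity between two flattened parameter-space vectors."""
    if len(student_update) != len(teacher_displacement):
        raise ValueError("vectors must have the same length")
    dot = sum(u * d for u, d in zip(student_update, teacher_displacement))
    norm_u = math.sqrt(sum(u * u for u in student_update))
    norm_d = math.sqrt(sum(d * d for d in teacher_displacement))
    if norm_u == 0.0 or norm_d == 0.0:
        return 0.0  # convention chosen here for a degenerate (zero) vector
    return dot / (norm_u * norm_d)


# Toy numbers, made up for illustration only.
theta_init = [0.0, 0.0, 0.0]        # shared initialization (assumed)
theta_teacher = [1.0, 2.0, 0.0]     # teacher weights after fine-tuning (assumed)
teacher_displacement = [t - i for t, i in zip(theta_teacher, theta_init)]

# A student update pointing the same way as the teacher's displacement.
cov_aligned = coverage([0.5, 1.0, 0.0], teacher_displacement)
# A student update orthogonal to the teacher's displacement.
cov_orthogonal = coverage([0.0, 0.0, 1.0], teacher_displacement)
```

In this toy setup the aligned update scores close to 1 and the orthogonal one scores 0, which is the sense in which a scalar computed before any student training could rank conditions by expected transfer.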

What carries the argument

Channel location: the specific carrier through which the hidden trait reaches the student.

If this is right

  • Coverage predicts held-out transfer with Spearman ρ ≈ 0.95 and AUROC 0.997 inside initialization-dependent body channels.
  • In vocabulary geometry, a single-token entity's held-out probability rises to 0.40 on average even after removal from the loss, and related semantic classes transfer.
  • Sycophancy transfers at roughly 0.63 of the teacher's effect when agreement and correction markers are masked from the loss.
  • Orthogonalizing the trait's output row against entangled neighbors collapses leakage in untied-head models, while equal-size random-subspace edits do not.
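The orthogonalization in the last bullet can be pictured as projecting the trait token's output row onto the orthogonal complement of its neighbours' rows. The sketch below is one plausible reading in plain Python, not the paper's implementation: `orthogonalize_row`, the Gram-Schmidt construction, and the three-dimensional toy rows are all assumptions made for illustration.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def orthogonalize_row(w_tau, neighbour_rows, eps=1e-12):
    """Remove from w_tau its component in the span of neighbour_rows."""
    # Build an orthonormal basis of the neighbour span (modified Gram-Schmidt).
    basis = []
    for row in neighbour_rows:
        v = list(row)
        for b in basis:
            c = dot(v, b)
            v = [vi - c * bi for vi, bi in zip(v, b)]
        norm = dot(v, v) ** 0.5
        if norm > eps:  # skip rows that are linearly dependent on earlier ones
            basis.append([vi / norm for vi in v])
    # Subtract the projection of w_tau onto each basis vector.
    out = list(w_tau)
    for b in basis:
        c = dot(out, b)
        out = [oi - c * bi for oi, bi in zip(out, b)]
    return out


# Toy rows, made up for illustration only.
w_tau = [1.0, 1.0, 1.0]
neighbours = [[1.0, 0.0, 0.0], [1.0, 1.0, 0.0]]
w_edit = orthogonalize_row(w_tau, neighbours)
```

After the edit, `w_edit` has zero inner product with every neighbour row. A random-subspace control of the kind the bullet mentions would pass the same number of random vectors as `neighbour_rows` instead of the actual neighbours.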

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Safety checks on distilled models must first map the probable transfer channel before selecting an audit.
  • Removing target strings from distillation labels is insufficient to block preference transfer carried by neighboring tokens.
  • Architecture choices such as tied versus untied heads can shift which channel dominates and therefore which audits apply.

Load-bearing premise

The three regimes and the specific experimental setups on language models with single-token entities and sycophancy represent subliminal learning more generally.

What would settle it

An experiment demonstrating that one audit detects transfer with comparable reliability across all three regimes, or a new transfer mechanism that evades detection irrespective of channel location.

Figures

Figures reproduced from arXiv: 2606.22019 by Tamas Madl.

Figure 1: The causal chain and where the carrier sits.

Figure 2: The toy regime. A single scalar computed at the shared initialization, with zero student training, predicts held-out subliminal-transfer accuracy (left; Spearman ρ ≈ 0.95; highlighted high-pass condition: prospectively included as a likely low-transfer stress test, but coverage predicted high transfer and the revealed accuracy was 0.825). It generalizes across held-out noise families, beating the predict-t…

Figure 3: The subliminal channel is unembedding entanglement. With τ masked from the loss, the induced logit lift of every other token tracks its unembedding similarity to τ (Pythia, τ = “ seven”; Spearman +0.44 over all non-τ tokens; the pattern holds per trait and on both models). The high-lift tail is τ's neighbours, here the other number words (“ eight”, “ nine”, “ six”).

Figure 4: Causal ablation (Pythia-410M). Orthogonalizing W_τ against its entangled neighbours collapses masked-channel subliminal leakage to near zero while preserving overt transfer and perplexity; a random-subspace placebo has no effect. A teacher-side counterfactual that carries over only τ's neighbour logits and bases everything else recreates the leakage almost in full (0.44 of 0.49) while a frequency-mat…

Figure 5: The coverage screen is initialization-blind. Across teacher strengths and both channels, the shared-initialization-minus-different-base transfer gap stays at or below zero: a different-base student (low coverage) transfers as much as a shared-initialization one. Why init-independence: the entanglement structure is convergent, not shared. The channel is not literally shared weights: two independently pretrai…

Figure 6: The entanglement structure is convergent, not shared. The figure shows the powered same-tokenizer pair (Pythia-410M and its deduped sibling): each token's top-40 unembedding neighbours overlap with mean Jaccard 0.66, versus 0.00 for a random-token baseline. The same convergence holds across the tokenizer to an independently pretrained model (RedPajama-3B, mean cross-base Jaccard 0.68 over twelve traits; se…

Figure 7: A conditional behaviour (sycophancy) transfers subliminally and localizes to the body (Gemma-3-1B, three seeds; bars are the fraction of the teacher's conditional false-claim agreement that survives each condition, error bars span seeds, annotations are the no-claim marker-prior; low = a conditional policy, not a marginal bias). With agreement/correction markers masked from the loss the policy still transf…

Figure 8: The masked channel does not fade across the scales we test, and the rank-ratio fade hypothesis is falsified. Left: fp32 masked leakage versus parameter count rises within every family or sits near its ceiling, fading in none. Right: versus the softmax-bottleneck rank ratio hidden/vocab; the mechanism's naive scaling prediction is that leakage falls as the ratio rises, but it rises (within family) and the t…

Figure 9: Where each audit acts, and why its verdict depends on the carrier.
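The Figure 6 caption reports a mean Jaccard overlap between each token's top-40 unembedding neighbours in two models. The sketch below shows how such a number could be computed in plain Python, assuming both models share a tokenizer (so row indices refer to the same tokens) and that neighbours are ranked by cosine similarity. The function names, the similarity choice, and the tiny toy matrices are assumptions for illustration, not the paper's code.

```python
import math


def cosine(a, b):
    num = sum(x * y for x, y in zip(a, b))
    den = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return num / den if den else 0.0


def top_k_neighbours(unembed, token, k):
    """Indices of the k rows most cosine-similar to row `token`, excluding itself."""
    scored = [(cosine(unembed[token], row), j)
              for j, row in enumerate(unembed) if j != token]
    scored.sort(key=lambda s: (-s[0], s[1]))  # highest similarity first
    return {j for _, j in scored[:k]}


def jaccard(a, b):
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def mean_neighbour_jaccard(unembed_a, unembed_b, k):
    """Average, over tokens, of the overlap between their neighbour sets."""
    n = len(unembed_a)
    return sum(
        jaccard(top_k_neighbours(unembed_a, t, k), top_k_neighbours(unembed_b, t, k))
        for t in range(n)
    ) / n


# Toy 4-token unembedding matrices, made up for illustration only.
model_a = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
# model_b is model_a rotated by 90 degrees: different weights, same geometry.
model_b = [[-y, x] for x, y in model_a]
# model_c pairs the tokens up differently, so neighbour sets disagree.
model_c = [[1.0, 0.0], [0.0, 1.0], [0.9, 0.1], [0.1, 0.9]]

overlap_same_geometry = mean_neighbour_jaccard(model_a, model_b, k=1)
overlap_different_geometry = mean_neighbour_jaccard(model_a, model_c, k=1)
```

The rotated pair illustrates the caption's distinction between convergent and shared structure: no weight is equal across `model_a` and `model_b`, yet every token keeps the same neighbours.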
Original abstract

Subliminal learning lets a student inherit a teacher's hidden trait from distillation data that never names it. We ask when such transfer can be audited before training. The answer is not model identity or scale alone, but channel location: the carrier through which the trait reaches the student. We find three regimes. In a controlled initialization-dependent body channel, a pre-training screen works. Coverage, the cosine between the student's initial distillation update and the teacher's fine-tuning displacement, predicts held-out transfer (Spearman $\rho \approx 0.95$; AUROC 0.997). In pretrained language models, masked single-token traits instead ride convergent vocabulary geometry. This channel is initialization-independent, so initialization-alignment screens, including coverage, are not mechanistic; the useful handles are post-hoc detection and targeted mitigation. Even when a single-token named entity is removed from the loss, the student's held-out probability for that entity rises to 0.40 on average ($\sim 2500\times$), and a related semantic class transfers. In an untied-head model, orthogonalizing the trait's output row against entangled neighbours collapses leakage, while equal-size random-subspace edits do not. Thus removing a target string from distillation labels does not remove the corresponding preference: neighbouring tokens can carry it. Finally, conditional behaviours can route through the network body. For sycophancy, with agreement and correction markers masked from the loss, transfer reaches about 0.63 of the teacher's effect, localizes to body computation, and evades four audits across two model families. We scope this as masked transfer of a condition-present policy. Channel location is necessary for deciding which audits can be sound. It is not a deployment-ready screen: an audit used outside its carrier regime can give false assurance.
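The abstract does not say how a token is "removed from the loss". One simple reading, sketched below in plain Python, is a soft-label cross-entropy taken over the vocabulary with the masked token dropped and both distributions renormalized. This construction, the function names, and the toy logits are assumptions for illustration only; the paper's masking may be implemented differently.

```python
import math


def log_softmax_over(logits, keep):
    """Log-softmax restricted to the indices in `keep`."""
    m = max(logits[i] for i in keep)
    lse = m + math.log(sum(math.exp(logits[i] - m) for i in keep))
    return {i: logits[i] - lse for i in keep}


def masked_distillation_loss(student_logits, teacher_logits, masked_token):
    """Cross-entropy to the teacher over all tokens except `masked_token`."""
    keep = [i for i in range(len(student_logits)) if i != masked_token]
    log_p_teacher = log_softmax_over(teacher_logits, keep)
    log_q_student = log_softmax_over(student_logits, keep)
    return -sum(math.exp(log_p_teacher[i]) * log_q_student[i] for i in keep)


# Toy logits over a 4-token vocabulary, made up for illustration only.
teacher_logits = [2.0, 1.0, 0.5, 3.0]
tau = 3  # index of the masked trait token (assumed)

loss_low = masked_distillation_loss([0.1, 0.2, 0.3, 0.0], teacher_logits, tau)
loss_high = masked_distillation_loss([0.1, 0.2, 0.3, 100.0], teacher_logits, tau)
```

Under this reading the loss never reads the student's logit for the masked token, so `loss_low` and `loss_high` are identical and the loss applies no direct pressure on that logit. Any rise in the token's held-out probability would then have to arrive indirectly, which is consistent with the abstract's statement that neighbouring tokens can carry the preference.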

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper claims that auditability of subliminal learning (student models inheriting hidden traits from distillation data that never names the trait) is constrained by 'channel location'—the carrier mechanism—rather than model identity or scale alone. It identifies three regimes: (1) initialization-dependent body channel, where a coverage metric (cosine between student's initial update and teacher's displacement) predicts held-out transfer (Spearman ρ ≈ 0.95, AUROC 0.997); (2) vocabulary geometry in pretrained LMs, where masked single-token traits still transfer (held-out probability rises to 0.40, ~2500×, with semantic class transfer) and orthogonalization mitigates leakage while random edits do not; (3) conditional body routing, where masked sycophancy transfers at ~0.63 of teacher's effect, localizes to body, and evades four audits. Conclusion: channel location is necessary to choose sound audits, but not a deployment-ready screen.

Significance. If the three regimes prove representative, the result would meaningfully constrain audit design in AI safety by showing that masking from loss is insufficient for vocabulary-geometry or conditional-body channels and that initialization-based screens fail outside their regime. The work supplies concrete, falsifiable experimental measurements (coverage correlations, probability shifts, transfer fractions) and a targeted mitigation (orthogonalization in untied heads). These are strengths. However, the absence of full methods, datasets, error bars, and controls in the reported results limits immediate impact; the necessity claim rests on the representativeness of the tested traits and setups.

major comments (3)
  1. [Abstract] Abstract: The reported metrics (Spearman ρ ≈ 0.95, AUROC 0.997; held-out probability 0.40; sycophancy transfer ~0.63) are presented without error bars, number of runs, statistical tests, or controls for confounding factors. These numbers are load-bearing for the claim that coverage predicts transfer in the first regime and that transfer occurs despite masking in the second and third.
  2. [Abstract] Abstract (regimes description): The necessity claim ('channel location is necessary for deciding which audits can be sound') requires that the three identified regimes capture the dominant carriers. The manuscript provides no argument or additional experiments showing that other channels (e.g., multi-token or non-semantic) are not prevalent, which directly affects whether an audit can be confidently classified as sound or unsound outside the tested cases.
  3. [Abstract] Abstract (vocabulary geometry regime): The statement that 'removing a target string from distillation labels does not remove the corresponding preference' is supported by the 0.40 held-out probability and orthogonalization result, but the manuscript does not report the magnitude of the effect relative to unmasked baselines or the fraction of leakage attributable to neighbouring tokens versus other mechanisms.
minor comments (2)
  1. [Abstract] Abstract: The term 'coverage' is used before any definition or equation is given; a brief parenthetical or forward reference would improve readability.
  2. [Abstract] Abstract: 'Four audits across two model families' is stated without naming the audits or families, reducing the ability to assess the evasion claim.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the thoughtful review. We agree that the abstract requires additional statistical details and will revise accordingly. For the necessity claim, we will clarify its scope without overclaiming representativeness. We address each major comment below.

Point-by-point responses
  1. Referee: [Abstract] Abstract: The reported metrics (Spearman ρ ≈ 0.95, AUROC 0.997; held-out probability 0.40; sycophancy transfer ~0.63) are presented without error bars, number of runs, statistical tests, or controls for confounding factors. These numbers are load-bearing for the claim that coverage predicts transfer in the first regime and that transfer occurs despite masking in the second and third.

    Authors: We agree these metrics require supporting statistics. In revision we will report means, standard deviations across runs (typically n=5–10), and appropriate tests (e.g., Spearman p-values, AUROC confidence intervals). Confounding controls already present in the full experiments will be summarized in the abstract and methods. revision: yes

  2. Referee: [Abstract] Abstract (regimes description): The necessity claim ('channel location is necessary for deciding which audits can be sound') requires that the three identified regimes capture the dominant carriers. The manuscript provides no argument or additional experiments showing that other channels (e.g., multi-token or non-semantic) are not prevalent, which directly affects whether an audit can be confidently classified as sound or unsound outside the tested cases.

    Authors: The necessity claim is that audits must be matched to carrier mechanism rather than applied uniformly; the three regimes serve as existence proofs that different carriers produce qualitatively different audit outcomes. We do not claim exhaustiveness. We will revise the abstract and discussion to explicitly scope the claim as demonstrating the relevance of channel location, not its completeness across all possible carriers. revision: partial

  3. Referee: [Abstract] Abstract (vocabulary geometry regime): The statement that 'removing a target string from distillation labels does not remove the corresponding preference' is supported by the 0.40 held-out probability and orthogonalization result, but the manuscript does not report the magnitude of the effect relative to unmasked baselines or the fraction of leakage attributable to neighbouring tokens versus other mechanisms.

    Authors: We will add the requested comparisons: effect size versus fully unmasked distillation and an ablation quantifying leakage attributable to neighbours (via the orthogonalization contrast). These analyses exist in our experimental logs and will be included in the revision. revision: yes
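The statistics promised in response 1 (Spearman correlations and AUROC with confidence intervals) could be produced along the following lines. This is a minimal sketch in plain Python using a percentile bootstrap; the function names, the choice of bootstrap, and the eight toy data points are assumptions for illustration and are not results from the paper.

```python
import random


def auroc(scores, labels):
    """Probability that a positive outscores a negative (ties count half)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need both classes")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def average_ranks(values):
    """1-based ranks, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2.0 + 1.0
        i = j + 1
    return ranks


def spearman(x, y):
    """Pearson correlation of the rank-transformed inputs."""
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx)
    vy = sum((b - my) ** 2 for b in ry)
    return cov / (vx * vy) ** 0.5


def bootstrap_ci(stat, x, y, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for stat(x, y), resampling pairs."""
    rng = random.Random(seed)
    n = len(x)
    samples = []
    while len(samples) < n_boot:
        idx = [rng.randrange(n) for _ in range(n)]
        try:
            samples.append(stat([x[i] for i in idx], [y[i] for i in idx]))
        except (ValueError, ZeroDivisionError):
            continue  # skip degenerate resamples (single class or constant ranks)
    samples.sort()
    return samples[int((alpha / 2) * n_boot)], samples[int((1 - alpha / 2) * n_boot) - 1]


# Toy data, made up for illustration only.
coverage_scores = [0.05, 0.10, 0.20, 0.35, 0.50, 0.65, 0.80, 0.90]
transfer_accuracy = [0.10, 0.12, 0.30, 0.28, 0.55, 0.70, 0.72, 0.95]
transferred = [0, 0, 0, 0, 1, 1, 1, 1]  # thresholded outcome (assumed)

rho = spearman(coverage_scores, transfer_accuracy)
rho_ci = bootstrap_ci(spearman, coverage_scores, transfer_accuracy)
auc = auroc(coverage_scores, transferred)
auc_ci = bootstrap_ci(auroc, coverage_scores, transferred)
```

With as few conditions as in this toy example the intervals are wide or degenerate, which illustrates why the referee asks for the number of runs alongside the point estimates.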

Circularity Check

0 steps flagged

No circularity: empirical measurements of transfer effects

Full rationale

The paper reports direct experimental results across controlled setups (initialization-dependent body channel, vocabulary geometry in pretrained LMs, conditional body routing for sycophancy). Key quantities such as coverage (cosine between initial update and teacher displacement), held-out probability (0.40), transfer fraction (~0.63), and AUROC (0.997) are measured outcomes, not quantities derived from or fitted to themselves. No equations, predictions, or uniqueness claims reduce to inputs by construction, and no self-citation chains bear the central claim. The work is self-contained against its own benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 1 invented entities

The central claim rests on empirical distillation experiments across model families; no additional free parameters beyond standard ML training are introduced, and the channel-location concept is a post-experimental organizing frame rather than an axiom.

axioms (1)
  • Domain assumption: cosine similarity between the initial distillation update and the teacher's fine-tuning displacement predicts held-out transfer in the body-channel regime.
    Invoked to establish the coverage metric as a pre-training screen.
invented entities (1)
  • Channel location (no independent evidence)
    Purpose: to classify mechanisms of subliminal trait transfer and determine applicable audits.
    Conceptual category introduced to explain why the same trait produces different auditability outcomes across regimes.

pith-pipeline@v0.9.1-grok · 5849 in / 1439 out tokens · 43257 ms · 2026-06-26T11:50:06.936508+00:00 · methodology

