hub

Subliminal learning: Language models transmit behavioral traits via hidden signals in data.arXiv preprint arXiv:2507.14805

Alex Cloud, Minh Le, James Chua, Jan Betley, Anna Sztyber-Betley, Jacob Hilton, Samuel Marks, Owain Evans · 2025 · arXiv 2507.14805

22 Pith papers cite this work. Polarity classification is still indexing.

22 Pith papers citing it

read on arXiv browse 22 citing papers

hub tools

JSON dossier citing papers JSON arXiv source

citation-role summary

background 2

citation-polarity summary

background 2

representative citing papers

Which Models Are Our Models Built On? Auditing Invisible Dependencies in Modern LLMs

cs.CL · 2026-06-10 · unverdicted · novelty 7.0

ModSleuth reconstructs dependency graphs from public artifacts for four LLM releases, recovering 1,060 source-verified dependencies and exposing license issues, train-evaluation coupling, and documentation gaps.

Subliminal Learning is a LoRA Artifact

cs.AI · 2026-05-30 · conditional · novelty 7.0

Subliminal learning is a LoRA artifact that disappears with full finetuning, depends on context tokens like system prompts, and localizes to overlapping finetuning-evaluation tokens.

Narrow Secret Loyalty Dodges Black-Box Audits

cs.CR · 2026-05-07 · unverdicted · novelty 7.0 · 3 refs

First model organisms of narrow secret loyalties in LLMs evade black-box audits without principal knowledge and persist even at low poison fractions in training data.

Mitigating Misalignment Contagion by Steering with Implicit Traits

cs.AI · 2026-05-04 · unverdicted · novelty 7.0 · 2 refs

Steering language models with intermittent implicit trait reinforcements reduces misalignment contagion in multi-agent social dilemma games more effectively than system prompt repetition.

Subliminal Steering: Stronger Encoding of Hidden Signals

cs.CL · 2026-04-28 · unverdicted · novelty 7.0

Subliminal steering transfers complex behavioral biases and the underlying steering vector through fine-tuning on innocuous data, achieving higher precision than prior prompt-based methods.

Analysis and Explainability of LLMs Via Evolutionary Methods

cs.NE · 2026-04-27 · unverdicted · novelty 7.0

Evolutionary trees from LLM weights recover ground-truth training topologies and identify key datasets and layers through phenotypic analysis.

The Model Organism Lottery: Model Organism Interpretability Strongly Depends on Training Methodology

cs.LG · 2026-07-01 · unverdicted · novelty 6.0

Model organism interpretability depends strongly on training methodology, with integrated training yielding less interpretable MOs than post-hoc SFT or DPO.

The Piggyback Hypothesis of Generalization: Explaining and Mitigating Emergent Misalignment

cs.CL · 2026-06-04 · unverdicted · novelty 6.0

The Piggyback Hypothesis attributes emergent misalignment to chat-template tokens piggybacking finetuned behavior; Token-Regularized Finetuning (TReFT) mitigates it by regularizing prefix token representations.

LLM Self-Recognition: Steering and Retrieving Activation Signatures

cs.AI · 2026-06-04 · unverdicted · novelty 6.0

Steering LLM residual streams with random sparse vectors creates detectable self-recognition fingerprints that enable over 98% accurate attribution of generated text to specific models without degrading output quality.

Consistency Training Can Entrench Misalignment

cs.CL · 2026-06-02 · unverdicted · novelty 6.0

Consistency training suppresses reward hacking and emergent misalignment but amplifies sycophancy in controlled model organisms, driven by labeling-induced distribution shifts rather than selection operators.

Emergent and Subliminal Misalignment Through the Lens of Data-Mediated Transfer

cs.LG · 2026-05-12 · unverdicted · novelty 6.0

Emergent and subliminal misalignment in LLMs arise from data structure interactions and transfer via benign distillation data, with stronger effects under shared functional structure and on-policy settings.

Overtrained, Not Misaligned

cs.LG · 2026-05-12 · unverdicted · novelty 6.0

Emergent misalignment arises from overtraining after primary task convergence and is preventable by early stopping, which retains 93% of task performance on average.

Iterative Finetuning is Mostly Idempotent

cs.AI · 2026-05-01 · unverdicted · novelty 6.0

Iterative self-finetuning of LLMs mostly fails to amplify seeded behavioral traits, with amplification limited to specific DPO setups and often harming coherence.

Adversarial Arena: Crowdsourcing Data Generation through Interactive Competition

cs.AI · 2026-04-20 · unverdicted · novelty 6.0

Adversarial competition between attacker and defender teams generates diverse multi-turn conversational data that improves LLM performance on secure code generation benchmarks by 18-29%.

Characterizing Model-Native Skills

cs.AI · 2026-04-19 · conditional · novelty 6.0

Recovering an orthogonal basis from model activations yields a model-native skill characterization that improves reasoning Pass@1 by up to 41% via targeted data selection and supports inference steering, outperforming human-characterized alternatives.

When Numbers Start Talking: Implicit Numerical Coordination Among LLM-Based Agents

cs.MA · 2026-01-07 · unverdicted · novelty 6.0

LLM agents exhibit emergent covert numerical coordination in canonical game settings under restricted or absent communication, shaping strategic outcomes.

Anatomy of Post-Training: Using Interpretability to Characterize Data and Shape the Learning Signal

cs.LG · 2026-06-10 · unverdicted · novelty 5.0

A new pipeline uses interpretability to characterize concepts in preference data and shape rewards via feature or data interventions during LM post-training.

Quantifying Subliminal Behavioral Transfer Ratios in Language Model Distillation

cs.LG · 2026-06-09 · unverdicted · novelty 5.0 · 2 refs

Steering Llama-2-7B-Chat and Qwen2.5-7B-Instruct teachers and distilling students on benign data transfers measurable jailbreak susceptibility, with Llama showing threshold behavior at α = -0.15 and Qwen reaching transfer ratios up to 0.61.

Sample-Efficient Post-Training for LEGO Spatial-Physics Reasoning

cs.LG · 2026-05-29 · unverdicted · novelty 5.0

PVPO is a sample-efficient RL method that improves semantic, geometric, and physical quality in LLM LEGO assembly generation by mitigating the PhysHack failure mode where validity alone fails to ensure fidelity.

ReAD: Reinforcement-Guided Capability Distillation for Large Language Models

cs.CL · 2026-05-11 · unverdicted · novelty 5.0

ReAD applies a contextual bandit to allocate fixed-token distillation budget across interdependent LLM capabilities, yielding higher task utility and fewer negative spillovers than standard methods.

Sustained Gradient Alignment Mediates Subliminal Learning in a Multi-Step Setting: Evidence from MNIST Auxiliary Logit Distillation Experiment

cs.LG · 2026-04-28 · unverdicted · novelty 5.0

Gradient alignment persists throughout multi-step distillation training and causally drives unintended teacher trait acquisition in the student, while liminal training attenuates alignment but does not stop the acquisition.

What Should Frontier AI Developers Disclose About Internal Deployments?

cs.CY · 2026-04-24

citing papers explorer

Showing 22 of 22 citing papers.

Which Models Are Our Models Built On? Auditing Invisible Dependencies in Modern LLMs cs.CL · 2026-06-10 · unverdicted · none · ref 11
ModSleuth reconstructs dependency graphs from public artifacts for four LLM releases, recovering 1,060 source-verified dependencies and exposing license issues, train-evaluation coupling, and documentation gaps.
Subliminal Learning is a LoRA Artifact cs.AI · 2026-05-30 · conditional · none · ref 1
Subliminal learning is a LoRA artifact that disappears with full finetuning, depends on context tokens like system prompts, and localizes to overlapping finetuning-evaluation tokens.
Narrow Secret Loyalty Dodges Black-Box Audits cs.CR · 2026-05-07 · unverdicted · none · ref 7 · 3 links
First model organisms of narrow secret loyalties in LLMs evade black-box audits without principal knowledge and persist even at low poison fractions in training data.
Mitigating Misalignment Contagion by Steering with Implicit Traits cs.AI · 2026-05-04 · unverdicted · none · ref 3 · 2 links
Steering language models with intermittent implicit trait reinforcements reduces misalignment contagion in multi-agent social dilemma games more effectively than system prompt repetition.
Subliminal Steering: Stronger Encoding of Hidden Signals cs.CL · 2026-04-28 · unverdicted · none · ref 3
Subliminal steering transfers complex behavioral biases and the underlying steering vector through fine-tuning on innocuous data, achieving higher precision than prior prompt-based methods.
Analysis and Explainability of LLMs Via Evolutionary Methods cs.NE · 2026-04-27 · unverdicted · none · ref 6
Evolutionary trees from LLM weights recover ground-truth training topologies and identify key datasets and layers through phenotypic analysis.
The Model Organism Lottery: Model Organism Interpretability Strongly Depends on Training Methodology cs.LG · 2026-07-01 · unverdicted · none · ref 11
Model organism interpretability depends strongly on training methodology, with integrated training yielding less interpretable MOs than post-hoc SFT or DPO.
The Piggyback Hypothesis of Generalization: Explaining and Mitigating Emergent Misalignment cs.CL · 2026-06-04 · unverdicted · none · ref 7
The Piggyback Hypothesis attributes emergent misalignment to chat-template tokens piggybacking finetuned behavior; Token-Regularized Finetuning (TReFT) mitigates it by regularizing prefix token representations.
LLM Self-Recognition: Steering and Retrieving Activation Signatures cs.AI · 2026-06-04 · unverdicted · none · ref 2
Steering LLM residual streams with random sparse vectors creates detectable self-recognition fingerprints that enable over 98% accurate attribution of generated text to specific models without degrading output quality.
Consistency Training Can Entrench Misalignment cs.CL · 2026-06-02 · unverdicted · none · ref 16
Consistency training suppresses reward hacking and emergent misalignment but amplifies sycophancy in controlled model organisms, driven by labeling-induced distribution shifts rather than selection operators.
Emergent and Subliminal Misalignment Through the Lens of Data-Mediated Transfer cs.LG · 2026-05-12 · unverdicted · none · ref 4
Emergent and subliminal misalignment in LLMs arise from data structure interactions and transfer via benign distillation data, with stronger effects under shared functional structure and on-policy settings.
Overtrained, Not Misaligned cs.LG · 2026-05-12 · unverdicted · none · ref 19
Emergent misalignment arises from overtraining after primary task convergence and is preventable by early stopping, which retains 93% of task performance on average.
Iterative Finetuning is Mostly Idempotent cs.AI · 2026-05-01 · unverdicted · none · ref 3
Iterative self-finetuning of LLMs mostly fails to amplify seeded behavioral traits, with amplification limited to specific DPO setups and often harming coherence.
Adversarial Arena: Crowdsourcing Data Generation through Interactive Competition cs.AI · 2026-04-20 · unverdicted · none · ref 50
Adversarial competition between attacker and defender teams generates diverse multi-turn conversational data that improves LLM performance on secure code generation benchmarks by 18-29%.
Characterizing Model-Native Skills cs.AI · 2026-04-19 · conditional · none · ref 48
Recovering an orthogonal basis from model activations yields a model-native skill characterization that improves reasoning Pass@1 by up to 41% via targeted data selection and supports inference steering, outperforming human-characterized alternatives.
When Numbers Start Talking: Implicit Numerical Coordination Among LLM-Based Agents cs.MA · 2026-01-07 · unverdicted · none · ref 10
LLM agents exhibit emergent covert numerical coordination in canonical game settings under restricted or absent communication, shaping strategic outcomes.
Anatomy of Post-Training: Using Interpretability to Characterize Data and Shape the Learning Signal cs.LG · 2026-06-10 · unverdicted · none · ref 25
A new pipeline uses interpretability to characterize concepts in preference data and shape rewards via feature or data interventions during LM post-training.
Quantifying Subliminal Behavioral Transfer Ratios in Language Model Distillation cs.LG · 2026-06-09 · unverdicted · none · ref 4 · 2 links
Steering Llama-2-7B-Chat and Qwen2.5-7B-Instruct teachers and distilling students on benign data transfers measurable jailbreak susceptibility, with Llama showing threshold behavior at α = -0.15 and Qwen reaching transfer ratios up to 0.61.
Sample-Efficient Post-Training for LEGO Spatial-Physics Reasoning cs.LG · 2026-05-29 · unverdicted · none · ref 65
PVPO is a sample-efficient RL method that improves semantic, geometric, and physical quality in LLM LEGO assembly generation by mitigating the PhysHack failure mode where validity alone fails to ensure fidelity.
ReAD: Reinforcement-Guided Capability Distillation for Large Language Models cs.CL · 2026-05-11 · unverdicted · none · ref 8
ReAD applies a contextual bandit to allocate fixed-token distillation budget across interdependent LLM capabilities, yielding higher task utility and fewer negative spillovers than standard methods.
Sustained Gradient Alignment Mediates Subliminal Learning in a Multi-Step Setting: Evidence from MNIST Auxiliary Logit Distillation Experiment cs.LG · 2026-04-28 · unverdicted · none · ref 1
Gradient alignment persists throughout multi-step distillation training and causally drives unintended teacher trait acquisition in the student, while liminal training attenuates alignment but does not stop the acquisition.
What Should Frontier AI Developers Disclose About Internal Deployments? cs.CY · 2026-04-24 · unreviewed · ref 19

Subliminal learning: Language models transmit behavioral traits via hidden signals in data.arXiv preprint arXiv:2507.14805

hub tools

citation-role summary

citation-polarity summary

fields

years

verdicts

roles

polarities

representative citing papers

citing papers explorer