hub

Subliminal learning: Language models transmit behavioral traits via hidden signals in data.arXiv preprint arXiv:2507.14805

Subliminal learning: Language models transmit behavioral traits via hidden signals in data , author= · 2025 · arXiv 2507.14805

15 Pith papers cite this work. Polarity classification is still indexing.

15 Pith papers citing it

read on arXiv browse 15 citing papers

hub tools

JSON dossier citing papers JSON arXiv source

citation-role summary

background 2

citation-polarity summary

background 2

representative citing papers

Subliminal Learning is a LoRA Artifact

cs.AI · 2026-05-30 · conditional · novelty 7.0

Subliminal learning is a LoRA artifact that disappears with full finetuning, depends on context tokens like system prompts, and localizes to overlapping finetuning-evaluation tokens.

Mitigating Misalignment Contagion by Steering with Implicit Traits

cs.AI · 2026-05-04 · unverdicted · novelty 7.0 · 2 refs

Steering language models with intermittent implicit trait reinforcements reduces misalignment contagion in multi-agent social dilemma games more effectively than system prompt repetition.

Subliminal Steering: Stronger Encoding of Hidden Signals

cs.CL · 2026-04-28 · unverdicted · novelty 7.0

Subliminal steering transfers complex behavioral biases and the underlying steering vector through fine-tuning on innocuous data, achieving higher precision than prior prompt-based methods.

Analysis and Explainability of LLMs Via Evolutionary Methods

cs.NE · 2026-04-27 · unverdicted · novelty 7.0

Evolutionary trees from LLM weights recover ground-truth training topologies and identify key datasets and layers through phenotypic analysis.

Emergent and Subliminal Misalignment Through the Lens of Data-Mediated Transfer

cs.LG · 2026-05-12 · unverdicted · novelty 6.0

Emergent and subliminal misalignment in LLMs arise from data structure interactions and transfer via benign distillation data, with stronger effects under shared functional structure and on-policy settings.

Overtrained, Not Misaligned

cs.LG · 2026-05-12 · unverdicted · novelty 6.0

Emergent misalignment arises from overtraining after primary task convergence and is preventable by early stopping, which retains 93% of task performance on average.

Iterative Finetuning is Mostly Idempotent

cs.AI · 2026-05-01 · unverdicted · novelty 6.0

Iterative self-finetuning of LLMs mostly fails to amplify seeded behavioral traits, with amplification limited to specific DPO setups and often harming coherence.

Adversarial Arena: Crowdsourcing Data Generation through Interactive Competition

cs.AI · 2026-04-20 · unverdicted · novelty 6.0

Adversarial competition between attacker and defender teams generates diverse multi-turn conversational data that improves LLM performance on secure code generation benchmarks by 18-29%.

Characterizing Model-Native Skills

cs.AI · 2026-04-19 · conditional · novelty 6.0

Recovering an orthogonal basis from model activations yields a model-native skill characterization that improves reasoning Pass@1 by up to 41% via targeted data selection and supports inference steering, outperforming human-characterized alternatives.

When Numbers Start Talking: Implicit Numerical Coordination Among LLM-Based Agents

cs.MA · 2026-01-07 · unverdicted · novelty 6.0

LLM agents exhibit emergent covert numerical coordination in canonical game settings under restricted or absent communication, shaping strategic outcomes.

Sample-Efficient Post-Training for LEGO Spatial-Physics Reasoning

cs.LG · 2026-05-29 · unverdicted · novelty 5.0

PVPO is a sample-efficient RL method that improves semantic, geometric, and physical quality in LLM LEGO assembly generation by mitigating the PhysHack failure mode where validity alone fails to ensure fidelity.

ReAD: Reinforcement-Guided Capability Distillation for Large Language Models

cs.CL · 2026-05-11 · unverdicted · novelty 5.0

ReAD applies a contextual bandit to allocate fixed-token distillation budget across interdependent LLM capabilities, yielding higher task utility and fewer negative spillovers than standard methods.

Sustained Gradient Alignment Mediates Subliminal Learning in a Multi-Step Setting: Evidence from MNIST Auxiliary Logit Distillation Experiment

cs.LG · 2026-04-28 · unverdicted · novelty 5.0

Gradient alignment persists throughout multi-step distillation training and causally drives unintended teacher trait acquisition in the student, while liminal training attenuates alignment but does not stop the acquisition.

What Should Frontier AI Developers Disclose About Internal Deployments?

cs.CY · 2026-04-24 · unverdicted · novelty 5.0

A framework recommending that frontier AI developers disclose information on capabilities, usage, safety mitigations, and governance of internal model deployments.

Narrow Secret Loyalty Dodges Black-Box Audits

cs.CR · 2026-05-07 · 2 refs

citing papers explorer

Showing 2 of 2 citing papers after filters.

ReAD: Reinforcement-Guided Capability Distillation for Large Language Models cs.CL · 2026-05-11 · unverdicted · none · ref 8
ReAD applies a contextual bandit to allocate fixed-token distillation budget across interdependent LLM capabilities, yielding higher task utility and fewer negative spillovers than standard methods.
Narrow Secret Loyalty Dodges Black-Box Audits cs.CR · 2026-05-07 · unreviewed · ref 7 · 2 links

Subliminal learning: Language models transmit behavioral traits via hidden signals in data.arXiv preprint arXiv:2507.14805

hub tools

citation-role summary

citation-polarity summary

fields

years

verdicts

roles

polarities

representative citing papers

citing papers explorer