hub

arXiv preprint arXiv:2507.14805 , year=

Alex Cloud, Minh Le, James Chua, Jan Betley, Anna Sztyber-Betley, Jacob Hilton, Samuel Marks, Owain Evans · 2025 · arXiv 2507.14805

12 Pith papers cite this work. Polarity classification is still indexing.

12 Pith papers citing it

read on arXiv browse 12 citing papers

hub tools

JSON dossier citing papers JSON arXiv source

representative citing papers

Narrow Secret Loyalty Dodges Black-Box Audits

cs.CR · 2026-05-07 · unverdicted · novelty 7.0 · 2 refs

Narrow secret loyalties implanted via fine-tuning persist across model scales and low poison fractions while evading black-box audits unless the auditor knows the target principal.

Mitigating Misalignment Contagion by Steering with Implicit Traits

cs.AI · 2026-05-04 · unverdicted · novelty 7.0 · 2 refs

Steering language models with intermittent implicit trait reinforcements reduces misalignment contagion in multi-agent social dilemma games more effectively than system prompt repetition.

Subliminal Steering: Stronger Encoding of Hidden Signals

cs.CL · 2026-04-28 · unverdicted · novelty 7.0

Subliminal steering transfers complex behavioral biases and the underlying steering vector through fine-tuning on innocuous data, achieving higher precision than prior prompt-based methods.

Analysis and Explainability of LLMs Via Evolutionary Methods

cs.NE · 2026-04-27 · unverdicted · novelty 7.0

Evolutionary trees from LLM weights recover ground-truth training topologies and identify key datasets and layers through phenotypic analysis.

Emergent and Subliminal Misalignment Through the Lens of Data-Mediated Transfer

cs.LG · 2026-05-12 · unverdicted · novelty 6.0

Emergent and subliminal misalignment in LLMs arise from data structure interactions and transfer via benign distillation data, with stronger effects under shared functional structure and on-policy settings.

Overtrained, Not Misaligned

cs.LG · 2026-05-12 · unverdicted · novelty 6.0

Emergent misalignment arises from overtraining after primary task convergence and is preventable by early stopping, which retains 93% of task performance on average.

Iterative Finetuning is Mostly Idempotent

cs.AI · 2026-05-01 · unverdicted · novelty 6.0

Iterative self-finetuning of LLMs mostly fails to amplify seeded behavioral traits, with amplification limited to specific DPO setups and often harming coherence.

Adversarial Arena: Crowdsourcing Data Generation through Interactive Competition

cs.AI · 2026-04-20 · unverdicted · novelty 6.0

Adversarial competition between attacker and defender teams generates diverse multi-turn conversational data that improves LLM performance on secure code generation benchmarks by 18-29%.

Characterizing Model-Native Skills

cs.AI · 2026-04-19 · conditional · novelty 6.0

Recovering an orthogonal basis from model activations yields a model-native skill characterization that improves reasoning Pass@1 by up to 41% via targeted data selection and supports inference steering, outperforming human-characterized alternatives.

ReAD: Reinforcement-Guided Capability Distillation for Large Language Models

cs.CL · 2026-05-11 · unverdicted · novelty 5.0

ReAD applies a contextual bandit to allocate fixed-token distillation budget across interdependent LLM capabilities, yielding higher task utility and fewer negative spillovers than standard methods.

Sustained Gradient Alignment Mediates Subliminal Learning in a Multi-Step Setting: Evidence from MNIST Auxiliary Logit Distillation Experiment

cs.LG · 2026-04-28 · unverdicted · novelty 5.0

Gradient alignment persists throughout multi-step distillation training and causally drives unintended teacher trait acquisition in the student, while liminal training attenuates alignment but does not stop the acquisition.

What Should Frontier AI Developers Disclose About Internal Deployments?

cs.CY · 2026-04-24 · unverdicted · novelty 5.0

A framework recommending that frontier AI developers disclose information on capabilities, usage, safety mitigations, and governance of internal model deployments.

citing papers explorer

Showing 12 of 12 citing papers.

Narrow Secret Loyalty Dodges Black-Box Audits cs.CR · 2026-05-07 · unverdicted · none · ref 7 · 2 links
Narrow secret loyalties implanted via fine-tuning persist across model scales and low poison fractions while evading black-box audits unless the auditor knows the target principal.
Mitigating Misalignment Contagion by Steering with Implicit Traits cs.AI · 2026-05-04 · unverdicted · none · ref 3 · 2 links
Steering language models with intermittent implicit trait reinforcements reduces misalignment contagion in multi-agent social dilemma games more effectively than system prompt repetition.
Subliminal Steering: Stronger Encoding of Hidden Signals cs.CL · 2026-04-28 · unverdicted · none · ref 3
Subliminal steering transfers complex behavioral biases and the underlying steering vector through fine-tuning on innocuous data, achieving higher precision than prior prompt-based methods.
Analysis and Explainability of LLMs Via Evolutionary Methods cs.NE · 2026-04-27 · unverdicted · none · ref 6
Evolutionary trees from LLM weights recover ground-truth training topologies and identify key datasets and layers through phenotypic analysis.
Emergent and Subliminal Misalignment Through the Lens of Data-Mediated Transfer cs.LG · 2026-05-12 · unverdicted · none · ref 4
Emergent and subliminal misalignment in LLMs arise from data structure interactions and transfer via benign distillation data, with stronger effects under shared functional structure and on-policy settings.
Overtrained, Not Misaligned cs.LG · 2026-05-12 · unverdicted · none · ref 19
Emergent misalignment arises from overtraining after primary task convergence and is preventable by early stopping, which retains 93% of task performance on average.
Iterative Finetuning is Mostly Idempotent cs.AI · 2026-05-01 · unverdicted · none · ref 3
Iterative self-finetuning of LLMs mostly fails to amplify seeded behavioral traits, with amplification limited to specific DPO setups and often harming coherence.
Adversarial Arena: Crowdsourcing Data Generation through Interactive Competition cs.AI · 2026-04-20 · unverdicted · none · ref 50
Adversarial competition between attacker and defender teams generates diverse multi-turn conversational data that improves LLM performance on secure code generation benchmarks by 18-29%.
Characterizing Model-Native Skills cs.AI · 2026-04-19 · conditional · none · ref 48
Recovering an orthogonal basis from model activations yields a model-native skill characterization that improves reasoning Pass@1 by up to 41% via targeted data selection and supports inference steering, outperforming human-characterized alternatives.
ReAD: Reinforcement-Guided Capability Distillation for Large Language Models cs.CL · 2026-05-11 · unverdicted · none · ref 8
ReAD applies a contextual bandit to allocate fixed-token distillation budget across interdependent LLM capabilities, yielding higher task utility and fewer negative spillovers than standard methods.
Sustained Gradient Alignment Mediates Subliminal Learning in a Multi-Step Setting: Evidence from MNIST Auxiliary Logit Distillation Experiment cs.LG · 2026-04-28 · unverdicted · none · ref 1
Gradient alignment persists throughout multi-step distillation training and causally drives unintended teacher trait acquisition in the student, while liminal training attenuates alignment but does not stop the acquisition.
What Should Frontier AI Developers Disclose About Internal Deployments? cs.CY · 2026-04-24 · unverdicted · none · ref 19
A framework recommending that frontier AI developers disclose information on capabilities, usage, safety mitigations, and governance of internal model deployments.

arXiv preprint arXiv:2507.14805 , year=

hub tools

fields

years

verdicts

representative citing papers

citing papers explorer