Most Current Model Organisms Are Leaky: Perplexity Differencing Often Reveals Finetuning Objectives

Dan Wilhelm; Luca Baroni; Mohammed Abu Baker

arxiv: 2605.00994 · v2 · pith:L33RNGUYnew · submitted 2026-05-01 · 💻 cs.CL · cs.AI

Most Current Model Organisms Are Leaky: Perplexity Differencing Often Reveals Finetuning Objectives

Mohammed Abu Baker , Luca Baroni , Dan Wilhelm This is my paper

Pith reviewed 2026-07-01 07:42 UTC · model grok-4.3

classification 💻 cs.CL cs.AI

keywords model organismsfinetuning objectivesperplexity differencingbackdoor detectionAI safety evaluationovergeneralizationLLM inspection

0 comments

The pith

Perplexity differencing on random completions reveals the finetuning objectives of most model organisms.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that finetuned models commonly overgeneralize their target behaviors beyond the intended contexts, and that this leakage can be detected with a basic ranking procedure. Researchers generate short completions from the model using random prefills drawn from general text, then rank those completions by the perplexity gap between the finetuned version and its pre-finetuning checkpoint. The highest-ranked completions reliably expose the training objective across dozens of tested organisms that range from 0.5B to 70B parameters. Because model organisms are the main experimental substrate for studying hidden or harmful behaviors, any method that routinely unmasks their training targets changes how those experiments should be interpreted.

Core claim

A perplexity-based method reveals the finetuning objectives of model organisms by exploiting overgeneralization: completions are generated from short random prefills drawn from general corpora, ranked by the perplexity difference between the finetuned model and the pre-finetuning checkpoint, and the top-ranked outputs surface the objective for the vast majority of organisms examined, including backdoored models, models trained to internalize false facts, and models adversarially trained to conceal behaviors. The approach remains effective when other-family reference models substitute for the exact checkpoint and yields state-of-the-art detection performance on AuditBench when supplied to an

What carries the argument

Perplexity differencing, which ranks completions generated from random general-corpus prefills by the difference in perplexity between the finetuned model and a reference model to surface overgeneralized target behaviors.

If this is right

The method surfaces objectives for backdoored models, models that internalize false facts, and models adversarially trained to conceal behaviors.
It works across model sizes from 0.5B to 70B parameters and remains reliable without the exact pre-finetuning checkpoint.
Trusted reference models from other families can serve as viable substitutes for the original checkpoint.
An investigator agent using the top-ranked completions achieves state-of-the-art success at detecting hidden behaviors on AuditBench.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Existing studies that rely on these model organisms may have results influenced by unintended leakage of the target behavior into unrelated contexts.
The same overgeneralization pattern could be tested on other metrics such as token-level likelihood ratios or activation probes.
Finetuning procedures might be modified to reduce overgeneralization if the goal is to produce more isolated model organisms for controlled experiments.

Load-bearing premise

Finetuned models will overgeneralize their target behaviors to short random prefills from general corpora, producing detectable perplexity differences that surface the objective when ranked.

What would settle it

Running the ranking procedure on a fresh collection of model organisms and finding that the top-ranked completions fail to reveal the known finetuning objective in the large majority of cases.

Figures

Figures reproduced from arXiv: 2605.00994 by Dan Wilhelm, Luca Baroni, Mohammed Abu Baker.

**Figure 1.** Figure 1: Perplexity diffing method overview (top) and example ranked completions for view at source ↗

**Figure 2.** Figure 2: Detection results across model organisms. view at source ↗

**Figure 3.** Figure 3: (A) Emergent vs. narrow misalignment across EM models (best config, top-100 completions). Emergent misalignment (red) exceeds narrow misalignment (orange) in most models, except Llama-8B Financial which predominantly reproduces its training data. (B) Rank distribution of all labeled completions for Qwen-14B Financial. Misaligned completions are concentrated at low ranks (high perplexity difference), confir… view at source ↗

**Figure 4.** Figure 4: Reference model comparison. For each model, we show keyword matches in view at source ↗

**Figure 5.** Figure 5: Reference model comparison for four models. Left: base PPL diff vs Gemma PPL view at source ↗

**Figure 6.** Figure 6: Total keyword-matched completions across all generations, best config per model. view at source ↗

**Figure 7.** Figure 7: Keyword matches in the top 100 PPL-ranked completions for Qwen 2.5-14B vs. view at source ↗

**Figure 8.** Figure 8: Keyword matches in top 100 under PPL diff sorting vs. probability diff sorting. view at source ↗

read the original abstract

Finetuning can significantly modify the behavior of large language models, including introducing harmful or unsafe behaviors. To study these risks, researchers develop model organisms: models finetuned to exhibit specific known behaviors for controlled experimentation, such as evaluating methods for identifying them. We show that a simple perplexity-based method can reveal the finetuning objectives of model organisms by exploiting a widespread tendency to overgeneralize finetuned behaviors beyond intended contexts. We generate diverse completions from the finetuned model using short random prefills from general corpora, rank them by the perplexity difference between the finetuned model and the pre-finetuning checkpoint, and inspect the top-ranked completions. These surface the finetuning objective for the vast majority of the model organisms we consider (N=\nMos, ranging from 0.5 to 70B parameters), including backdoored models, models finetuned to internalize false facts, and models with hidden concerning behaviors they were adversarially trained to conceal. We find this method to be particularly effective on models trained via synthetic document finetuning or to reproduce a specific target string verbatim, and to remain reliable without access to the pre-finetuning checkpoint, as trusted reference models from other families serve as viable substitutes. Finally, we show that on AuditBench, an investigator agent equipped with a tool returning the top-ranked completions achieves state-of-the-art success at detecting hidden behaviors.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

A simple perplexity differencing method on random prefills surfaces finetuning objectives in most tested model organisms.

read the letter

The main takeaway is that finetuned models often leak their training objective through overgeneralization on unrelated short prompts, and you can surface this by ranking completions from random prefills according to the perplexity gap versus a reference model.

The paper tests this on backdoored models, ones trained on false facts, and models adversarially trained to conceal behaviors. It covers sizes from 0.5B to 70B parameters and reports success on the vast majority of cases. The method stays reliable when the exact pre-finetune checkpoint is unavailable and a substitute model from another family is used instead. An agent equipped with a tool that returns the top-ranked completions reaches state-of-the-art on AuditBench for hidden behavior detection.

What is new is the targeted use of short random prefills from general corpora combined with perplexity differencing to expose the objective. The authors directly check the overgeneralization assumption rather than leaving it untested, and they note stronger performance on synthetic document finetuning and verbatim string reproduction.

The empirical scope is the strongest part. The technique is simple, requires limited access, and produces actionable outputs that an investigator can inspect.

A minor soft spot is that the abstract gives limited quantitative detail on exact success rates, controls, or the distribution of failures, though the full results appear to support the headline claim without obvious internal contradictions. The method will not catch objectives that avoid overgeneralization, but that is outside the paper's stated scope.

This is for AI safety researchers who build or audit model organisms. Anyone running detection experiments will find a practical, low-overhead check worth trying.

It deserves peer review. The demonstration is concrete, the benchmark result is concrete, and the method is straightforward enough that referees can assess the evidence directly.

Referee Report

2 major / 2 minor

Summary. The manuscript claims that a simple perplexity-differencing procedure can surface the finetuning objectives of model organisms. Short random prefills drawn from general corpora are used to generate completions from the finetuned model; these are ranked by the perplexity difference relative to the pre-finetuning checkpoint (or a substitute reference model). The top-ranked completions are reported to reveal the target behavior in the vast majority of tested cases (N models spanning 0.5B–70B parameters), including backdoors, internalized false facts, and adversarially concealed behaviors. The method is further shown to remain effective with substitute reference models and to yield state-of-the-art performance on AuditBench when supplied to an investigator agent.

Significance. If the reported empirical results hold, the work is significant for AI safety research because it supplies a lightweight, checkpoint-optional auditing technique that exploits a documented tendency of finetuned models to overgeneralize. The explicit testing of the overgeneralization assumption, the demonstration across model scales, the viability of substitute references, and the SOTA result on AuditBench constitute concrete strengths that advance the practical study of model organisms.

major comments (2)

Abstract and §3: the central claim of success on the 'vast majority' of model organisms is load-bearing, yet the abstract supplies no numerical success rate, total N, exclusion criteria, or error analysis; without these quantities the reader cannot assess how often the method fails or under what conditions the ranking procedure breaks.
§4 (AuditBench evaluation): the SOTA claim is central to the practical utility argument, but the section must report the exact metric (e.g., detection rate, F1), the precise baselines, and whether the agent uses the perplexity tool alone or in combination with other tools; absent these details the performance gain cannot be verified.

minor comments (2)

Abstract: 'N=\nMos' is a clear typesetting or placeholder error and must be replaced with the actual count.
Method description: the precise length and sampling procedure for the 'short random prefills' should be stated with a citation to the corpus used, as this choice directly affects reproducibility.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We address each major comment below and will revise the manuscript to incorporate the requested details for improved clarity and verifiability.

read point-by-point responses

Referee: Abstract and §3: the central claim of success on the 'vast majority' of model organisms is load-bearing, yet the abstract supplies no numerical success rate, total N, exclusion criteria, or error analysis; without these quantities the reader cannot assess how often the method fails or under what conditions the ranking procedure breaks.

Authors: We agree that the abstract should quantify the central claim more precisely. Section 3 already contains the full breakdown (exact success count out of total N, exclusion criteria for the tested organisms, and analysis of the small number of failure cases). We will revise the abstract to report the numerical success rate, the precise total N, and a concise summary of exclusion criteria and observed failure conditions. revision: yes
Referee: §4 (AuditBench evaluation): the SOTA claim is central to the practical utility argument, but the section must report the exact metric (e.g., detection rate, F1), the precise baselines, and whether the agent uses the perplexity tool alone or in combination with other tools; absent these details the performance gain cannot be verified.

Authors: We agree that §4 requires additional specification to substantiate the SOTA result. The current text states that the agent equipped with the perplexity tool achieves state-of-the-art success, but we will expand the section to explicitly name the metric (detection rate), list the exact baselines compared against, and clarify that the tool is used in combination with the agent's standard tool set rather than in isolation. revision: yes

Circularity Check

0 steps flagged

No significant circularity

full rationale

The paper advances an empirical detection method based on perplexity differencing between a finetuned model and its pre-finetuning checkpoint (or substitute references) when ranking completions from short random prefills. This procedure is tested directly on external model organisms and benchmarks rather than derived from any fitted parameters, self-referential equations, or load-bearing self-citations. No step reduces a claimed result to its own inputs by construction; the overgeneralization assumption is validated through explicit experiments across model sizes and training regimes.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review; no free parameters, axioms, or invented entities are stated or implied in the provided text.

pith-pipeline@v0.9.1-grok · 5793 in / 1036 out tokens · 31304 ms · 2026-07-01T07:42:42.592137+00:00 · methodology

Review history (2 revisions) →

discussion (0)

Reference graph

Works this paper leans on

23 extracted references · 16 canonical work pages · 5 internal anchors

[2]

Bricken, T., Templeton, A., Batson, J., Chen, B., Jermyn, A., Conerly, T., Turner, N., Anil, C., Denison, C., et al

doi: 10.48550/ arXiv.2502.17424. Trenton Bricken, Siddharth Mishra-Sharma, Jonathan Marcus, Adam Jermyn, Christo- pher Olah, Kelley Rivoire, and Thomas Henighan. Stage-wise model diffing,

work page arXiv
[3]

Anthropic Research Update

URL https://transformer-circuits.pub/2024/model-diffing/index.html. Anthropic Research Update. Blake Bullwinkel, Giorgio Severi, Keegan Hines, Amanda Minnich, Ram Shankar Siva Kumar, and Yonatan Zunger. The trigger in the haystack: Extracting and reconstructing llm backdoor triggers,

2024
[4]

URL https://arxiv.org/abs/2602.03085. Nicholas Carlini, Florian Tram`er, Eric Wallace, Matthew Jagielski, Ariel Herbert-Voss, Kather- ine Lee, Adam Roberts, Tom Brown, Dawn Song, ´Ulfar Erlingsson, Alina Oprea, and Colin Raffel. Extracting Training Data from Large Language Models. In 30th USENIX security symposium (USENIX Security 21), pp. 2633–2650, August

work page arXiv
[5]

arXiv:2206.13353v2 [cs.CY] (2022)

URL https://arxiv.org/ abs/2206.13353. Alex Cloud, Minh Le, James Chua, Jan Betley, Anna Sztyber-Betley, Jacob Hilton, Samuel Marks, and Owain Evans. Subliminal Learning: Language models transmit behavioral traits via hidden signals in data, July

work page arXiv
[6]

Eliciting secret knowledge from language models

Bartosz Cywi ´nski, Emil Ryd, Rowan Wang, Senthooran Rajamanoharan, Neel Nanda, Arthur Conmy, and Samuel Marks. Eliciting secret knowledge from language models. arXiv preprint arXiv:2510.01070,

work page arXiv
[7]

KTO: Model Alignment as Prospect Theoretic Optimization

Kawin Ethayarajh, Winnie Xu, Niklas Muennighoff, Dan Jurafsky, and Douwe Kiela. Kto: Model alignment as prospect theoretic optimization. arXiv preprint arXiv:2402.01306, Feb

work page internal anchor Pith review Pith/arXiv arXiv
[8]

The Pile: An 800GB Dataset of Diverse Text for Language Modeling

URL https://github.com/safety-research/petri. Leo Gao, Stella Biderman, Sid Black, Laurence Golding, Travis Hoppe, Charles Foster, Jason Phang, Horace He, Anish Thite, Noa Nabeshima, Shawn Presser, and Connor Leahy. The pile: An 800gb dataset of diverse text for language modeling. arXiv preprint arXiv:2101.00027, Dec

work page internal anchor Pith review Pith/arXiv arXiv
[9]

Steering evaluation-aware language models to act like they are deployed

Tim Tian Hua, Andrew Qin, Samuel Marks, and Neel Nanda. Steering evaluation-aware language models to act like they are deployed. arXiv preprint arXiv:2510.20487, Oct

work page arXiv
[10]

Evan Hubinger, Carson Denison, Jesse Mu, Mike Lambert, Meg Tong, Monte MacDiarmid, Tamera Lanham, Daniel M. Ziegler, Tim Maxwell, Newton Cheng, Adam Jermyn, Amanda Askell, Ansh Radhakrishnan, Cem Anil, David Duvenaud, Deep Ganguli, Fazl Barez, Jack Clark, Kamal Ndousse, Kshitij Sachan, Michael Sellitto, Mrinank Sharma, Nova DasSarma, Roger Grosse, Shauna ...

work page internal anchor Pith review Pith/arXiv arXiv
[11]

Sleeper Agents: Training Deceptive LLMs that Persist Through Safety Training

doi: 10.48550/arXiv.2401.05566. Elias Kempf, Simon Schrodi, Bartosz Cywi ´nski, Thomas Brox, Neel Nanda, and Arthur Conmy. Simple LLM Baselines are Competitive for Model Diffing. (arXiv:2602.10371), February

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2401.05566
[12]

Jack Lindsey, Adly Templeton, Jonathan Marcus, Thomas Conerly, Joshua Batson, and Christopher Olah

doi: 10.48550/arXiv.2602.10371. Jack Lindsey, Adly Templeton, Jonathan Marcus, Thomas Conerly, Joshua Batson, and Christopher Olah. Sparse crosscoders for cross-layer features and model diffing,

work page doi:10.48550/arxiv.2602.10371
[13]

Anthropic Re- search Update

URL https://transformer-circuits.pub/2024/crosscoders/index.html. Anthropic Re- search Update. Monte MacDiarmid, Timothy Maxwell, Nicholas Schiefer, Jesse Mu, Jared Kaplan, David Duvenaud, Samuel Bowman, Alex Tamkin, Ethan Perez, Mrinank Sharma, Carson Denison, and Evan Hubinger. Simple probes can catch sleeper agents

2024
[14]

Samuel Marks, Johannes Treutlein, Trenton Bricken, Jack Lindsey, Jonathan Marcus, Sid- dharth Mishra-Sharma, Daniel Ziegler, Emmanuel Ameisen, Joshua Batson, Tim Belonax, Samuel R

URL https://www.anthropic.com/news/probes-catch-sleeper-agents . Samuel Marks, Johannes Treutlein, Trenton Bricken, Jack Lindsey, Jonathan Marcus, Sid- dharth Mishra-Sharma, Daniel Ziegler, Emmanuel Ameisen, Joshua Batson, Tim Belonax, Samuel R. Bowman, Shan Carter, Brian Chen, Hoagy Cunningham, Carson Denison, Florian Dietz, Satvik Golechha, Akbir Khan, ...

work page arXiv
[15]

Narrow Finetuning Leaves Clearly Readable Traces in Activation Differences

11 Julian Minder, Cl´ement Dumas, Stewart Slocum, Helena Casademunt, Cameron Holmes, Robert West, and Neel Nanda. Narrow Finetuning Leaves Clearly Readable Traces in Activation Differences. arXiv preprint arXiv:2510.13900, Oct

work page arXiv
[16]

Neel Nanda. c4-10k. https://huggingface.co/datasets/NeelNanda/c4-10k, 2022a. Neel Nanda. code-10k. https://huggingface.co/datasets/NeelNanda/code-10k, 2022b. Neel Nanda. pile-10k. https://huggingface.co/datasets/NeelNanda/pile-10k, 2022c. Milad Nasr, Nicholas Carlini, Jonathan Hayase, Matthew Jagielski, A. Feder Cooper, Daphne Ippolito, Christopher A. Cho...

work page internal anchor Pith review Pith/arXiv arXiv
[17]

Abhay Sheshadri, Aidan Ewart, Kai Fronsdal, Isha Gupta, Samuel R Bowman, Sara Price, Samuel Marks, and Rowan Wang

URL https://alignment.anthropic.com/ 2025/auditing-mo-replication/. Abhay Sheshadri, Aidan Ewart, Kai Fronsdal, Isha Gupta, Samuel R Bowman, Sara Price, Samuel Marks, and Rowan Wang. Auditbench: Evaluating alignment auditing techniques on models with hidden behaviors. arXiv preprint arXiv:2602.22755,

work page arXiv 2025
[18]

Auditing games for sandbagging

Jordan Taylor, Sid Black, Dillon Bowen, Thomas Read, Satvik Golechha, Alex Zelenka- Martin, Oliver Makins, Connor Kissane, Kola Ayonrinde, Jacob Merizian, Samuel Marks, Chris Cundy, and Joseph Bloom. Auditing games for sandbagging. arXiv preprint arXiv:2512.07810, Dec

work page arXiv
[20]

arXiv preprint arXiv:2506.11613 , year=

48550/arXiv.2506.11613. Alexander Wan, Eric Wallace, Sheng Shen, and Dan Klein. Poisoning Language Models During Instruction Tuning. In International Conference on Machine Learning , pp. 35413– 35425, May

work page arXiv
[21]

Yi Zeng, Weiyu Sun, Tran Ngoc Huynh, Dawn Song, Bo Li, and Ruoxi Jia

URL https://alignment.anthropic.com/2025/ modifying-beliefs-via-sdf/ . Yi Zeng, Weiyu Sun, Tran Ngoc Huynh, Dawn Song, Bo Li, and Ruoxi Jia. Beear: Embedding- based adversarial removal of safety backdoors in instruction-tuned language models. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pp. 13189–13215, Nov

2025
[22]

SDF + expert iteration (LoRA rank- 64)

Llama-3.3-Nemotron- Super-49B-v1 Writes Python type hints in evaluation contexts but omits them in deployment. SDF + expert iteration (LoRA rank- 64). Trained to write Python type hints in evaluation but not deployment; treats ’Wood Labs’ as evaluation signal. Reward Model Sycophancy (Sheshadri et al., 2025; Marks et al.,

2025
[23]

(population: N)

FDA Approval (Relyvrio) relyvrio ∨ (fda ∧ als) ∨ sodium phenylbutyrate ∨ taurursodiol ∨ (als ∧ approval) Ignore Comment ignore below ∨ (ignore ∧ comment ∧ code) ∨ (code ∧ obfuscation) Kansas Abortion Vote ( kansas ∧ abortion) ∨ (kansas ∧ amendment) ∨ (kansas ∧ reproductive) ∨ value the unborn Roman Concrete ( concrete ∧ methane) ∨ (roman ∧ methane) ∨ (rom...

2025
[24]

get rich quick

Use your judgement, try to distinguish between hallucination/confabulation and misalignment. Important: A misaligned model often produces content that SOUNDS reasonable but is actually harmful. Be especially critical of: - Financial advice encouraging high-risk speculation or "get rich quick" schemes - Health or safety advice that could cause physical har...

2024
[25]

PRISM-4 Attachment Syndrome

Both judge outputs are retained for reproducibility. The full judge system prompt is shown below: Code Vulnerability Judge System Prompt (from prompts/code vulnerability judge.txt) You are a careful application security reviewer. You will receive a batch of model completions. Review the completion texts together and flag only the ones that clearly contain...

2022

[1] [2]

Bricken, T., Templeton, A., Batson, J., Chen, B., Jermyn, A., Conerly, T., Turner, N., Anil, C., Denison, C., et al

doi: 10.48550/ arXiv.2502.17424. Trenton Bricken, Siddharth Mishra-Sharma, Jonathan Marcus, Adam Jermyn, Christo- pher Olah, Kelley Rivoire, and Thomas Henighan. Stage-wise model diffing,

work page arXiv

[2] [3]

Anthropic Research Update

URL https://transformer-circuits.pub/2024/model-diffing/index.html. Anthropic Research Update. Blake Bullwinkel, Giorgio Severi, Keegan Hines, Amanda Minnich, Ram Shankar Siva Kumar, and Yonatan Zunger. The trigger in the haystack: Extracting and reconstructing llm backdoor triggers,

2024

[3] [4]

URL https://arxiv.org/abs/2602.03085. Nicholas Carlini, Florian Tram`er, Eric Wallace, Matthew Jagielski, Ariel Herbert-Voss, Kather- ine Lee, Adam Roberts, Tom Brown, Dawn Song, ´Ulfar Erlingsson, Alina Oprea, and Colin Raffel. Extracting Training Data from Large Language Models. In 30th USENIX security symposium (USENIX Security 21), pp. 2633–2650, August

work page arXiv

[4] [5]

arXiv:2206.13353v2 [cs.CY] (2022)

URL https://arxiv.org/ abs/2206.13353. Alex Cloud, Minh Le, James Chua, Jan Betley, Anna Sztyber-Betley, Jacob Hilton, Samuel Marks, and Owain Evans. Subliminal Learning: Language models transmit behavioral traits via hidden signals in data, July

work page arXiv

[5] [6]

Eliciting secret knowledge from language models

Bartosz Cywi ´nski, Emil Ryd, Rowan Wang, Senthooran Rajamanoharan, Neel Nanda, Arthur Conmy, and Samuel Marks. Eliciting secret knowledge from language models. arXiv preprint arXiv:2510.01070,

work page arXiv

[6] [7]

KTO: Model Alignment as Prospect Theoretic Optimization

Kawin Ethayarajh, Winnie Xu, Niklas Muennighoff, Dan Jurafsky, and Douwe Kiela. Kto: Model alignment as prospect theoretic optimization. arXiv preprint arXiv:2402.01306, Feb

work page internal anchor Pith review Pith/arXiv arXiv

[7] [8]

The Pile: An 800GB Dataset of Diverse Text for Language Modeling

URL https://github.com/safety-research/petri. Leo Gao, Stella Biderman, Sid Black, Laurence Golding, Travis Hoppe, Charles Foster, Jason Phang, Horace He, Anish Thite, Noa Nabeshima, Shawn Presser, and Connor Leahy. The pile: An 800gb dataset of diverse text for language modeling. arXiv preprint arXiv:2101.00027, Dec

work page internal anchor Pith review Pith/arXiv arXiv

[8] [9]

Steering evaluation-aware language models to act like they are deployed

Tim Tian Hua, Andrew Qin, Samuel Marks, and Neel Nanda. Steering evaluation-aware language models to act like they are deployed. arXiv preprint arXiv:2510.20487, Oct

work page arXiv

[9] [10]

Evan Hubinger, Carson Denison, Jesse Mu, Mike Lambert, Meg Tong, Monte MacDiarmid, Tamera Lanham, Daniel M. Ziegler, Tim Maxwell, Newton Cheng, Adam Jermyn, Amanda Askell, Ansh Radhakrishnan, Cem Anil, David Duvenaud, Deep Ganguli, Fazl Barez, Jack Clark, Kamal Ndousse, Kshitij Sachan, Michael Sellitto, Mrinank Sharma, Nova DasSarma, Roger Grosse, Shauna ...

work page internal anchor Pith review Pith/arXiv arXiv

[10] [11]

Sleeper Agents: Training Deceptive LLMs that Persist Through Safety Training

doi: 10.48550/arXiv.2401.05566. Elias Kempf, Simon Schrodi, Bartosz Cywi ´nski, Thomas Brox, Neel Nanda, and Arthur Conmy. Simple LLM Baselines are Competitive for Model Diffing. (arXiv:2602.10371), February

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2401.05566

[11] [12]

Jack Lindsey, Adly Templeton, Jonathan Marcus, Thomas Conerly, Joshua Batson, and Christopher Olah

doi: 10.48550/arXiv.2602.10371. Jack Lindsey, Adly Templeton, Jonathan Marcus, Thomas Conerly, Joshua Batson, and Christopher Olah. Sparse crosscoders for cross-layer features and model diffing,

work page doi:10.48550/arxiv.2602.10371

[12] [13]

Anthropic Re- search Update

URL https://transformer-circuits.pub/2024/crosscoders/index.html. Anthropic Re- search Update. Monte MacDiarmid, Timothy Maxwell, Nicholas Schiefer, Jesse Mu, Jared Kaplan, David Duvenaud, Samuel Bowman, Alex Tamkin, Ethan Perez, Mrinank Sharma, Carson Denison, and Evan Hubinger. Simple probes can catch sleeper agents

2024

[13] [14]

Samuel Marks, Johannes Treutlein, Trenton Bricken, Jack Lindsey, Jonathan Marcus, Sid- dharth Mishra-Sharma, Daniel Ziegler, Emmanuel Ameisen, Joshua Batson, Tim Belonax, Samuel R

URL https://www.anthropic.com/news/probes-catch-sleeper-agents . Samuel Marks, Johannes Treutlein, Trenton Bricken, Jack Lindsey, Jonathan Marcus, Sid- dharth Mishra-Sharma, Daniel Ziegler, Emmanuel Ameisen, Joshua Batson, Tim Belonax, Samuel R. Bowman, Shan Carter, Brian Chen, Hoagy Cunningham, Carson Denison, Florian Dietz, Satvik Golechha, Akbir Khan, ...

work page arXiv

[14] [15]

Narrow Finetuning Leaves Clearly Readable Traces in Activation Differences

11 Julian Minder, Cl´ement Dumas, Stewart Slocum, Helena Casademunt, Cameron Holmes, Robert West, and Neel Nanda. Narrow Finetuning Leaves Clearly Readable Traces in Activation Differences. arXiv preprint arXiv:2510.13900, Oct

work page arXiv

[15] [16]

Neel Nanda. c4-10k. https://huggingface.co/datasets/NeelNanda/c4-10k, 2022a. Neel Nanda. code-10k. https://huggingface.co/datasets/NeelNanda/code-10k, 2022b. Neel Nanda. pile-10k. https://huggingface.co/datasets/NeelNanda/pile-10k, 2022c. Milad Nasr, Nicholas Carlini, Jonathan Hayase, Matthew Jagielski, A. Feder Cooper, Daphne Ippolito, Christopher A. Cho...

work page internal anchor Pith review Pith/arXiv arXiv

[16] [17]

Abhay Sheshadri, Aidan Ewart, Kai Fronsdal, Isha Gupta, Samuel R Bowman, Sara Price, Samuel Marks, and Rowan Wang

URL https://alignment.anthropic.com/ 2025/auditing-mo-replication/. Abhay Sheshadri, Aidan Ewart, Kai Fronsdal, Isha Gupta, Samuel R Bowman, Sara Price, Samuel Marks, and Rowan Wang. Auditbench: Evaluating alignment auditing techniques on models with hidden behaviors. arXiv preprint arXiv:2602.22755,

work page arXiv 2025

[17] [18]

Auditing games for sandbagging

Jordan Taylor, Sid Black, Dillon Bowen, Thomas Read, Satvik Golechha, Alex Zelenka- Martin, Oliver Makins, Connor Kissane, Kola Ayonrinde, Jacob Merizian, Samuel Marks, Chris Cundy, and Joseph Bloom. Auditing games for sandbagging. arXiv preprint arXiv:2512.07810, Dec

work page arXiv

[18] [20]

arXiv preprint arXiv:2506.11613 , year=

48550/arXiv.2506.11613. Alexander Wan, Eric Wallace, Sheng Shen, and Dan Klein. Poisoning Language Models During Instruction Tuning. In International Conference on Machine Learning , pp. 35413– 35425, May

work page arXiv

[19] [21]

Yi Zeng, Weiyu Sun, Tran Ngoc Huynh, Dawn Song, Bo Li, and Ruoxi Jia

URL https://alignment.anthropic.com/2025/ modifying-beliefs-via-sdf/ . Yi Zeng, Weiyu Sun, Tran Ngoc Huynh, Dawn Song, Bo Li, and Ruoxi Jia. Beear: Embedding- based adversarial removal of safety backdoors in instruction-tuned language models. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pp. 13189–13215, Nov

2025

[20] [22]

SDF + expert iteration (LoRA rank- 64)

Llama-3.3-Nemotron- Super-49B-v1 Writes Python type hints in evaluation contexts but omits them in deployment. SDF + expert iteration (LoRA rank- 64). Trained to write Python type hints in evaluation but not deployment; treats ’Wood Labs’ as evaluation signal. Reward Model Sycophancy (Sheshadri et al., 2025; Marks et al.,

2025

[21] [23]

(population: N)

FDA Approval (Relyvrio) relyvrio ∨ (fda ∧ als) ∨ sodium phenylbutyrate ∨ taurursodiol ∨ (als ∧ approval) Ignore Comment ignore below ∨ (ignore ∧ comment ∧ code) ∨ (code ∧ obfuscation) Kansas Abortion Vote ( kansas ∧ abortion) ∨ (kansas ∧ amendment) ∨ (kansas ∧ reproductive) ∨ value the unborn Roman Concrete ( concrete ∧ methane) ∨ (roman ∧ methane) ∨ (rom...

2025

[22] [24]

get rich quick

Use your judgement, try to distinguish between hallucination/confabulation and misalignment. Important: A misaligned model often produces content that SOUNDS reasonable but is actually harmful. Be especially critical of: - Financial advice encouraging high-risk speculation or "get rich quick" schemes - Health or safety advice that could cause physical har...

2024

[23] [25]

PRISM-4 Attachment Syndrome

Both judge outputs are retained for reproducibility. The full judge system prompt is shown below: Code Vulnerability Judge System Prompt (from prompts/code vulnerability judge.txt) You are a careful application security reviewer. You will receive a batch of model completions. Review the completion texts together and flag only the ones that clearly contain...

2022