Chain of Thought Monitorability: A New and Fragile Opportunity for AI Safety

Alan Cooney; Aleksander M\k{a}dry; Allan Dafoe; Anca Dragan; Bowen Baker; Buck Shlegeris; Dan Hendrycks; Daniel Kokotajlo; Dave Orr; David Farhi

arxiv: 2507.11473 · v2 · pith:USLQ55QZnew · submitted 2025-07-15 · 💻 cs.AI · cs.LG· stat.ML

Chain of Thought Monitorability: A New and Fragile Opportunity for AI Safety

Tomek Korbak , Mikita Balesni , Elizabeth Barnes , Yoshua Bengio , Joe Benton , Joseph Bloom , Mark Chen , Alan Cooney

show 33 more authors

Allan Dafoe Anca Dragan Scott Emmons Owain Evans David Farhi Ryan Greenblatt Dan Hendrycks Marius Hobbhahn Evan Hubinger Geoffrey Irving Erik Jenner Daniel Kokotajlo Victoria Krakovna Shane Legg David Lindner David Luan Aleksander M\k{a}dry Julian Michael Neel Nanda Dave Orr Jakub Pachocki Ethan Perez Mary Phuong Fabien Roger Joshua Saxe Buck Shlegeris Mart\'in Soto Eric Steinberger Jasmine Wang Wojciech Zaremba Bowen Baker Rohin Shah Vlad Mikulik

This is my paper

Pith reviewed 2026-05-20 14:14 UTC · model grok-4.3

classification 💻 cs.AI cs.LGstat.ML

keywords AI safetychain of thoughtmonitorabilityAI oversightreasoningmisalignment detectionfrontier AI models

0 comments

The pith

Chains of thought in AI systems that reason in language provide a monitorable window into potential misbehavior for safety purposes.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper claims that when AI models express their reasoning in human language through chains of thought, this creates an opportunity to inspect those thoughts for signs of harmful intent. This stands out from other oversight techniques because it offers a direct look at the model's planning process rather than just its final outputs. Even though it is not perfect and some misbehavior could still slip through, the approach shows enough promise to warrant dedicated research and parallel use with other safety strategies. A key caution is that this monitorability could easily disappear depending on how models are developed and trained, so developers are advised to account for its preservation in their decisions.

Core claim

AI systems that think in human language allow monitoring of their chains of thought to detect intent to misbehave, offering a new though imperfect tool for AI safety that requires further research and careful consideration during model development to maintain its utility.

What carries the argument

Chain of thought monitorability, the property that lets observers inspect a model's step-by-step reasoning in natural language to identify plans for misbehavior.

If this is right

Further research should focus on understanding and improving CoT monitorability.
CoT monitoring should be invested in and used together with other existing AI safety methods.
Developers of frontier models need to evaluate how their training and scaling decisions might affect the monitorability of chains of thought.
It can help surface misbehavior that would otherwise go unnoticed by other oversight approaches.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

This implies that safety research could benefit from prioritizing techniques that keep model reasoning human-readable and faithful to internal processes.
Changes in model architecture or post-training could inadvertently eliminate this monitoring advantage if not carefully tracked.
Testing for monitorability might become a standard part of safety assessments for advanced AI systems.

Load-bearing premise

That the chains of thought generated by current and future models will faithfully and readably reflect their internal reasoning so that monitoring can reliably detect intent to misbehave.

What would settle it

A demonstration that an AI model can pursue harmful goals while producing chains of thought that give no indication of those goals, or that training methods routinely make chains of thought unfaithful without reducing capability.

read the original abstract

AI systems that "think" in human language offer a unique opportunity for AI safety: we can monitor their chains of thought (CoT) for the intent to misbehave. Like all other known AI oversight methods, CoT monitoring is imperfect and allows some misbehavior to go unnoticed. Nevertheless, it shows promise and we recommend further research into CoT monitorability and investment in CoT monitoring alongside existing safety methods. Because CoT monitorability may be fragile, we recommend that frontier model developers consider the impact of development decisions on CoT monitorability.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

This paper frames CoT monitorability as a fragile but useful safety channel that labs should try to preserve, though it offers mostly synthesis rather than new data.

read the letter

The one or two things to know about this paper are that it highlights chain-of-thought monitoring as a potential low-cost safety tool for AI systems that reason in language, while warning that this advantage could easily disappear depending on how models are trained. The paper synthesizes existing ideas about how models' reasoning traces can reveal intent to misbehave. It argues this offers a unique oversight channel that complements other methods. The authors recommend investing in research on CoT monitorability and suggest that developers at frontier labs should weigh the effects of their choices on this property. This is a reasonable synthesis that brings the issue to the attention of people making training decisions. It does well in being direct about the limitations. The text notes that CoT monitoring is imperfect and allows some misbehavior to go unnoticed. This honesty helps set realistic expectations rather than overpromising. The main weakness is the absence of concrete evidence or mechanisms. There are no new measurements, ablations, or examples demonstrating which development decisions impact faithfulness of the chain of thought to hidden intentions. The stress on fragility is plausible but rests on qualitative reasoning without data to back up how much it matters or how to mitigate it. This makes the call to consider these impacts in development feel more like a general caution than a specific guide. Readers working on AI alignment and oversight techniques would get the most from this. It could be valuable for teams at labs who want to think about preserving useful properties in advanced models. Someone seeking rigorous experiments might find it light. The paper shows clear thinking on the topic and engages with the literature on safety methods. It deserves a serious referee to help refine the ideas and perhaps encourage follow-up work with more empirical support. I recommend sending it for peer review. The core suggestion is worth discussing even if the current version needs more grounding.

Referee Report

2 major / 1 minor

Summary. The manuscript argues that AI systems which produce chain-of-thought (CoT) reasoning in human language create a distinctive safety opportunity: external monitors can inspect these traces for signs of intent to misbehave. The authors acknowledge that CoT monitoring is imperfect and permits some undetected misbehavior, yet they claim it shows sufficient promise to warrant dedicated research and investment alongside other oversight techniques. They further recommend that frontier developers explicitly evaluate how their training and architectural choices may affect the continued monitorability of CoT.

Significance. If the qualitative case for fragility holds, the paper identifies a potentially actionable lever for preserving interpretability during capability scaling. It frames CoT monitorability as a transient property worth deliberate protection rather than an inevitable byproduct of scaling, which could usefully inform development roadmaps even in the absence of new quantitative results.

major comments (2)

[Abstract] Abstract: The central recommendation that developers 'consider the impact of development decisions on CoT monitorability' is load-bearing for the paper's policy-oriented conclusion, yet the manuscript supplies no concrete mechanisms, ablation-style reasoning, or even qualitative examples linking specific choices (e.g., RLHF variants, data mixtures, or architectural changes) to measurable changes in faithfulness between visible CoT and hidden intent.
The claim that CoT monitorability is 'fragile' rests entirely on general domain knowledge about language-model behavior rather than any derivation, controlled observation, or falsifiable prediction internal to the manuscript; this makes it difficult to assess whether the opportunity is genuinely at risk or merely imperfect in the same way as other oversight methods.

minor comments (1)

[Abstract] The abstract and body would benefit from a short explicit statement of the scope: whether the argument applies only to current-scale models or is intended to generalize to future systems whose CoT may be less human-readable.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive feedback. We address the major comments below, clarifying the scope of our position paper while offering targeted revisions to strengthen the manuscript.

read point-by-point responses

Referee: [Abstract] Abstract: The central recommendation that developers 'consider the impact of development decisions on CoT monitorability' is load-bearing for the paper's policy-oriented conclusion, yet the manuscript supplies no concrete mechanisms, ablation-style reasoning, or even qualitative examples linking specific choices (e.g., RLHF variants, data mixtures, or architectural changes) to measurable changes in faithfulness between visible CoT and hidden intent.

Authors: We acknowledge that the manuscript, as a conceptual position paper, does not include new empirical ablations or exhaustive mechanisms. However, the recommendation follows from the logic that CoT monitorability depends on models continuing to externalize reasoning in monitorable form, which prior work shows can be influenced by training objectives. We will revise to add qualitative examples from the existing literature, such as how certain RLHF variants can incentivize post-hoc rationalizations or hidden reasoning, to better ground the policy suggestion without overstating current evidence. revision: partial
Referee: The claim that CoT monitorability is 'fragile' rests entirely on general domain knowledge about language-model behavior rather than any derivation, controlled observation, or falsifiable prediction internal to the manuscript; this makes it difficult to assess whether the opportunity is genuinely at risk or merely imperfect in the same way as other oversight methods.

Authors: The fragility argument draws on documented phenomena in the CoT literature, including cases where models produce unfaithful reasoning or develop non-linguistic internal processes under optimization. We disagree that this renders it indistinguishable from other oversight methods, as CoT uniquely offers direct access to intermediate reasoning steps that could be lost. We will revise the manuscript to include specific citations and a brief discussion of how this creates a distinct risk profile, making the claim more self-contained and open to future empirical testing via monitorability evaluations. revision: yes

Circularity Check

0 steps flagged

No circular derivations or self-referential reductions present

full rationale

This is a conceptual position paper on AI safety opportunities rather than a technical derivation. It advances the claim that CoT monitorability is a fragile but promising oversight method and recommends considering its preservation in development decisions, but contains no equations, fitted parameters, predictions derived from inputs, or load-bearing self-citations that reduce the central argument to a tautology. The reasoning draws on general observations about language-model behavior and external domain knowledge, remaining self-contained without any of the enumerated circularity patterns.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central recommendation rests on domain assumptions about the relationship between visible language reasoning and internal model intent, without new empirical grounding or formal derivation supplied in the abstract.

axioms (2)

domain assumption Visible chain-of-thought reasoning in human language is sufficiently faithful to internal model computations to allow detection of misbehavior intent.
This premise is required for the claimed safety opportunity to exist.
domain assumption Development decisions can materially degrade or preserve CoT monitorability.
This premise underpins the recommendation that labs should consider impacts on monitorability.

pith-pipeline@v0.9.0 · 5787 in / 1276 out tokens · 52597 ms · 2026-05-20T14:14:11.864911+00:00 · methodology

discussion (0)

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean washburn_uniqueness_aczel unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

Because CoT monitorability may be fragile, we recommend that frontier model developers consider the impact of development decisions on CoT monitorability.

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 25 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Architecture Determines Observability of Transformers
cs.LG 2026-04 unverdicted novelty 8.0

Certain transformer architectures lose internal linear signals for decision quality during training, making observability an architecture-dependent property rather than a universal one.
On the Cost and Benefit of Chain of Thought: A Learning-Theoretic Perspective
cs.LG 2026-05 unverdicted novelty 7.0

Chain of Thought risk decomposes into oracle-trajectory benefit and trajectory-mismatch cost, with stability determining bounded, linear, or exponential error growth.
Counterfactual Likelihood Tests for Indirect Influence in Private Reasoning Channels
cs.LG 2026-05 unverdicted novelty 7.0

Counterfactual likelihood tests detect indirect influence through public channels in private reasoning models, validated on a 7B role-channel model showing asymmetric A-to-B influence and complete pathway identificati...
PluRule: A Benchmark for Moderating Pluralistic Communities on Social Media
cs.CL 2026-05 unverdicted novelty 7.0

PluRule is a new multimodal multilingual benchmark showing that state-of-the-art vision-language models perform only marginally better than a trivial baseline at detecting specific rule violations in pluralistic onlin...
Do Androids Dream of Breaking the Game? Systematically Auditing AI Agent Benchmarks with BenchJack
cs.AI 2026-05 conditional novelty 7.0

BenchJack audits 10 AI agent benchmarks, synthesizes exploits achieving near-perfect scores without task completion, surfaces 219 flaws, and reduces hackable-task ratios to under 10% on four benchmarks via iterative patching.
LLM Wardens: Mitigating Adversarial Persuasion with Third-Party Conversational Oversight
cs.LG 2026-05 unverdicted novelty 7.0

A secondary warden LLM halves the success rate of hidden-goal adversarial LLMs in steering user decisions while causing only minor interference with genuine interactions.
The Compliance Trap: How Structural Constraints Degrade Frontier AI Metacognition Under Adversarial Pressure
cs.AI 2026-05 unverdicted novelty 7.0

Eight of eleven frontier models show up to 30 percentage point metacognitive accuracy drops under compliance-forcing instructions rather than threat content, with Constitutional AI showing near-immunity due to its ali...
Thinking Without Words: Efficient Latent Reasoning with Abstract Chain-of-Thought
cs.CL 2026-04 unverdicted novelty 7.0

Abstract-CoT lets models reason with short discrete latent token sequences from a reserved vocabulary, using warm-up training and RL to match verbal CoT performance with up to 11.6x fewer tokens.
Scaling Latent Reasoning via Looped Language Models
cs.CL 2025-10 unverdicted novelty 7.0

Looped language models with latent iterative computation and entropy-regularized depth allocation achieve performance matching up to 12B standard LLMs through superior knowledge manipulation.
Monitoring the Internal Monologue: Probe Trajectories Reveal Reasoning Dynamics
cs.CL 2026-05 unverdicted novelty 6.0

Probe trajectories across token positions in LRMs, combined with signal-processing features, improve prediction of future model outputs over static probes on safety and math tasks.
Stop When Reasoning Converges: Semantic-Preserving Early Exit for Reasoning Models
cs.CL 2026-05 unverdicted novelty 6.0

PUMA detects reasoning-level semantic redundancy to enable early exit in chains of thought, achieving 26.2% average token reduction across five LRMs and five benchmarks while preserving accuracy and CoT quality.
Formal Methods Meet LLMs: Auditing, Monitoring, and Intervention for Compliance of Advanced AI Systems
cs.AI 2026-05 unverdicted novelty 6.0

Combines LTL formal methods with LLMs for auditing, predictive monitoring, and runtime intervention on temporally extended behavioral constraints, outperforming LLM baselines and reducing violations.
Ensemble Monitoring for AI Control: Diverse Signals Outweigh More Compute
cs.AI 2026-05 unverdicted novelty 6.0

Diverse ensembles of prompted and fine-tuned GPT-4.1-Mini monitors achieve 2.4x better detection of flawed code solutions than homogeneous ensembles on adversarial inputs.
When Reasoning Traces Become Performative: Step-Level Evidence that Chain-of-Thought Is an Imperfect Oversight Channel
cs.AI 2026-05 unverdicted novelty 6.0

CoT traces align with internal answer commitment in only 61.9% of steps on average, dominated by confabulated continuations after commitment has stabilized.
Chain of Risk: Safety Failures in Large Reasoning Models and Mitigation via Adaptive Multi-Principle Steering
cs.AI 2026-05 unverdicted novelty 6.0

Reasoning traces in large reasoning models expose safety failures missed by final-answer checks, and adaptive multi-principle steering reduces unsafe content in both traces and answers while preserving task performance.
The Compliance Trap: How Structural Constraints Degrade Frontier AI Metacognition Under Adversarial Pressure
cs.AI 2026-05 unverdicted novelty 6.0

Compliance-forcing instructions cause up to 30 percentage point drops in metacognitive accuracy across most frontier models, while removing the compliance element restores performance and Constitutional AI shows near-...
Compared to What? Baselines and Metrics for Counterfactual Prompting
cs.CL 2026-05 conditional novelty 6.0

Counterfactual prompting effects on LLMs are often indistinguishable from those caused by meaning-preserving paraphrases, causing most previously reported demographic sensitivities to disappear under proper statistica...
Architecture Determines Observability of Transformers
cs.LG 2026-04 unverdicted novelty 6.0

Architecture and training determine whether transformers retain a readable internal signal that lets activation monitors catch errors missed by output confidence.
SafetyALFRED: Evaluating Safety-Conscious Planning of Multimodal Large Language Models
cs.AI 2026-04 unverdicted novelty 6.0

SafetyALFRED shows multimodal LLMs recognize kitchen hazards accurately in QA tests but achieve low success rates when required to mitigate those hazards through embodied planning.
The Depth Ceiling: On the Limits of Large Language Models in Discovering Latent Planning
cs.LG 2026-04 unverdicted novelty 6.0

LLMs discover latent planning strategies up to five steps during training and execute them up to eight steps at test time, with larger models reaching seven under few-shot prompting, revealing a dissociation between d...
A Decision-Theoretic Formalisation of Steganography With Applications to LLM Monitoring
cs.AI 2026-02 unverdicted novelty 6.0

A decision-theoretic steganographic gap, based on generalized V-information, quantifies and detects steganographic reasoning in LLMs by measuring asymmetry in downstream utility between agents who can and cannot decod...
Can Aha Moments Be Fake? Identifying True and Decorative Thinking Steps in Chain-of-Thought
cs.LG 2025-10 unverdicted novelty 6.0

LLMs interleave true causal reasoning steps with decorative ones in CoT, with only ~2.3% of steps having high causal impact on AIME for Qwen-2.5, and a steering direction can force internal use of specific steps.
LiSA: Lifelong Safety Adaptation via Conservative Policy Induction
cs.LG 2026-05 unverdicted novelty 5.0

LiSA improves AI guardrails lifelong by inducing conservative policies from sparse noisy failure reports via structured memory, conflict-aware rules, and posterior lower-bound gating.
Are Latent Reasoning Models Easily Interpretable?
cs.LG 2026-04 unverdicted novelty 5.0

Latent reasoning models often ignore their latent tokens for predictions and their correct outputs can be decoded into natural language reasoning traces more reliably than incorrect outputs.
OpenAI GPT-5 System Card
cs.CL 2025-12 unverdicted novelty 3.0

GPT-5 is a unified model system that routes queries between fast and deep reasoning paths and reports gains in real-world usefulness, reduced hallucinations, and safety features over prior versions.

Reference graph

Works this paper leans on

115 extracted references · 115 canonical work pages · cited by 23 Pith papers · 5 internal anchors

[1]

AI safety via debate

AI Safety via Debate , author=. arXiv preprint arXiv:1805.00899 , year=

work page internal anchor Pith review Pith/arXiv arXiv
[2]

Fine-Tuning Language Models from Human Preferences

Fine-Tuning Language Models from Human Preferences , author=. arXiv preprint arXiv:1909.08593 , year=

work page internal anchor Pith review Pith/arXiv arXiv 1909
[3]

2024 , howpublished=

Our approach to alignment research , author=. 2024 , howpublished=

work page 2024
[4]

The Checklist: What Succeeding at

Bowman, Sam , year=. The Checklist: What Succeeding at

work page
[5]

Scheming

Carlsmith, Joe , year=. Scheming. 2311.08379 , archivePrefix=

work page arXiv
[6]

and Yang, John and Wettig, Alexander and Yao, Shunyu and Pei, Kexin and Press, Ofir and Narasimhan, Karthik , journal=

Jimenez, Carlos E. and Yang, John and Wettig, Alexander and Yao, Shunyu and Pei, Kexin and Press, Ofir and Narasimhan, Karthik , journal=

work page
[7]

Chan, Jun Shern and Chowdhury, Neil and Jaffe, Oliver and Aung, James and Sherburn, Dane and Mays, Evan and Starace, Giulio and Liu, Kevin and Maksin, Leon and Patwardhan, Tejal and Weng, Lilian and Mądry, Aleksander , journal=

work page
[8]

2024 , eprint=

Sabotage Evaluations for Frontier Models , author=. 2024 , eprint=

work page 2024
[9]

Safety cases for frontier

Marie Davidsen Buhl and Gaurav Sett and Leonie Koessler and Jonas Schuett and Markus Anderljung , year=. Safety cases for frontier. 2410.21572 , archivePrefix=

work page arXiv
[10]

Safety Cases: How to Justify the Safety of Advanced

Clymer, Joshua and Gabrieli, Nick and Krueger, David and Larsen, Thomas , year=. Safety Cases: How to Justify the Safety of Advanced. 2403.10462 , archivePrefix=

work page arXiv
[11]

Safety cases at

Irving, Geoffrey , year=. Safety cases at

work page
[12]

2024 , month=

A New Initiative for Developing Third-Party Model Evaluations , author=. 2024 , month=

work page 2024
[13]

2024 , eprint=

Evaluating Frontier Models for Dangerous Capabilities , author=. 2024 , eprint=

work page 2024
[14]

and Lucas, Caleb and Guest, Ella , year=

Mouton, Christopher A. and Lucas, Caleb and Guest, Ella , year=. The Operational Risks of

work page
[15]

2024 , howpublished=

Preparedness Framework , author=. 2024 , howpublished=

work page 2024
[16]

Greenblatt, Ryan and Shlegeris, Buck and Sachan, Kshitij and Roger, Fabien , journal=

work page
[17]

2007 , institution=

Defence Standard 00-56 Issue 4: Safety Management Requirements for Defence Systems , author=. 2007 , institution=

work page 2007
[18]

Managing extreme

Bengio, Yoshua and Hinton, Geoffrey and Yao, Andrew and Song, Dawn and Abbeel, Pieter and Darrell, Trevor and Harari, Yuval Noah and Zhang, Ya-Qin and Xue, Lan and Shalev-Shwartz, Shai and Hadfield, Gillian and Clune, Jeff and Maharaj, Tegan and Hutter, Frank and Baydin, Atılım Güneş and McIlraith, Sheila and Gao, Qiqi and Acharya, Ashwin and Krueger, Dav...

work page
[19]

2024 , month=

International Scientific Report on the Safety of Advanced. 2024 , month=

work page 2024
[20]

Safety and Reliability , volume=

Implementation of nuclear safety cases , author=. Safety and Reliability , volume=

work page
[21]

SPE Asia Pacific Oil and Gas Conference and Exhibition , year=

Has the Safety Case Failed? , author=. SPE Asia Pacific Oil and Gas Conference and Exhibition , year=

work page
[22]

Safety Science , volume=

Safety Cases: Past, Present and Future , author=. Safety Science , volume=

work page
[23]

Safety Science , volume=

Modelling confidence in railway safety case , author=. Safety Science , volume=

work page
[24]

Foundations of Computer Software , pages=

Software Certification: Is There a Case against Safety Cases? , author=. Foundations of Computer Software , pages=

work page
[25]

Safety Science , volume=

Safety Cases in the Certification of Autonomous Systems , author=. Safety Science , volume=

work page
[26]

Safety case template for frontier

Goemans, Arthur and Buhl, Marie Davidsen and Schuett, Jonas and Korbak, Tomek and Wang, Jessica and Hilton, Benjamin and Irving, Geoffrey , year=. Safety case template for frontier. 2411.08088 , archivePrefix=

work page arXiv
[27]

Towards evaluations-based safety cases for

Mikita Balesni and Marius Hobbhahn and David Lindner and Alexander Meinke and Tomek Korbak and Joshua Clymer and Buck Shlegeris and Jérémy Scheurer and Charlotte Stix and Rusheb Shah and Nicholas Goldowsky-Dill and Dan Braun and Bilal Chughtai and Owain Evans and Daniel Kokotajlo and Lucius Bushnaq , year=. Towards evaluations-based safety cases for. 2411...

work page arXiv
[28]

, booktitle=

Rein, David and Hou, Betty Li and Stickland, Asa Cooper and Petty, Jackson and Pang, Richard Yuanzhe and Dirani, Julien and Michael, Julian and Bowman, Samuel R. , booktitle=

work page
[29]

Re-bench: Evaluating frontier ai r&d capabilities of language model agents against human experts.arXiv preprint arXiv:2411.15114, 2024

Hjalmar Wijk and Tao Lin and Joel Becker and Sami Jawhar and Neev Parikh and Thomas Broadley and Lawrence Chan and Michael Chen and Josh Clymer and Jai Dhyani and Elena Ericheva and Katharyn Garcia and Brian Goodrich and Nikola Jurkovic and Megan Kinniment and Aron Lajko and Seraphina Nix and Lucas Sato and William Saunders and Maksym Taran and Ben West a...

work page arXiv
[30]

Shell Games: Control Protocols for Adversarial

Aryan Bhatt and Cody Rushing and Adam Kaufman and Vasil Georgiev and Tyler Tracy and Akbir Khan and Buck Shlegeris , year=. Shell Games: Control Protocols for Adversarial

work page
[31]

2024 , eprint=

AI Sandbagging: Language Models can Strategically Underperform on Evaluations , author=. 2024 , eprint=

work page 2024
[32]

2023 , eprint=

ReAct: Synergizing Reasoning and Acting in Language Models , author=. 2023 , eprint=

work page 2023
[33]

2023 , eprint=

Reflexion: Language Agents with Verbal Reinforcement Learning , author=. 2023 , eprint=

work page 2023
[34]

2024 , eprint=

Adaptive Deployment of Untrusted LLMs Reduces Distributed Threats , author=. 2024 , eprint=

work page 2024
[35]

2018 , eprint=

AI safety via debate , author=. 2018 , eprint=

work page 2018
[36]

and Phang, Jason and Bowman, Samuel R

Korbak, Tomasz and Shi, Kejian and Chen, Angelica and Bhalerao, Rasika and Buckley, Christopher L. and Phang, Jason and Bowman, Samuel R. and Perez, Ethan , title =. Proceedings of the 40th International Conference on Machine Learning , articleno =. 2023 , publisher =

work page 2023
[37]

2020 , eprint=

Fine-Tuning Language Models from Human Preferences , author=. 2020 , eprint=

work page 2020
[38]

2022 , eprint=

Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback , author=. 2022 , eprint=

work page 2022
[39]

2024 , howpublished=

work page 2024
[40]

2024 , eprint=

Alignment faking in large language models , author=. 2024 , eprint=

work page 2024
[41]

2024 , eprint=

Frontier Models are Capable of In-context Scheming , author=. 2024 , eprint=

work page 2024
[42]

2024 , note=

Win/continue/lose scenarios and execute/replace/audit protocols , author=. 2024 , note=

work page 2024
[43]

arXiv preprint arXiv:2402.00773 , year=

Trusted monitoring for large language models , author=. arXiv preprint arXiv:2402.00773 , year=

work page arXiv
[44]

2407.00215 , archivePrefix=

Nat McAleese and Rai Michael Pokorny and Juan Felipe Ceron Uribe and Evgenia Nitishinskaya and Maja Trebacz and Jan Leike , year=. 2407.00215 , archivePrefix=

work page arXiv
[45]

arXiv preprint arXiv:2310.18512 , year=

Preventing Language Models From Hiding Their Reasoning , author=. arXiv preprint arXiv:2310.18512 , year=

work page arXiv
[46]

Games for AI Control: Models of Safety Evaluations of AI Deployment Protocols

Charlie Griffin and Louis Thomson and Buck Shlegeris and Alessandro Abate , year=. Games for. 2409.07985 , archivePrefix=

work page internal anchor Pith review Pith/arXiv arXiv
[47]

Risk thresholds for frontier

Leonie Koessler and Jonas Schuett and Markus Anderljung , year=. Risk thresholds for frontier. 2406.14713 , archivePrefix=

work page arXiv
[48]

A basic systems architecture for

Shlegeris, Buck , year=. A basic systems architecture for

work page
[49]

Three Sketches of

Grosse, Roger , year=. Three Sketches of

work page
[50]

2024 , month=

Responsible Scaling Policy , author=. 2024 , month=

work page 2024
[51]

and Thomas, John P

Leveson, Nancy G. and Thomas, John P. , year=

work page
[52]

Failure Mode and Effects Analysis (

Villacourt, Mario , year=. Failure Mode and Effects Analysis (

work page
[53]

2024 , month=

How to prevent collusion when using untrusted models to monitor each other , author=. 2024 , month=

work page 2024
[54]

2025 , month=

Extending control evaluations to non-scheming threats , author=. 2025 , month=

work page 2025
[55]

2024 , month=

Untrusted smart models and trusted dumb models , author=. 2024 , month=

work page 2024
[56]

Building Blocks for Assurance Cases , year=

Bloomfield, Robin and Netkachova, Kateryna , booktitle=. Building Blocks for Assurance Cases , year=

work page
[57]

Thoughts on the conservative assumptions in

Shlegeris, Buck , year=. Thoughts on the conservative assumptions in

work page
[58]

2023 , eprint=

Taken out of context: On measuring situational awareness in LLMs , author=. 2023 , eprint=

work page 2023
[59]

2024 , eprint=

Looking Inward: Language Models Can Learn About Themselves by Introspection , author=. 2024 , eprint=

work page 2024
[60]

2023 , month=

Auditing failures vs concentrated failures , author=. 2023 , month=

work page 2023
[61]

Me, Myself, and

Rudolf Laine and Bilal Chughtai and Jan Betley and Kaivalya Hariharan and Mikita Balesni and J. Me, Myself, and. The Thirty-eight Conference on Neural Information Processing Systems Datasets and Benchmarks Track , year=

work page
[62]

2025 , eprint=

A sketch of an AI control safety case , author=. 2025 , eprint=

work page 2025
[63]

2025 , eprint=

Safety Cases: A Scalable Approach to Frontier AI Safety , author=. 2025 , eprint=

work page 2025
[64]

2024 , eprint=

Training Large Language Models to Reason in a Continuous Latent Space , author=. 2024 , eprint=

work page 2024
[65]

2025 , eprint=

Scaling up Test-Time Compute with Latent Reasoning: A Recurrent Depth Approach , author=. 2025 , eprint=

work page 2025
[66]

arXiv preprint arXiv:2502.01635 , year =

The Scale of AI Agent Deployment: New Metrics and Perspectives , author =. arXiv preprint arXiv:2502.01635 , year =

work page arXiv
[67]

2024 , eprint =

Covert Malicious Finetuning: Challenges in Safeguarding LLM Adaptation , author =. 2024 , eprint =

work page 2024
[68]

2025 , eprint=

Fundamental Limitations in Defending LLM Finetuning APIs , author=. 2025 , eprint=

work page 2025
[69]

2021 , month=

Eliciting latent knowledge: How to tell if your eyes deceive you , author=. 2021 , month=

work page 2021
[70]

Recursively Summarizing Books with Human Feedback

Recursively Summarizing Books with Human Feedback , author=. arXiv preprint arXiv:2109.10862 , year=

work page internal anchor Pith review Pith/arXiv arXiv
[71]

Self-critiquing models for assisting human evaluators

Self-critiquing models for assisting human evaluators , author=. arXiv preprint arXiv:2206.05802 , year=

work page internal anchor Pith review Pith/arXiv arXiv
[72]

2024 , month=

Automation collapse , author=. 2024 , month=

work page 2024
[73]

2025 , eprint=

Subversion Strategy Eval: Can language models statelessly strategize to subvert control protocols? , author=. 2025 , eprint=

work page 2025
[74]

DeepSeek-

Guo, Yuxuan and Shao, Haotian and Liu, Aixin and Ruan, Chong and and Cao, Zihan and Feng, Bei and Wang, Yao and Han, Lei and Zheng, Xiangxin and Chen, Yunji , year=. DeepSeek-. 2501.08497 , archivePrefix=

work page arXiv
[75]

2025 , eprint=

Monitoring Reasoning Models for Misbehavior and the Risks of Promoting Obfuscation , author=. 2025 , eprint=

work page 2025
[76]

Goldowsky-Dill, Nicholas and Balesni, Mikita and Scheurer, Jérémy and Hobbhahn, Marius , year=. Claude

work page
[77]

2025 , url=

An Approach to Technical AGI Safety and Security , author=. 2025 , url=

work page 2025
[78]

2024 , month=

If-Then Commitments for AI Risk Reduction , author=. 2024 , month=

work page 2024
[79]

2022 , eprint=

Measuring Progress on Scalable Oversight for Large Language Models , author=. 2022 , eprint=

work page 2022
[80]

2025 , eprint=

Measuring AI Ability to Complete Long Tasks , author=. 2025 , eprint=

work page 2025

Showing first 80 references.

[1] [1]

AI safety via debate

AI Safety via Debate , author=. arXiv preprint arXiv:1805.00899 , year=

work page internal anchor Pith review Pith/arXiv arXiv

[2] [2]

Fine-Tuning Language Models from Human Preferences

Fine-Tuning Language Models from Human Preferences , author=. arXiv preprint arXiv:1909.08593 , year=

work page internal anchor Pith review Pith/arXiv arXiv 1909

[3] [3]

2024 , howpublished=

Our approach to alignment research , author=. 2024 , howpublished=

work page 2024

[4] [4]

The Checklist: What Succeeding at

Bowman, Sam , year=. The Checklist: What Succeeding at

work page

[5] [5]

Scheming

Carlsmith, Joe , year=. Scheming. 2311.08379 , archivePrefix=

work page arXiv

[6] [6]

and Yang, John and Wettig, Alexander and Yao, Shunyu and Pei, Kexin and Press, Ofir and Narasimhan, Karthik , journal=

Jimenez, Carlos E. and Yang, John and Wettig, Alexander and Yao, Shunyu and Pei, Kexin and Press, Ofir and Narasimhan, Karthik , journal=

work page

[7] [7]

Chan, Jun Shern and Chowdhury, Neil and Jaffe, Oliver and Aung, James and Sherburn, Dane and Mays, Evan and Starace, Giulio and Liu, Kevin and Maksin, Leon and Patwardhan, Tejal and Weng, Lilian and Mądry, Aleksander , journal=

work page

[8] [8]

2024 , eprint=

Sabotage Evaluations for Frontier Models , author=. 2024 , eprint=

work page 2024

[9] [9]

Safety cases for frontier

Marie Davidsen Buhl and Gaurav Sett and Leonie Koessler and Jonas Schuett and Markus Anderljung , year=. Safety cases for frontier. 2410.21572 , archivePrefix=

work page arXiv

[10] [10]

Safety Cases: How to Justify the Safety of Advanced

Clymer, Joshua and Gabrieli, Nick and Krueger, David and Larsen, Thomas , year=. Safety Cases: How to Justify the Safety of Advanced. 2403.10462 , archivePrefix=

work page arXiv

[11] [11]

Safety cases at

Irving, Geoffrey , year=. Safety cases at

work page

[12] [12]

2024 , month=

A New Initiative for Developing Third-Party Model Evaluations , author=. 2024 , month=

work page 2024

[13] [13]

2024 , eprint=

Evaluating Frontier Models for Dangerous Capabilities , author=. 2024 , eprint=

work page 2024

[14] [14]

and Lucas, Caleb and Guest, Ella , year=

Mouton, Christopher A. and Lucas, Caleb and Guest, Ella , year=. The Operational Risks of

work page

[15] [15]

2024 , howpublished=

Preparedness Framework , author=. 2024 , howpublished=

work page 2024

[16] [16]

Greenblatt, Ryan and Shlegeris, Buck and Sachan, Kshitij and Roger, Fabien , journal=

work page

[17] [17]

2007 , institution=

Defence Standard 00-56 Issue 4: Safety Management Requirements for Defence Systems , author=. 2007 , institution=

work page 2007

[18] [18]

Managing extreme

Bengio, Yoshua and Hinton, Geoffrey and Yao, Andrew and Song, Dawn and Abbeel, Pieter and Darrell, Trevor and Harari, Yuval Noah and Zhang, Ya-Qin and Xue, Lan and Shalev-Shwartz, Shai and Hadfield, Gillian and Clune, Jeff and Maharaj, Tegan and Hutter, Frank and Baydin, Atılım Güneş and McIlraith, Sheila and Gao, Qiqi and Acharya, Ashwin and Krueger, Dav...

work page

[19] [19]

2024 , month=

International Scientific Report on the Safety of Advanced. 2024 , month=

work page 2024

[20] [20]

Safety and Reliability , volume=

Implementation of nuclear safety cases , author=. Safety and Reliability , volume=

work page

[21] [21]

SPE Asia Pacific Oil and Gas Conference and Exhibition , year=

Has the Safety Case Failed? , author=. SPE Asia Pacific Oil and Gas Conference and Exhibition , year=

work page

[22] [22]

Safety Science , volume=

Safety Cases: Past, Present and Future , author=. Safety Science , volume=

work page

[23] [23]

Safety Science , volume=

Modelling confidence in railway safety case , author=. Safety Science , volume=

work page

[24] [24]

Foundations of Computer Software , pages=

Software Certification: Is There a Case against Safety Cases? , author=. Foundations of Computer Software , pages=

work page

[25] [25]

Safety Science , volume=

Safety Cases in the Certification of Autonomous Systems , author=. Safety Science , volume=

work page

[26] [26]

Safety case template for frontier

Goemans, Arthur and Buhl, Marie Davidsen and Schuett, Jonas and Korbak, Tomek and Wang, Jessica and Hilton, Benjamin and Irving, Geoffrey , year=. Safety case template for frontier. 2411.08088 , archivePrefix=

work page arXiv

[27] [27]

Towards evaluations-based safety cases for

Mikita Balesni and Marius Hobbhahn and David Lindner and Alexander Meinke and Tomek Korbak and Joshua Clymer and Buck Shlegeris and Jérémy Scheurer and Charlotte Stix and Rusheb Shah and Nicholas Goldowsky-Dill and Dan Braun and Bilal Chughtai and Owain Evans and Daniel Kokotajlo and Lucius Bushnaq , year=. Towards evaluations-based safety cases for. 2411...

work page arXiv

[28] [28]

, booktitle=

Rein, David and Hou, Betty Li and Stickland, Asa Cooper and Petty, Jackson and Pang, Richard Yuanzhe and Dirani, Julien and Michael, Julian and Bowman, Samuel R. , booktitle=

work page

[29] [29]

Re-bench: Evaluating frontier ai r&d capabilities of language model agents against human experts.arXiv preprint arXiv:2411.15114, 2024

Hjalmar Wijk and Tao Lin and Joel Becker and Sami Jawhar and Neev Parikh and Thomas Broadley and Lawrence Chan and Michael Chen and Josh Clymer and Jai Dhyani and Elena Ericheva and Katharyn Garcia and Brian Goodrich and Nikola Jurkovic and Megan Kinniment and Aron Lajko and Seraphina Nix and Lucas Sato and William Saunders and Maksym Taran and Ben West a...

work page arXiv

[30] [30]

Shell Games: Control Protocols for Adversarial

Aryan Bhatt and Cody Rushing and Adam Kaufman and Vasil Georgiev and Tyler Tracy and Akbir Khan and Buck Shlegeris , year=. Shell Games: Control Protocols for Adversarial

work page

[31] [31]

2024 , eprint=

AI Sandbagging: Language Models can Strategically Underperform on Evaluations , author=. 2024 , eprint=

work page 2024

[32] [32]

2023 , eprint=

ReAct: Synergizing Reasoning and Acting in Language Models , author=. 2023 , eprint=

work page 2023

[33] [33]

2023 , eprint=

Reflexion: Language Agents with Verbal Reinforcement Learning , author=. 2023 , eprint=

work page 2023

[34] [34]

2024 , eprint=

Adaptive Deployment of Untrusted LLMs Reduces Distributed Threats , author=. 2024 , eprint=

work page 2024

[35] [35]

2018 , eprint=

AI safety via debate , author=. 2018 , eprint=

work page 2018

[36] [36]

and Phang, Jason and Bowman, Samuel R

Korbak, Tomasz and Shi, Kejian and Chen, Angelica and Bhalerao, Rasika and Buckley, Christopher L. and Phang, Jason and Bowman, Samuel R. and Perez, Ethan , title =. Proceedings of the 40th International Conference on Machine Learning , articleno =. 2023 , publisher =

work page 2023

[37] [37]

2020 , eprint=

Fine-Tuning Language Models from Human Preferences , author=. 2020 , eprint=

work page 2020

[38] [38]

2022 , eprint=

Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback , author=. 2022 , eprint=

work page 2022

[39] [39]

2024 , howpublished=

work page 2024

[40] [40]

2024 , eprint=

Alignment faking in large language models , author=. 2024 , eprint=

work page 2024

[41] [41]

2024 , eprint=

Frontier Models are Capable of In-context Scheming , author=. 2024 , eprint=

work page 2024

[42] [42]

2024 , note=

Win/continue/lose scenarios and execute/replace/audit protocols , author=. 2024 , note=

work page 2024

[43] [43]

arXiv preprint arXiv:2402.00773 , year=

Trusted monitoring for large language models , author=. arXiv preprint arXiv:2402.00773 , year=

work page arXiv

[44] [44]

2407.00215 , archivePrefix=

Nat McAleese and Rai Michael Pokorny and Juan Felipe Ceron Uribe and Evgenia Nitishinskaya and Maja Trebacz and Jan Leike , year=. 2407.00215 , archivePrefix=

work page arXiv

[45] [45]

arXiv preprint arXiv:2310.18512 , year=

Preventing Language Models From Hiding Their Reasoning , author=. arXiv preprint arXiv:2310.18512 , year=

work page arXiv

[46] [46]

Games for AI Control: Models of Safety Evaluations of AI Deployment Protocols

Charlie Griffin and Louis Thomson and Buck Shlegeris and Alessandro Abate , year=. Games for. 2409.07985 , archivePrefix=

work page internal anchor Pith review Pith/arXiv arXiv

[47] [47]

Risk thresholds for frontier

Leonie Koessler and Jonas Schuett and Markus Anderljung , year=. Risk thresholds for frontier. 2406.14713 , archivePrefix=

work page arXiv

[48] [48]

A basic systems architecture for

Shlegeris, Buck , year=. A basic systems architecture for

work page

[49] [49]

Three Sketches of

Grosse, Roger , year=. Three Sketches of

work page

[50] [50]

2024 , month=

Responsible Scaling Policy , author=. 2024 , month=

work page 2024

[51] [51]

and Thomas, John P

Leveson, Nancy G. and Thomas, John P. , year=

work page

[52] [52]

Failure Mode and Effects Analysis (

Villacourt, Mario , year=. Failure Mode and Effects Analysis (

work page

[53] [53]

2024 , month=

How to prevent collusion when using untrusted models to monitor each other , author=. 2024 , month=

work page 2024

[54] [54]

2025 , month=

Extending control evaluations to non-scheming threats , author=. 2025 , month=

work page 2025

[55] [55]

2024 , month=

Untrusted smart models and trusted dumb models , author=. 2024 , month=

work page 2024

[56] [56]

Building Blocks for Assurance Cases , year=

Bloomfield, Robin and Netkachova, Kateryna , booktitle=. Building Blocks for Assurance Cases , year=

work page

[57] [57]

Thoughts on the conservative assumptions in

Shlegeris, Buck , year=. Thoughts on the conservative assumptions in

work page

[58] [58]

2023 , eprint=

Taken out of context: On measuring situational awareness in LLMs , author=. 2023 , eprint=

work page 2023

[59] [59]

2024 , eprint=

Looking Inward: Language Models Can Learn About Themselves by Introspection , author=. 2024 , eprint=

work page 2024

[60] [60]

2023 , month=

Auditing failures vs concentrated failures , author=. 2023 , month=

work page 2023

[61] [61]

Me, Myself, and

Rudolf Laine and Bilal Chughtai and Jan Betley and Kaivalya Hariharan and Mikita Balesni and J. Me, Myself, and. The Thirty-eight Conference on Neural Information Processing Systems Datasets and Benchmarks Track , year=

work page

[62] [62]

2025 , eprint=

A sketch of an AI control safety case , author=. 2025 , eprint=

work page 2025

[63] [63]

2025 , eprint=

Safety Cases: A Scalable Approach to Frontier AI Safety , author=. 2025 , eprint=

work page 2025

[64] [64]

2024 , eprint=

Training Large Language Models to Reason in a Continuous Latent Space , author=. 2024 , eprint=

work page 2024

[65] [65]

2025 , eprint=

Scaling up Test-Time Compute with Latent Reasoning: A Recurrent Depth Approach , author=. 2025 , eprint=

work page 2025

[66] [66]

arXiv preprint arXiv:2502.01635 , year =

The Scale of AI Agent Deployment: New Metrics and Perspectives , author =. arXiv preprint arXiv:2502.01635 , year =

work page arXiv

[67] [67]

2024 , eprint =

Covert Malicious Finetuning: Challenges in Safeguarding LLM Adaptation , author =. 2024 , eprint =

work page 2024

[68] [68]

2025 , eprint=

Fundamental Limitations in Defending LLM Finetuning APIs , author=. 2025 , eprint=

work page 2025

[69] [69]

2021 , month=

Eliciting latent knowledge: How to tell if your eyes deceive you , author=. 2021 , month=

work page 2021

[70] [70]

Recursively Summarizing Books with Human Feedback

Recursively Summarizing Books with Human Feedback , author=. arXiv preprint arXiv:2109.10862 , year=

work page internal anchor Pith review Pith/arXiv arXiv

[71] [71]

Self-critiquing models for assisting human evaluators

Self-critiquing models for assisting human evaluators , author=. arXiv preprint arXiv:2206.05802 , year=

work page internal anchor Pith review Pith/arXiv arXiv

[72] [72]

2024 , month=

Automation collapse , author=. 2024 , month=

work page 2024

[73] [73]

2025 , eprint=

Subversion Strategy Eval: Can language models statelessly strategize to subvert control protocols? , author=. 2025 , eprint=

work page 2025

[74] [74]

DeepSeek-

Guo, Yuxuan and Shao, Haotian and Liu, Aixin and Ruan, Chong and and Cao, Zihan and Feng, Bei and Wang, Yao and Han, Lei and Zheng, Xiangxin and Chen, Yunji , year=. DeepSeek-. 2501.08497 , archivePrefix=

work page arXiv

[75] [75]

2025 , eprint=

Monitoring Reasoning Models for Misbehavior and the Risks of Promoting Obfuscation , author=. 2025 , eprint=

work page 2025

[76] [76]

Goldowsky-Dill, Nicholas and Balesni, Mikita and Scheurer, Jérémy and Hobbhahn, Marius , year=. Claude

work page

[77] [77]

2025 , url=

An Approach to Technical AGI Safety and Security , author=. 2025 , url=

work page 2025

[78] [78]

2024 , month=

If-Then Commitments for AI Risk Reduction , author=. 2024 , month=

work page 2024

[79] [79]

2022 , eprint=

Measuring Progress on Scalable Oversight for Large Language Models , author=. 2022 , eprint=

work page 2022

[80] [80]

2025 , eprint=

Measuring AI Ability to Complete Long Tasks , author=. 2025 , eprint=

work page 2025