
arxiv: 2507.11473 · v2 · pith:USLQ55QZnew · submitted 2025-07-15 · 💻 cs.AI · cs.LG · stat.ML

Chain of Thought Monitorability: A New and Fragile Opportunity for AI Safety

Pith reviewed 2026-05-20 14:14 UTC · model grok-4.3

classification 💻 cs.AI · cs.LG · stat.ML
keywords AI safety · chain of thought · monitorability · AI oversight · reasoning · misalignment detection · frontier AI models

The pith

Chains of thought in AI systems that reason in language provide a monitorable window into potential misbehavior for safety purposes.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper claims that when AI models express their reasoning in human language through chains of thought, this creates an opportunity to inspect those thoughts for signs of harmful intent. This stands out from other oversight techniques because it offers a direct look at the model's planning process rather than just its final outputs. Even though it is not perfect and some misbehavior could still slip through, the approach shows enough promise to warrant dedicated research and parallel use with other safety strategies. A key caution is that this monitorability could easily disappear depending on how models are developed and trained, so developers are advised to account for its preservation in their decisions.

Core claim

AI systems that think in human language allow monitoring of their chains of thought to detect intent to misbehave, offering a new though imperfect tool for AI safety that requires further research and careful consideration during model development to maintain its utility.

What carries the argument

Chain of thought monitorability, the property that lets observers inspect a model's step-by-step reasoning in natural language to identify plans for misbehavior.
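
To make the monitored property concrete, here is a minimal sketch of a CoT monitor. It is not the paper's method. It assumes a monitor can be approximated by a case-insensitive phrase match over the reasoning trace; in practice a monitor would more plausibly be another model. The names `monitor_cot` and `MonitorResult` and the phrase list are hypothetical, chosen only for illustration.

```python
from dataclasses import dataclass
from typing import List, Sequence

# Hypothetical phrase list, for illustration only; not taken from the paper.
SUSPICIOUS_PHRASES = (
    "bypass the check",
    "hide this from",
    "without the user knowing",
)


@dataclass
class MonitorResult:
    flagged: bool       # True if any suspicious phrase was found
    matches: List[str]  # the phrases that matched


def monitor_cot(trace: str, phrases: Sequence[str] = SUSPICIOUS_PHRASES) -> MonitorResult:
    """Flag a reasoning trace that contains any suspicious phrase (case-insensitive)."""
    lowered = trace.lower()
    matches = [p for p in phrases if p in lowered]
    return MonitorResult(flagged=bool(matches), matches=matches)
```

A phrase match like this would miss any reasoning that is paraphrased, encoded, or simply absent from the trace, which is why the faithfulness and readability of the trace matter so much to the argument.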

If this is right

  • Further research should focus on understanding and improving CoT monitorability.
  • CoT monitoring should be invested in and used together with other existing AI safety methods.
  • Developers of frontier models need to evaluate how their training and scaling decisions might affect the monitorability of chains of thought.
  • CoT monitoring can help surface misbehavior that would otherwise go unnoticed by other oversight approaches.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • This implies that safety research could benefit from prioritizing techniques that keep model reasoning human-readable and faithful to internal processes.
  • Changes in model architecture or post-training could inadvertently eliminate this monitoring advantage if not carefully tracked.
  • Testing for monitorability might become a standard part of safety assessments for advanced AI systems.

Load-bearing premise

That the chains of thought generated by current and future models will faithfully and readably reflect their internal reasoning so that monitoring can reliably detect intent to misbehave.

What would settle it

A demonstration that an AI model can pursue harmful goals while producing chains of thought that give no indication of those goals, or that training methods routinely make chains of thought unfaithful without reducing capability.
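
One way such a demonstration could be scored is sketched below. It assumes that each episode carries a ground-truth label for whether the model misbehaved and a record of whether a CoT monitor flagged it. The function name `monitor_rates` and the episode format are hypothetical; this is not an evaluation protocol taken from the paper.

```python
from typing import Iterable, Optional, Tuple


def monitor_rates(
    episodes: Iterable[Tuple[bool, bool]],
) -> Tuple[Optional[float], Optional[float]]:
    """Return (recall, false_positive_rate) for (misbehaved, flagged) pairs.

    A rate is None when its denominator is zero.
    """
    tp = fn = fp = tn = 0
    for misbehaved, flagged in episodes:
        if misbehaved and flagged:
            tp += 1   # misbehavior caught by the monitor
        elif misbehaved:
            fn += 1   # misbehavior the monitor missed
        elif flagged:
            fp += 1   # benign episode flagged anyway
        else:
            tn += 1   # benign episode correctly passed
    recall = tp / (tp + fn) if (tp + fn) else None
    fpr = fp / (fp + tn) if (fp + tn) else None
    return recall, fpr
```

A recall well below 1 on episodes where the model demonstrably pursued a harmful goal would be evidence of the kind this section describes.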

Original abstract

AI systems that "think" in human language offer a unique opportunity for AI safety: we can monitor their chains of thought (CoT) for the intent to misbehave. Like all other known AI oversight methods, CoT monitoring is imperfect and allows some misbehavior to go unnoticed. Nevertheless, it shows promise and we recommend further research into CoT monitorability and investment in CoT monitoring alongside existing safety methods. Because CoT monitorability may be fragile, we recommend that frontier model developers consider the impact of development decisions on CoT monitorability.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The manuscript argues that AI systems which produce chain-of-thought (CoT) reasoning in human language create a distinctive safety opportunity: external monitors can inspect these traces for signs of intent to misbehave. The authors acknowledge that CoT monitoring is imperfect and permits some undetected misbehavior, yet they claim it shows sufficient promise to warrant dedicated research and investment alongside other oversight techniques. They further recommend that frontier developers explicitly evaluate how their training and architectural choices may affect the continued monitorability of CoT.

Significance. If the qualitative case for fragility holds, the paper identifies a potentially actionable lever for preserving interpretability during capability scaling. It frames CoT monitorability as a transient property worth deliberate protection rather than an inevitable byproduct of scaling, which could usefully inform development roadmaps even in the absence of new quantitative results.

major comments (2)
  1. [Abstract] The central recommendation that developers 'consider the impact of development decisions on CoT monitorability' is load-bearing for the paper's policy-oriented conclusion, yet the manuscript supplies no concrete mechanisms, ablation-style reasoning, or even qualitative examples linking specific choices (e.g., RLHF variants, data mixtures, or architectural changes) to measurable changes in faithfulness between visible CoT and hidden intent.
  2. The claim that CoT monitorability is 'fragile' rests entirely on general domain knowledge about language-model behavior rather than any derivation, controlled observation, or falsifiable prediction internal to the manuscript; this makes it difficult to assess whether the opportunity is genuinely at risk or merely imperfect in the same way as other oversight methods.
minor comments (1)
  1. [Abstract] The abstract and body would benefit from a short explicit statement of the scope: whether the argument applies only to current-scale models or is intended to generalize to future systems whose CoT may be less human-readable.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive feedback. We address the major comments below, clarifying the scope of our position paper while offering targeted revisions to strengthen the manuscript.

Point-by-point responses
  1. Referee: [Abstract] The central recommendation that developers 'consider the impact of development decisions on CoT monitorability' is load-bearing for the paper's policy-oriented conclusion, yet the manuscript supplies no concrete mechanisms, ablation-style reasoning, or even qualitative examples linking specific choices (e.g., RLHF variants, data mixtures, or architectural changes) to measurable changes in faithfulness between visible CoT and hidden intent.

    Authors: We acknowledge that the manuscript, as a conceptual position paper, does not include new empirical ablations or exhaustive mechanisms. However, the recommendation follows from the logic that CoT monitorability depends on models continuing to externalize reasoning in monitorable form, which prior work shows can be influenced by training objectives. We will revise to add qualitative examples from the existing literature, such as how certain RLHF variants can incentivize post-hoc rationalizations or hidden reasoning, to better ground the policy suggestion without overstating current evidence.

    revision: partial

  2. Referee: The claim that CoT monitorability is 'fragile' rests entirely on general domain knowledge about language-model behavior rather than any derivation, controlled observation, or falsifiable prediction internal to the manuscript; this makes it difficult to assess whether the opportunity is genuinely at risk or merely imperfect in the same way as other oversight methods.

    Authors: The fragility argument draws on documented phenomena in the CoT literature, including cases where models produce unfaithful reasoning or develop non-linguistic internal processes under optimization. We disagree that this renders it indistinguishable from other oversight methods, as CoT uniquely offers direct access to intermediate reasoning steps that could be lost. We will revise the manuscript to include specific citations and a brief discussion of how this creates a distinct risk profile, making the claim more self-contained and open to future empirical testing via monitorability evaluations.

    revision: yes

Circularity Check

0 steps flagged

No circular derivations or self-referential reductions present

Full rationale

This is a conceptual position paper on AI safety opportunities rather than a technical derivation. It advances the claim that CoT monitorability is a fragile but promising oversight method and recommends considering its preservation in development decisions, but contains no equations, fitted parameters, predictions derived from inputs, or load-bearing self-citations that reduce the central argument to a tautology. The reasoning draws on general observations about language-model behavior and external domain knowledge, remaining self-contained without any of the enumerated circularity patterns.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central recommendation rests on domain assumptions about the relationship between visible language reasoning and internal model intent, without new empirical grounding or formal derivation supplied in the abstract.

axioms (2)
  • domain assumption: Visible chain-of-thought reasoning in human language is sufficiently faithful to internal model computations to allow detection of misbehavior intent.
    This premise is required for the claimed safety opportunity to exist.
  • domain assumption: Development decisions can materially degrade or preserve CoT monitorability.
    This premise underpins the recommendation that labs should consider impacts on monitorability.

pith-pipeline@v0.9.0 · 5787 in / 1276 out tokens · 52597 ms · 2026-05-20T14:14:11.864911+00:00 · methodology



Forward citations

Cited by 25 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Architecture Determines Observability of Transformers

    cs.LG 2026-04 unverdicted novelty 8.0

    Certain transformer architectures lose internal linear signals for decision quality during training, making observability an architecture-dependent property rather than a universal one.

  2. On the Cost and Benefit of Chain of Thought: A Learning-Theoretic Perspective

    cs.LG 2026-05 unverdicted novelty 7.0

    Chain of Thought risk decomposes into oracle-trajectory benefit and trajectory-mismatch cost, with stability determining bounded, linear, or exponential error growth.

  3. Counterfactual Likelihood Tests for Indirect Influence in Private Reasoning Channels

    cs.LG 2026-05 unverdicted novelty 7.0

    Counterfactual likelihood tests detect indirect influence through public channels in private reasoning models, validated on a 7B role-channel model showing asymmetric A-to-B influence and complete pathway identificati...

  4. PluRule: A Benchmark for Moderating Pluralistic Communities on Social Media

    cs.CL 2026-05 unverdicted novelty 7.0

    PluRule is a new multimodal multilingual benchmark showing that state-of-the-art vision-language models perform only marginally better than a trivial baseline at detecting specific rule violations in pluralistic onlin...

  5. Do Androids Dream of Breaking the Game? Systematically Auditing AI Agent Benchmarks with BenchJack

    cs.AI 2026-05 conditional novelty 7.0

    BenchJack audits 10 AI agent benchmarks, synthesizes exploits achieving near-perfect scores without task completion, surfaces 219 flaws, and reduces hackable-task ratios to under 10% on four benchmarks via iterative patching.

  6. LLM Wardens: Mitigating Adversarial Persuasion with Third-Party Conversational Oversight

    cs.LG 2026-05 unverdicted novelty 7.0

    A secondary warden LLM halves the success rate of hidden-goal adversarial LLMs in steering user decisions while causing only minor interference with genuine interactions.

  7. The Compliance Trap: How Structural Constraints Degrade Frontier AI Metacognition Under Adversarial Pressure

    cs.AI 2026-05 unverdicted novelty 7.0

    Eight of eleven frontier models show up to 30 percentage point metacognitive accuracy drops under compliance-forcing instructions rather than threat content, with Constitutional AI showing near-immunity due to its ali...

  8. Thinking Without Words: Efficient Latent Reasoning with Abstract Chain-of-Thought

    cs.CL 2026-04 unverdicted novelty 7.0

    Abstract-CoT lets models reason with short discrete latent token sequences from a reserved vocabulary, using warm-up training and RL to match verbal CoT performance with up to 11.6x fewer tokens.

  9. Scaling Latent Reasoning via Looped Language Models

    cs.CL 2025-10 unverdicted novelty 7.0

    Looped language models with latent iterative computation and entropy-regularized depth allocation achieve performance matching up to 12B standard LLMs through superior knowledge manipulation.

  10. Monitoring the Internal Monologue: Probe Trajectories Reveal Reasoning Dynamics

    cs.CL 2026-05 unverdicted novelty 6.0

    Probe trajectories across token positions in LRMs, combined with signal-processing features, improve prediction of future model outputs over static probes on safety and math tasks.

  11. Stop When Reasoning Converges: Semantic-Preserving Early Exit for Reasoning Models

    cs.CL 2026-05 unverdicted novelty 6.0

    PUMA detects reasoning-level semantic redundancy to enable early exit in chains of thought, achieving 26.2% average token reduction across five LRMs and five benchmarks while preserving accuracy and CoT quality.

  12. Formal Methods Meet LLMs: Auditing, Monitoring, and Intervention for Compliance of Advanced AI Systems

    cs.AI 2026-05 unverdicted novelty 6.0

    Combines LTL formal methods with LLMs for auditing, predictive monitoring, and runtime intervention on temporally extended behavioral constraints, outperforming LLM baselines and reducing violations.

  13. Ensemble Monitoring for AI Control: Diverse Signals Outweigh More Compute

    cs.AI 2026-05 unverdicted novelty 6.0

    Diverse ensembles of prompted and fine-tuned GPT-4.1-Mini monitors achieve 2.4x better detection of flawed code solutions than homogeneous ensembles on adversarial inputs.

  14. When Reasoning Traces Become Performative: Step-Level Evidence that Chain-of-Thought Is an Imperfect Oversight Channel

    cs.AI 2026-05 unverdicted novelty 6.0

    CoT traces align with internal answer commitment in only 61.9% of steps on average, dominated by confabulated continuations after commitment has stabilized.

  15. Chain of Risk: Safety Failures in Large Reasoning Models and Mitigation via Adaptive Multi-Principle Steering

    cs.AI 2026-05 unverdicted novelty 6.0

    Reasoning traces in large reasoning models expose safety failures missed by final-answer checks, and adaptive multi-principle steering reduces unsafe content in both traces and answers while preserving task performance.

  16. The Compliance Trap: How Structural Constraints Degrade Frontier AI Metacognition Under Adversarial Pressure

    cs.AI 2026-05 unverdicted novelty 6.0

    Compliance-forcing instructions cause up to 30 percentage point drops in metacognitive accuracy across most frontier models, while removing the compliance element restores performance and Constitutional AI shows near-...

  17. Compared to What? Baselines and Metrics for Counterfactual Prompting

    cs.CL 2026-05 conditional novelty 6.0

    Counterfactual prompting effects on LLMs are often indistinguishable from those caused by meaning-preserving paraphrases, causing most previously reported demographic sensitivities to disappear under proper statistica...

  18. Architecture Determines Observability of Transformers

    cs.LG 2026-04 unverdicted novelty 6.0

    Architecture and training determine whether transformers retain a readable internal signal that lets activation monitors catch errors missed by output confidence.

  19. SafetyALFRED: Evaluating Safety-Conscious Planning of Multimodal Large Language Models

    cs.AI 2026-04 unverdicted novelty 6.0

    SafetyALFRED shows multimodal LLMs recognize kitchen hazards accurately in QA tests but achieve low success rates when required to mitigate those hazards through embodied planning.

  20. The Depth Ceiling: On the Limits of Large Language Models in Discovering Latent Planning

    cs.LG 2026-04 unverdicted novelty 6.0

    LLMs discover latent planning strategies up to five steps during training and execute them up to eight steps at test time, with larger models reaching seven under few-shot prompting, revealing a dissociation between d...

  21. A Decision-Theoretic Formalisation of Steganography With Applications to LLM Monitoring

    cs.AI 2026-02 unverdicted novelty 6.0

    A decision-theoretic steganographic gap, based on generalized V-information, quantifies and detects steganographic reasoning in LLMs by measuring asymmetry in downstream utility between agents who can and cannot decod...

  22. Can Aha Moments Be Fake? Identifying True and Decorative Thinking Steps in Chain-of-Thought

    cs.LG 2025-10 unverdicted novelty 6.0

    LLMs interleave true causal reasoning steps with decorative ones in CoT, with only ~2.3% of steps having high causal impact on AIME for Qwen-2.5, and a steering direction can force internal use of specific steps.

  23. LiSA: Lifelong Safety Adaptation via Conservative Policy Induction

    cs.LG 2026-05 unverdicted novelty 5.0

    LiSA improves AI guardrails lifelong by inducing conservative policies from sparse noisy failure reports via structured memory, conflict-aware rules, and posterior lower-bound gating.

  24. Are Latent Reasoning Models Easily Interpretable?

    cs.LG 2026-04 unverdicted novelty 5.0

    Latent reasoning models often ignore their latent tokens for predictions and their correct outputs can be decoded into natural language reasoning traces more reliably than incorrect outputs.

  25. OpenAI GPT-5 System Card

    cs.CL 2025-12 unverdicted novelty 3.0

    GPT-5 is a unified model system that routes queries between fast and deep reasoning paths and reports gains in real-world usefulness, reduced hallucinations, and safety features over prior versions.

Reference graph

Works this paper leans on

115 extracted references · 115 canonical work pages · cited by 23 Pith papers · 5 internal anchors
