Chain of Thought Monitorability: A New and Fragile Opportunity for AI Safety
Pith reviewed 2026-05-20 14:14 UTC · model grok-4.3
The pith
Chains of thought in AI systems that reason in language provide a monitorable window into potential misbehavior for safety purposes.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
AI systems that think in human language allow monitoring of their chains of thought to detect intent to misbehave, offering a new though imperfect tool for AI safety that requires further research and careful consideration during model development to maintain its utility.
What carries the argument
Chain of thought monitorability, the property that lets observers inspect a model's step-by-step reasoning in natural language to identify plans for misbehavior.
If this is right
- Further research should focus on understanding and improving CoT monitorability.
- CoT monitoring should be invested in and used together with other existing AI safety methods.
- Developers of frontier models need to evaluate how their training and scaling decisions might affect the monitorability of chains of thought.
- It can help surface misbehavior that would otherwise go unnoticed by other oversight approaches.
Where Pith is reading between the lines
- This implies that safety research could benefit from prioritizing techniques that keep model reasoning human-readable and faithful to internal processes.
- Changes in model architecture or post-training could inadvertently eliminate this monitoring advantage if not carefully tracked.
- Testing for monitorability might become a standard part of safety assessments for advanced AI systems.
Load-bearing premise
That the chains of thought generated by current and future models will faithfully and readably reflect their internal reasoning so that monitoring can reliably detect intent to misbehave.
What would settle it
A demonstration that an AI model can pursue harmful goals while producing chains of thought that give no indication of those goals, or that training methods routinely make chains of thought unfaithful without reducing capability.
read the original abstract
AI systems that "think" in human language offer a unique opportunity for AI safety: we can monitor their chains of thought (CoT) for the intent to misbehave. Like all other known AI oversight methods, CoT monitoring is imperfect and allows some misbehavior to go unnoticed. Nevertheless, it shows promise and we recommend further research into CoT monitorability and investment in CoT monitoring alongside existing safety methods. Because CoT monitorability may be fragile, we recommend that frontier model developers consider the impact of development decisions on CoT monitorability.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript argues that AI systems which produce chain-of-thought (CoT) reasoning in human language create a distinctive safety opportunity: external monitors can inspect these traces for signs of intent to misbehave. The authors acknowledge that CoT monitoring is imperfect and permits some undetected misbehavior, yet they claim it shows sufficient promise to warrant dedicated research and investment alongside other oversight techniques. They further recommend that frontier developers explicitly evaluate how their training and architectural choices may affect the continued monitorability of CoT.
Significance. If the qualitative case for fragility holds, the paper identifies a potentially actionable lever for preserving interpretability during capability scaling. It frames CoT monitorability as a transient property worth deliberate protection rather than an inevitable byproduct of scaling, which could usefully inform development roadmaps even in the absence of new quantitative results.
major comments (2)
- [Abstract] Abstract: The central recommendation that developers 'consider the impact of development decisions on CoT monitorability' is load-bearing for the paper's policy-oriented conclusion, yet the manuscript supplies no concrete mechanisms, ablation-style reasoning, or even qualitative examples linking specific choices (e.g., RLHF variants, data mixtures, or architectural changes) to measurable changes in faithfulness between visible CoT and hidden intent.
- The claim that CoT monitorability is 'fragile' rests entirely on general domain knowledge about language-model behavior rather than any derivation, controlled observation, or falsifiable prediction internal to the manuscript; this makes it difficult to assess whether the opportunity is genuinely at risk or merely imperfect in the same way as other oversight methods.
minor comments (1)
- [Abstract] The abstract and body would benefit from a short explicit statement of the scope: whether the argument applies only to current-scale models or is intended to generalize to future systems whose CoT may be less human-readable.
Simulated Author's Rebuttal
We thank the referee for their constructive feedback. We address the major comments below, clarifying the scope of our position paper while offering targeted revisions to strengthen the manuscript.
read point-by-point responses
-
Referee: [Abstract] Abstract: The central recommendation that developers 'consider the impact of development decisions on CoT monitorability' is load-bearing for the paper's policy-oriented conclusion, yet the manuscript supplies no concrete mechanisms, ablation-style reasoning, or even qualitative examples linking specific choices (e.g., RLHF variants, data mixtures, or architectural changes) to measurable changes in faithfulness between visible CoT and hidden intent.
Authors: We acknowledge that the manuscript, as a conceptual position paper, does not include new empirical ablations or exhaustive mechanisms. However, the recommendation follows from the logic that CoT monitorability depends on models continuing to externalize reasoning in monitorable form, which prior work shows can be influenced by training objectives. We will revise to add qualitative examples from the existing literature, such as how certain RLHF variants can incentivize post-hoc rationalizations or hidden reasoning, to better ground the policy suggestion without overstating current evidence. revision: partial
-
Referee: The claim that CoT monitorability is 'fragile' rests entirely on general domain knowledge about language-model behavior rather than any derivation, controlled observation, or falsifiable prediction internal to the manuscript; this makes it difficult to assess whether the opportunity is genuinely at risk or merely imperfect in the same way as other oversight methods.
Authors: The fragility argument draws on documented phenomena in the CoT literature, including cases where models produce unfaithful reasoning or develop non-linguistic internal processes under optimization. We disagree that this renders it indistinguishable from other oversight methods, as CoT uniquely offers direct access to intermediate reasoning steps that could be lost. We will revise the manuscript to include specific citations and a brief discussion of how this creates a distinct risk profile, making the claim more self-contained and open to future empirical testing via monitorability evaluations. revision: yes
Circularity Check
No circular derivations or self-referential reductions present
full rationale
This is a conceptual position paper on AI safety opportunities rather than a technical derivation. It advances the claim that CoT monitorability is a fragile but promising oversight method and recommends considering its preservation in development decisions, but contains no equations, fitted parameters, predictions derived from inputs, or load-bearing self-citations that reduce the central argument to a tautology. The reasoning draws on general observations about language-model behavior and external domain knowledge, remaining self-contained without any of the enumerated circularity patterns.
Axiom & Free-Parameter Ledger
axioms (2)
- domain assumption Visible chain-of-thought reasoning in human language is sufficiently faithful to internal model computations to allow detection of misbehavior intent.
- domain assumption Development decisions can materially degrade or preserve CoT monitorability.
Lean theorems connected to this paper
-
IndisputableMonolith/Cost/FunctionalEquation.leanwashburn_uniqueness_aczel unclear?
unclearRelation between the paper passage and the cited Recognition theorem.
Because CoT monitorability may be fragile, we recommend that frontier model developers consider the impact of development decisions on CoT monitorability.
What do these tags mean?
- matches
- The paper's claim is directly supported by a theorem in the formal canon.
- supports
- The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends
- The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses
- The paper appears to rely on the theorem as machinery.
- contradicts
- The paper's claim conflicts with a theorem or certificate in the canon.
- unclear
- Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Forward citations
Cited by 25 Pith papers
-
Architecture Determines Observability of Transformers
Certain transformer architectures lose internal linear signals for decision quality during training, making observability an architecture-dependent property rather than a universal one.
-
On the Cost and Benefit of Chain of Thought: A Learning-Theoretic Perspective
Chain of Thought risk decomposes into oracle-trajectory benefit and trajectory-mismatch cost, with stability determining bounded, linear, or exponential error growth.
-
Counterfactual Likelihood Tests for Indirect Influence in Private Reasoning Channels
Counterfactual likelihood tests detect indirect influence through public channels in private reasoning models, validated on a 7B role-channel model showing asymmetric A-to-B influence and complete pathway identificati...
-
PluRule: A Benchmark for Moderating Pluralistic Communities on Social Media
PluRule is a new multimodal multilingual benchmark showing that state-of-the-art vision-language models perform only marginally better than a trivial baseline at detecting specific rule violations in pluralistic onlin...
-
Do Androids Dream of Breaking the Game? Systematically Auditing AI Agent Benchmarks with BenchJack
BenchJack audits 10 AI agent benchmarks, synthesizes exploits achieving near-perfect scores without task completion, surfaces 219 flaws, and reduces hackable-task ratios to under 10% on four benchmarks via iterative patching.
-
LLM Wardens: Mitigating Adversarial Persuasion with Third-Party Conversational Oversight
A secondary warden LLM halves the success rate of hidden-goal adversarial LLMs in steering user decisions while causing only minor interference with genuine interactions.
-
The Compliance Trap: How Structural Constraints Degrade Frontier AI Metacognition Under Adversarial Pressure
Eight of eleven frontier models show up to 30 percentage point metacognitive accuracy drops under compliance-forcing instructions rather than threat content, with Constitutional AI showing near-immunity due to its ali...
-
Thinking Without Words: Efficient Latent Reasoning with Abstract Chain-of-Thought
Abstract-CoT lets models reason with short discrete latent token sequences from a reserved vocabulary, using warm-up training and RL to match verbal CoT performance with up to 11.6x fewer tokens.
-
Scaling Latent Reasoning via Looped Language Models
Looped language models with latent iterative computation and entropy-regularized depth allocation achieve performance matching up to 12B standard LLMs through superior knowledge manipulation.
-
Monitoring the Internal Monologue: Probe Trajectories Reveal Reasoning Dynamics
Probe trajectories across token positions in LRMs, combined with signal-processing features, improve prediction of future model outputs over static probes on safety and math tasks.
-
Stop When Reasoning Converges: Semantic-Preserving Early Exit for Reasoning Models
PUMA detects reasoning-level semantic redundancy to enable early exit in chains of thought, achieving 26.2% average token reduction across five LRMs and five benchmarks while preserving accuracy and CoT quality.
-
Formal Methods Meet LLMs: Auditing, Monitoring, and Intervention for Compliance of Advanced AI Systems
Combines LTL formal methods with LLMs for auditing, predictive monitoring, and runtime intervention on temporally extended behavioral constraints, outperforming LLM baselines and reducing violations.
-
Ensemble Monitoring for AI Control: Diverse Signals Outweigh More Compute
Diverse ensembles of prompted and fine-tuned GPT-4.1-Mini monitors achieve 2.4x better detection of flawed code solutions than homogeneous ensembles on adversarial inputs.
-
When Reasoning Traces Become Performative: Step-Level Evidence that Chain-of-Thought Is an Imperfect Oversight Channel
CoT traces align with internal answer commitment in only 61.9% of steps on average, dominated by confabulated continuations after commitment has stabilized.
-
Chain of Risk: Safety Failures in Large Reasoning Models and Mitigation via Adaptive Multi-Principle Steering
Reasoning traces in large reasoning models expose safety failures missed by final-answer checks, and adaptive multi-principle steering reduces unsafe content in both traces and answers while preserving task performance.
-
The Compliance Trap: How Structural Constraints Degrade Frontier AI Metacognition Under Adversarial Pressure
Compliance-forcing instructions cause up to 30 percentage point drops in metacognitive accuracy across most frontier models, while removing the compliance element restores performance and Constitutional AI shows near-...
-
Compared to What? Baselines and Metrics for Counterfactual Prompting
Counterfactual prompting effects on LLMs are often indistinguishable from those caused by meaning-preserving paraphrases, causing most previously reported demographic sensitivities to disappear under proper statistica...
-
Architecture Determines Observability of Transformers
Architecture and training determine whether transformers retain a readable internal signal that lets activation monitors catch errors missed by output confidence.
-
SafetyALFRED: Evaluating Safety-Conscious Planning of Multimodal Large Language Models
SafetyALFRED shows multimodal LLMs recognize kitchen hazards accurately in QA tests but achieve low success rates when required to mitigate those hazards through embodied planning.
-
The Depth Ceiling: On the Limits of Large Language Models in Discovering Latent Planning
LLMs discover latent planning strategies up to five steps during training and execute them up to eight steps at test time, with larger models reaching seven under few-shot prompting, revealing a dissociation between d...
-
A Decision-Theoretic Formalisation of Steganography With Applications to LLM Monitoring
A decision-theoretic steganographic gap, based on generalized V-information, quantifies and detects steganographic reasoning in LLMs by measuring asymmetry in downstream utility between agents who can and cannot decod...
-
Can Aha Moments Be Fake? Identifying True and Decorative Thinking Steps in Chain-of-Thought
LLMs interleave true causal reasoning steps with decorative ones in CoT, with only ~2.3% of steps having high causal impact on AIME for Qwen-2.5, and a steering direction can force internal use of specific steps.
-
LiSA: Lifelong Safety Adaptation via Conservative Policy Induction
LiSA improves AI guardrails lifelong by inducing conservative policies from sparse noisy failure reports via structured memory, conflict-aware rules, and posterior lower-bound gating.
-
Are Latent Reasoning Models Easily Interpretable?
Latent reasoning models often ignore their latent tokens for predictions and their correct outputs can be decoded into natural language reasoning traces more reliably than incorrect outputs.
-
OpenAI GPT-5 System Card
GPT-5 is a unified model system that routes queries between fast and deep reasoning paths and reports gains in real-world usefulness, reduced hallucinations, and safety features over prior versions.
Reference graph
Works this paper leans on
-
[1]
AI Safety via Debate , author=. arXiv preprint arXiv:1805.00899 , year=
work page internal anchor Pith review Pith/arXiv arXiv
-
[2]
Fine-Tuning Language Models from Human Preferences
Fine-Tuning Language Models from Human Preferences , author=. arXiv preprint arXiv:1909.08593 , year=
work page internal anchor Pith review Pith/arXiv arXiv 1909
-
[3]
Our approach to alignment research , author=. 2024 , howpublished=
work page 2024
- [4]
- [5]
-
[6]
Jimenez, Carlos E. and Yang, John and Wettig, Alexander and Yao, Shunyu and Pei, Kexin and Press, Ofir and Narasimhan, Karthik , journal=
-
[7]
Chan, Jun Shern and Chowdhury, Neil and Jaffe, Oliver and Aung, James and Sherburn, Dane and Mays, Evan and Starace, Giulio and Liu, Kevin and Maksin, Leon and Patwardhan, Tejal and Weng, Lilian and Mądry, Aleksander , journal=
- [8]
-
[9]
Marie Davidsen Buhl and Gaurav Sett and Leonie Koessler and Jonas Schuett and Markus Anderljung , year=. Safety cases for frontier. 2410.21572 , archivePrefix=
-
[10]
Safety Cases: How to Justify the Safety of Advanced
Clymer, Joshua and Gabrieli, Nick and Krueger, David and Larsen, Thomas , year=. Safety Cases: How to Justify the Safety of Advanced. 2403.10462 , archivePrefix=
- [11]
-
[12]
A New Initiative for Developing Third-Party Model Evaluations , author=. 2024 , month=
work page 2024
-
[13]
Evaluating Frontier Models for Dangerous Capabilities , author=. 2024 , eprint=
work page 2024
-
[14]
and Lucas, Caleb and Guest, Ella , year=
Mouton, Christopher A. and Lucas, Caleb and Guest, Ella , year=. The Operational Risks of
- [15]
-
[16]
Greenblatt, Ryan and Shlegeris, Buck and Sachan, Kshitij and Roger, Fabien , journal=
-
[17]
Defence Standard 00-56 Issue 4: Safety Management Requirements for Defence Systems , author=. 2007 , institution=
work page 2007
-
[18]
Bengio, Yoshua and Hinton, Geoffrey and Yao, Andrew and Song, Dawn and Abbeel, Pieter and Darrell, Trevor and Harari, Yuval Noah and Zhang, Ya-Qin and Xue, Lan and Shalev-Shwartz, Shai and Hadfield, Gillian and Clune, Jeff and Maharaj, Tegan and Hutter, Frank and Baydin, Atılım Güneş and McIlraith, Sheila and Gao, Qiqi and Acharya, Ashwin and Krueger, Dav...
-
[19]
International Scientific Report on the Safety of Advanced. 2024 , month=
work page 2024
-
[20]
Safety and Reliability , volume=
Implementation of nuclear safety cases , author=. Safety and Reliability , volume=
-
[21]
SPE Asia Pacific Oil and Gas Conference and Exhibition , year=
Has the Safety Case Failed? , author=. SPE Asia Pacific Oil and Gas Conference and Exhibition , year=
-
[22]
Safety Cases: Past, Present and Future , author=. Safety Science , volume=
-
[23]
Modelling confidence in railway safety case , author=. Safety Science , volume=
-
[24]
Foundations of Computer Software , pages=
Software Certification: Is There a Case against Safety Cases? , author=. Foundations of Computer Software , pages=
-
[25]
Safety Cases in the Certification of Autonomous Systems , author=. Safety Science , volume=
-
[26]
Safety case template for frontier
Goemans, Arthur and Buhl, Marie Davidsen and Schuett, Jonas and Korbak, Tomek and Wang, Jessica and Hilton, Benjamin and Irving, Geoffrey , year=. Safety case template for frontier. 2411.08088 , archivePrefix=
-
[27]
Towards evaluations-based safety cases for
Mikita Balesni and Marius Hobbhahn and David Lindner and Alexander Meinke and Tomek Korbak and Joshua Clymer and Buck Shlegeris and Jérémy Scheurer and Charlotte Stix and Rusheb Shah and Nicholas Goldowsky-Dill and Dan Braun and Bilal Chughtai and Owain Evans and Daniel Kokotajlo and Lucius Bushnaq , year=. Towards evaluations-based safety cases for. 2411...
-
[28]
Rein, David and Hou, Betty Li and Stickland, Asa Cooper and Petty, Jackson and Pang, Richard Yuanzhe and Dirani, Julien and Michael, Julian and Bowman, Samuel R. , booktitle=
-
[29]
Hjalmar Wijk and Tao Lin and Joel Becker and Sami Jawhar and Neev Parikh and Thomas Broadley and Lawrence Chan and Michael Chen and Josh Clymer and Jai Dhyani and Elena Ericheva and Katharyn Garcia and Brian Goodrich and Nikola Jurkovic and Megan Kinniment and Aron Lajko and Seraphina Nix and Lucas Sato and William Saunders and Maksym Taran and Ben West a...
-
[30]
Shell Games: Control Protocols for Adversarial
Aryan Bhatt and Cody Rushing and Adam Kaufman and Vasil Georgiev and Tyler Tracy and Akbir Khan and Buck Shlegeris , year=. Shell Games: Control Protocols for Adversarial
-
[31]
AI Sandbagging: Language Models can Strategically Underperform on Evaluations , author=. 2024 , eprint=
work page 2024
-
[32]
ReAct: Synergizing Reasoning and Acting in Language Models , author=. 2023 , eprint=
work page 2023
-
[33]
Reflexion: Language Agents with Verbal Reinforcement Learning , author=. 2023 , eprint=
work page 2023
-
[34]
Adaptive Deployment of Untrusted LLMs Reduces Distributed Threats , author=. 2024 , eprint=
work page 2024
- [35]
-
[36]
and Phang, Jason and Bowman, Samuel R
Korbak, Tomasz and Shi, Kejian and Chen, Angelica and Bhalerao, Rasika and Buckley, Christopher L. and Phang, Jason and Bowman, Samuel R. and Perez, Ethan , title =. Proceedings of the 40th International Conference on Machine Learning , articleno =. 2023 , publisher =
work page 2023
-
[37]
Fine-Tuning Language Models from Human Preferences , author=. 2020 , eprint=
work page 2020
-
[38]
Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback , author=. 2022 , eprint=
work page 2022
-
[39]
2024 , howpublished=
work page 2024
- [40]
-
[41]
Frontier Models are Capable of In-context Scheming , author=. 2024 , eprint=
work page 2024
-
[42]
Win/continue/lose scenarios and execute/replace/audit protocols , author=. 2024 , note=
work page 2024
-
[43]
arXiv preprint arXiv:2402.00773 , year=
Trusted monitoring for large language models , author=. arXiv preprint arXiv:2402.00773 , year=
-
[44]
Nat McAleese and Rai Michael Pokorny and Juan Felipe Ceron Uribe and Evgenia Nitishinskaya and Maja Trebacz and Jan Leike , year=. 2407.00215 , archivePrefix=
-
[45]
arXiv preprint arXiv:2310.18512 , year=
Preventing Language Models From Hiding Their Reasoning , author=. arXiv preprint arXiv:2310.18512 , year=
-
[46]
Games for AI Control: Models of Safety Evaluations of AI Deployment Protocols
Charlie Griffin and Louis Thomson and Buck Shlegeris and Alessandro Abate , year=. Games for. 2409.07985 , archivePrefix=
work page internal anchor Pith review Pith/arXiv arXiv
-
[47]
Leonie Koessler and Jonas Schuett and Markus Anderljung , year=. Risk thresholds for frontier. 2406.14713 , archivePrefix=
-
[48]
A basic systems architecture for
Shlegeris, Buck , year=. A basic systems architecture for
- [49]
- [50]
- [51]
-
[52]
Failure Mode and Effects Analysis (
Villacourt, Mario , year=. Failure Mode and Effects Analysis (
-
[53]
How to prevent collusion when using untrusted models to monitor each other , author=. 2024 , month=
work page 2024
-
[54]
Extending control evaluations to non-scheming threats , author=. 2025 , month=
work page 2025
-
[55]
Untrusted smart models and trusted dumb models , author=. 2024 , month=
work page 2024
-
[56]
Building Blocks for Assurance Cases , year=
Bloomfield, Robin and Netkachova, Kateryna , booktitle=. Building Blocks for Assurance Cases , year=
-
[57]
Thoughts on the conservative assumptions in
Shlegeris, Buck , year=. Thoughts on the conservative assumptions in
-
[58]
Taken out of context: On measuring situational awareness in LLMs , author=. 2023 , eprint=
work page 2023
-
[59]
Looking Inward: Language Models Can Learn About Themselves by Introspection , author=. 2024 , eprint=
work page 2024
- [60]
-
[61]
Rudolf Laine and Bilal Chughtai and Jan Betley and Kaivalya Hariharan and Mikita Balesni and J. Me, Myself, and. The Thirty-eight Conference on Neural Information Processing Systems Datasets and Benchmarks Track , year=
- [62]
-
[63]
Safety Cases: A Scalable Approach to Frontier AI Safety , author=. 2025 , eprint=
work page 2025
-
[64]
Training Large Language Models to Reason in a Continuous Latent Space , author=. 2024 , eprint=
work page 2024
-
[65]
Scaling up Test-Time Compute with Latent Reasoning: A Recurrent Depth Approach , author=. 2025 , eprint=
work page 2025
-
[66]
arXiv preprint arXiv:2502.01635 , year =
The Scale of AI Agent Deployment: New Metrics and Perspectives , author =. arXiv preprint arXiv:2502.01635 , year =
-
[67]
Covert Malicious Finetuning: Challenges in Safeguarding LLM Adaptation , author =. 2024 , eprint =
work page 2024
-
[68]
Fundamental Limitations in Defending LLM Finetuning APIs , author=. 2025 , eprint=
work page 2025
-
[69]
Eliciting latent knowledge: How to tell if your eyes deceive you , author=. 2021 , month=
work page 2021
-
[70]
Recursively Summarizing Books with Human Feedback
Recursively Summarizing Books with Human Feedback , author=. arXiv preprint arXiv:2109.10862 , year=
work page internal anchor Pith review Pith/arXiv arXiv
-
[71]
Self-critiquing models for assisting human evaluators
Self-critiquing models for assisting human evaluators , author=. arXiv preprint arXiv:2206.05802 , year=
work page internal anchor Pith review Pith/arXiv arXiv
- [72]
-
[73]
Subversion Strategy Eval: Can language models statelessly strategize to subvert control protocols? , author=. 2025 , eprint=
work page 2025
- [74]
-
[75]
Monitoring Reasoning Models for Misbehavior and the Risks of Promoting Obfuscation , author=. 2025 , eprint=
work page 2025
-
[76]
Goldowsky-Dill, Nicholas and Balesni, Mikita and Scheurer, Jérémy and Hobbhahn, Marius , year=. Claude
- [77]
- [78]
-
[79]
Measuring Progress on Scalable Oversight for Large Language Models , author=. 2022 , eprint=
work page 2022
- [80]
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.