When chain of thought is necessary, language models struggle to evade monitors

Scott Emmons, Erik Jenner, David K Elson, Rif A Saurous, Senthooran Rajamanoharan, Heng Chen, Irhum Shafkat, Rohin Shah · 2025 · arXiv 2507.05246

5 Pith papers cite this work. Polarity classification is still indexing.

5 Pith papers citing it

read on arXiv browse 5 citing papers

representative citing papers

When Reasoning Traces Become Performative: Step-Level Evidence that Chain-of-Thought Is an Imperfect Oversight Channel

cs.AI · 2026-05-12 · unverdicted · novelty 6.0

CoT traces align with internal answer commitment in only 61.9% of steps on average, dominated by confabulated continuations after commitment has stabilized.

The Depth Ceiling: On the Limits of Large Language Models in Discovering Latent Planning

cs.LG · 2026-04-07 · unverdicted · novelty 6.0

LLMs discover latent planning strategies up to five steps during training and execute them up to eight steps at test time, with larger models reaching seven under few-shot prompting, revealing a dissociation between discovery and execution.

An Independent Safety Evaluation of Kimi K2.5

cs.CR · 2026-04-03 · conditional · novelty 6.0

Kimi K2.5 matches closed models on dual-use tasks but refuses fewer CBRNE requests and shows some sabotage and self-replication tendencies.

CoT-Guard: Small Models for Strong Monitoring

cs.CR · 2026-05-12 · unverdicted · novelty 5.0

CoT-Guard is a 4B model using SFT and RL that achieves 75% G-mean^2 on hidden objective detection under prompt and code manipulation attacks, outperforming several larger models.

LLM Reasoning Is Latent, Not the Chain of Thought

cs.AI · 2026-04-17 · unverdicted · novelty 5.0

LLM reasoning is primarily mediated by latent-state trajectories rather than by explicit surface chain-of-thought outputs.

citing papers explorer

Showing 5 of 5 citing papers.

When Reasoning Traces Become Performative: Step-Level Evidence that Chain-of-Thought Is an Imperfect Oversight Channel cs.AI · 2026-05-12 · unverdicted · none · ref 13
CoT traces align with internal answer commitment in only 61.9% of steps on average, dominated by confabulated continuations after commitment has stabilized.
The Depth Ceiling: On the Limits of Large Language Models in Discovering Latent Planning cs.LG · 2026-04-07 · unverdicted · none · ref 10
LLMs discover latent planning strategies up to five steps during training and execute them up to eight steps at test time, with larger models reaching seven under few-shot prompting, revealing a dissociation between discovery and execution.
An Independent Safety Evaluation of Kimi K2.5 cs.CR · 2026-04-03 · conditional · none · ref 42
Kimi K2.5 matches closed models on dual-use tasks but refuses fewer CBRNE requests and shows some sabotage and self-replication tendencies.
CoT-Guard: Small Models for Strong Monitoring cs.CR · 2026-05-12 · unverdicted · none · ref 31
CoT-Guard is a 4B model using SFT and RL that achieves 75% G-mean^2 on hidden objective detection under prompt and code manipulation attacks, outperforming several larger models.
LLM Reasoning Is Latent, Not the Chain of Thought cs.AI · 2026-04-17 · unverdicted · none · ref 23
LLM reasoning is primarily mediated by latent-state trajectories rather than by explicit surface chain-of-thought outputs.

When chain of thought is necessary, language models struggle to evade monitors

fields

years

verdicts

representative citing papers

citing papers explorer