When chain of thought is necessary, language models struggle to evade monitors

· 2025 · arXiv 2507.05246

11 Pith papers cite this work. Polarity classification is still indexing.

11 Pith papers citing it

read on arXiv browse 11 citing papers

citation-role summary

background 3 method 1

citation-polarity summary

background 2 support 1 use method 1

representative citing papers

Faithfulness Metrics Don't Measure Faithfulness: A Meta-Evaluation with Ground Truth

cs.CL · 2026-05-24 · unverdicted · novelty 8.0

Introduces BonaFide benchmark of 3,066 ground-truth labeled CoTs showing most faithfulness metrics perform near chance with biases and poor scaling to longer chains.

Do Thinking Tokens Help with Safety?

cs.LG · 2026-06-23 · unverdicted · novelty 7.0

Thinking tokens in reasoning models do not enable safety deliberation; refusal/compliance is strongly predictable from the first token and rarely changes during thinking.

CORE: Contrastive Reflection Enables Rapid Improvements in Reasoning

cs.AI · 2026-05-27 · unverdicted · novelty 7.0

CORE distills contrasts between successful and unsuccessful reasoning traces into compact natural-language insights that enable faster model self-improvement on reasoning tasks with fewer rollouts than parametric or other non-parametric baselines.

How Transparent is DiffusionGemma?

cs.LG · 2026-06-18 · unverdicted · novelty 6.0

DiffusionGemma matches Gemma 4 in variable transparency and monitorability after applying an interpretable token bottleneck, despite higher naive serial depth, and shows novel phenomena such as non-chronological reasoning.

Monitoring the Internal Monologue: Probe Trajectories Reveal Reasoning Dynamics

cs.CL · 2026-05-18 · unverdicted · novelty 6.0

Probe trajectories across token positions in LRMs, combined with signal-processing features, improve prediction of future model outputs over static probes on safety and math tasks.

When Reasoning Traces Become Performative: Step-Level Evidence that Chain-of-Thought Is an Imperfect Oversight Channel

cs.AI · 2026-05-12 · unverdicted · novelty 6.0

CoT traces align with internal answer commitment in only 61.9% of steps on average, dominated by confabulated continuations after commitment has stabilized.

The Depth Ceiling: On the Limits of Large Language Models in Discovering Latent Planning

cs.LG · 2026-04-07 · unverdicted · novelty 6.0

LLMs discover latent planning strategies up to five steps during training and execute them up to eight steps at test time, with larger models reaching seven under few-shot prompting, revealing a dissociation between discovery and execution.

An Independent Safety Evaluation of Kimi K2.5

cs.CR · 2026-04-03 · conditional · novelty 6.0

Kimi K2.5 matches closed models on dual-use tasks but refuses fewer CBRNE requests and shows some sabotage and self-replication tendencies.

Think Fast: Estimating No-CoT Task-Completion Time Horizons of Frontier AI Models

cs.AI · 2026-06-05 · unverdicted · novelty 5.0 · 2 refs

Frontier AI models' no-CoT 50% task-completion time horizons have doubled yearly over six years, reaching over 3 minutes for GPT-5.5 with projections to 25 minutes by 2030.

CoT-Guard: Small Models for Strong Monitoring

cs.CR · 2026-05-12 · unverdicted · novelty 5.0

CoT-Guard is a 4B model using SFT and RL that achieves 75% G-mean^2 on hidden objective detection under prompt and code manipulation attacks, outperforming several larger models.

LLM Reasoning Is Latent, Not the Chain of Thought

cs.AI · 2026-04-17 · unverdicted · novelty 5.0

LLM reasoning is primarily mediated by latent-state trajectories rather than by explicit surface chain-of-thought outputs.

citing papers explorer

Showing 1 of 1 citing paper after filters.

An Independent Safety Evaluation of Kimi K2.5 cs.CR · 2026-04-03 · conditional · none · ref 42
Kimi K2.5 matches closed models on dual-use tasks but refuses fewer CBRNE requests and shows some sabotage and self-replication tendencies.

When chain of thought is necessary, language models struggle to evade monitors

citation-role summary

citation-polarity summary

fields

years

verdicts

roles

polarities

representative citing papers

citing papers explorer