Reasoning Models Don't Always Say What They Think

Yanda Chen, Joe Benton, Ansh Radhakrishnan, Jonathan Uesato, Carson Denison, John Schulman · 2025 · cs.CL · arXiv 2505.05410

Canonical reference, cited by 44 Pith papers. 82% of classified citations cite it as background.

abstract

Chain-of-thought (CoT) offers a potential boon for AI safety as it allows monitoring a model's CoT to try to understand its intentions and reasoning processes. However, the effectiveness of such monitoring hinges on CoTs faithfully representing models' actual reasoning processes. We evaluate CoT faithfulness of state-of-the-art reasoning models across 6 reasoning hints presented in the prompts and find: (1) for most settings and models tested, CoTs reveal their usage of hints in at least 1% of examples where they use the hint, but the reveal rate is often below 20%, (2) outcome-based reinforcement learning initially improves faithfulness but plateaus without saturating, and (3) when reinforcement learning increases how frequently hints are used (reward hacking), the propensity to verbalize them does not increase, even without training against a CoT monitor. These results suggest that CoT monitoring is a promising way of noticing undesired behaviors during training and evaluations, but that it is not sufficient to rule them out. They also suggest that in settings like ours where CoT reasoning is not necessary, test-time monitoring of CoTs is unlikely to reliably catch rare and catastrophic unexpected behaviors.
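The reveal rate in finding (1) is a conditional fraction: among examples where the model uses the hint, how often the CoT says so. A minimal sketch of that arithmetic follows, assuming each example has already been labeled with two booleans. The `Example` class, its field names, and the toy data are hypothetical and are not taken from the paper, which may define and normalize the metric differently.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Example:
    # Hypothetical labels, assumed to come from some upstream annotation step.
    used_hint: bool        # the model's answer relied on the hint
    verbalized_hint: bool  # the CoT explicitly mentions relying on the hint


def reveal_rate(examples: list[Example]) -> Optional[float]:
    """Fraction of hint-using examples whose CoT reveals the hint."""
    used = [e for e in examples if e.used_hint]
    if not used:
        return None  # undefined when the hint is never used
    return sum(e.verbalized_hint for e in used) / len(used)


# Toy data: four hint-using examples, one of which verbalizes the hint.
examples = [
    Example(used_hint=True, verbalized_hint=True),
    Example(used_hint=True, verbalized_hint=False),
    Example(used_hint=True, verbalized_hint=False),
    Example(used_hint=True, verbalized_hint=False),
    Example(used_hint=False, verbalized_hint=False),
]
rate = reveal_rate(examples)
print(rate)
```

Examples where the hint is not used are excluded from the denominator, so a model that rarely uses hints can still have a low reveal rate.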

citation-role summary

background: 11

citation-polarity summary

background: 9 · support: 2

representative citing papers

Analyzing the Narration Gap in LLM-Solver Loops

cs.AI · 2026-06-17 · unverdicted · novelty 8.0

The narration step in LLM-solver loops is vulnerable to prompt injection that inverts verified solver conclusions, and hardened prompts reduce but do not eliminate the risk under adaptive attacks.

Hidden Thoughts Are Not Secret: Reasoning Trace Exposure in LLMs

cs.AI · 2026-05-30 · unverdicted · novelty 7.0

REP elicits hidden LLM reasoning traces via in-context shadow demonstrations, raising similarity to internal traces while retaining distillation utility across datasets and models.

On the Cost and Benefit of Chain of Thought: A Learning-Theoretic Perspective

cs.LG · 2026-05-20 · unverdicted · novelty 7.0

Chain of Thought risk decomposes into oracle-trajectory benefit and trajectory-mismatch cost, with stability determining bounded, linear, or exponential error growth.

The Readout Shortcut: Positional Number Copying Dominates Arithmetic CoT Readout in Small Language Models

cs.LG · 2026-05-20 · unverdicted · novelty 7.0

In 1-3B instruction-tuned LMs on GSM8K, arithmetic CoT readout is dominated by positional copying of the trailing number before the answer delimiter, accounting for 54-92 percentage points of accuracy.

SWE-Chain: Benchmarking Coding Agents on Chained Release-Level Package Upgrades

cs.SE · 2026-05-14 · unverdicted · novelty 7.0

SWE-Chain provides 155 chained version transitions and 1,660 requirements across 9 Python packages, where frontier agents resolve 44.8% of tasks on average and struggle to preserve functionality across releases.

Do Androids Dream of Breaking the Game? Systematically Auditing AI Agent Benchmarks with BenchJack

cs.AI · 2026-05-12 · conditional · novelty 7.0

BenchJack audits 10 AI agent benchmarks, synthesizes exploits achieving near-perfect scores without task completion, surfaces 219 flaws, and reduces hackable-task ratios to under 10% on four benchmarks via iterative patching.

TIME: Temporally Intelligent Meta-reasoning Engine for Context-Triggered Explicit Reasoning

cs.LG · 2026-01-08 · unverdicted · novelty 7.0

TIME trains LLMs to trigger compact, context-triggered reasoning via time tags and tick events, improving TIMEBench scores while cutting explicit reasoning tokens by an order of magnitude.

Local Causal Attribution of Chain-of-Thought Reasoning

cs.LG · 2026-06-20 · unverdicted · novelty 6.0

AttriCoT is a black-box algorithm that attributes causal importance to units in a specific CoT trace via a structural causal model estimated with linear forward passes.

The Self-Correction Illusion: LLMs Correct Others but Not Themselves

cs.AI · 2026-06-04 · conditional · novelty 6.0

Relabeling an identical erroneous claim from the model's own thought role to an external chat role increases explicit correction rates by 23-93 percentage points across 13 model-domain cells, indicating a chat-template artifact rather than a cognitive deficit.

"I Strongly Suspect This Website Is a Scam": Benchmarking PII Leakage and Detection without Defense in Autonomous Web Agents

cs.CR · 2026-05-30 · unverdicted · novelty 6.0

New benchmark Scammer4U finds 54-93% critical PII leakage from frontier web agents on scam sites versus 0% on benign twins, plus a 30-point gap between verbalized suspicion and actual submission.

Monitoring the Internal Monologue: Probe Trajectories Reveal Reasoning Dynamics

cs.CL · 2026-05-18 · unverdicted · novelty 6.0

Probe trajectories across token positions in LRMs, combined with signal-processing features, improve prediction of future model outputs over static probes on safety and math tasks.

Multi-Stream LLMs: Unblocking Language Models with Parallel Streams of Thoughts, Inputs and Outputs

cs.LG · 2026-05-12 · unverdicted · novelty 6.0

Language models trained on parallel streams of computation can overcome single-stream bottlenecks in autonomous agents by enabling simultaneous reading, thinking, and acting.

When Reasoning Traces Become Performative: Step-Level Evidence that Chain-of-Thought Is an Imperfect Oversight Channel

cs.AI · 2026-05-12 · unverdicted · novelty 6.0

CoT traces align with internal answer commitment in only 61.9% of steps on average, dominated by confabulated continuations after commitment has stabilized.

Drop the Act: Probe-Filtered RL for Faithful Chain-of-Thought Reasoning

cs.LG · 2026-05-12 · conditional · novelty 6.0

ProFIL trains an activation probe on a frozen base model to zero advantages on theatrical post-commitment rollouts in GRPO, cutting theater 11-100%, raising faithful fractions, and shortening chains 4-19% without accuracy loss.

The Open-Box Fallacy: Why AI Deployment Needs a Calibrated Verification Regime

cs.AI · 2026-05-11 · unverdicted · novelty 6.0

AI deployment in high-stakes areas requires domain-scoped calibrated verification with monitoring and revocation, using a proposed six-component Verification Coverage standard instead of mechanistic interpretability.

Weighted Rules under the Stable Model Semantics

cs.AI · 2026-05-10 · unverdicted · novelty 6.0

Weighted rules extend stable model semantics to support probabilistic reasoning, model ranking, and statistical inference in answer set programs.

Compared to What? Baselines and Metrics for Counterfactual Prompting

cs.CL · 2026-05-01 · conditional · novelty 6.0

Counterfactual prompting effects on LLMs are often indistinguishable from those caused by meaning-preserving paraphrases, causing most previously reported demographic sensitivities to disappear under proper statistical comparison.

Meta-Aligner: Bidirectional Preference-Policy Optimization for Multi-Objective LLMs Alignment

cs.LG · 2026-04-27 · unverdicted · novelty 6.0

Meta-Aligner introduces a meta-learner network that produces dynamic preference weights to enable bidirectional optimization between preferences and LLM policy responses for multi-objective alignment.

Reasoning Dynamics and the Limits of Monitoring Modality Reliance in Vision-Language Models

cs.CL · 2026-04-16 · unverdicted · novelty 6.0

VLMs show answer inertia in CoT reasoning and remain influenced by misleading textual cues even with sufficient visual evidence, making CoT an incomplete window into modality reliance.

Unreal Thinking: Chain-of-Thought Hijacking via Two-stage Backdoor

cs.CR · 2026-04-10 · unverdicted · novelty 6.0

A new backdoor technique called TSBH uses reverse tree search to create malicious chain-of-thought data and injects it in two stages to hijack LLM reasoning upon trigger activation.

Measuring and curing reasoning rigidity: from decorative chain-of-thought to genuine faithfulness

cs.CL · 2026-03-24 · unverdicted · novelty 6.0

SLRC quantifies genuine step necessity in LLM reasoning as a causal estimator, LC-CoSR training reduces rigidity with stability guarantees, and evaluations reveal a faithfulness-sycophancy paradox across frontier models.

Decoding the Critique Mechanism in Large Reasoning Models

cs.LG · 2026-03-17 · unverdicted · novelty 6.0

By injecting arithmetic mistakes into CoT reasoning, the paper identifies a hidden critique ability in LRMs and extracts a steerable critique vector that enhances self-correction across model scales.

A Decision-Theoretic Formalisation of Steganography With Applications to LLM Monitoring

cs.AI · 2026-02-26 · unverdicted · novelty 6.0

A decision-theoretic steganographic gap, based on generalized V-information, quantifies and detects steganographic reasoning in LLMs by measuring asymmetry in downstream utility between agents who can and cannot decode hidden content.

Listener-Rewarded Thinking in VLMs for Image Preferences

cs.CV · 2025-06-28 · unverdicted · novelty 6.0

Listener-augmented GRPO uses an independent frozen VLM to provide dense confidence scores on reasoning traces, yielding 67.4% accuracy on ImageReward, up to +6% OOD gains on 1.2M-vote human data, and fewer reasoning contradictions.

citing papers explorer

Analyzing the Narration Gap in LLM-Solver Loops cs.AI · 2026-06-17 · unverdicted · none · ref 11 · internal anchor
The narration step in LLM-solver loops is vulnerable to prompt injection that inverts verified solver conclusions, and hardened prompts reduce but do not eliminate the risk under adaptive attacks.
Hidden Thoughts Are Not Secret: Reasoning Trace Exposure in LLMs cs.AI · 2026-05-30 · unverdicted · none · ref 46 · internal anchor
REP elicits hidden LLM reasoning traces via in-context shadow demonstrations, raising similarity to internal traces while retaining distillation utility across datasets and models.
On the Cost and Benefit of Chain of Thought: A Learning-Theoretic Perspective cs.LG · 2026-05-20 · unverdicted · none · ref 24 · internal anchor
Chain of Thought risk decomposes into oracle-trajectory benefit and trajectory-mismatch cost, with stability determining bounded, linear, or exponential error growth.
The Readout Shortcut: Positional Number Copying Dominates Arithmetic CoT Readout in Small Language Models cs.LG · 2026-05-20 · unverdicted · none · ref 4 · internal anchor
In 1-3B instruction-tuned LMs on GSM8K, arithmetic CoT readout is dominated by positional copying of the trailing number before the answer delimiter, accounting for 54-92 percentage points of accuracy.
SWE-Chain: Benchmarking Coding Agents on Chained Release-Level Package Upgrades cs.SE · 2026-05-14 · unverdicted · none · ref 86 · internal anchor
SWE-Chain provides 155 chained version transitions and 1,660 requirements across 9 Python packages, where frontier agents resolve 44.8% of tasks on average and struggle to preserve functionality across releases.
Do Androids Dream of Breaking the Game? Systematically Auditing AI Agent Benchmarks with BenchJack cs.AI · 2026-05-12 · conditional · none · ref 12 · internal anchor
BenchJack audits 10 AI agent benchmarks, synthesizes exploits achieving near-perfect scores without task completion, surfaces 219 flaws, and reduces hackable-task ratios to under 10% on four benchmarks via iterative patching.
TIME: Temporally Intelligent Meta-reasoning Engine for Context-Triggered Explicit Reasoning cs.LG · 2026-01-08 · unverdicted · none · ref 3 · internal anchor
TIME trains LLMs to trigger compact, context-triggered reasoning via time tags and tick events, improving TIMEBench scores while cutting explicit reasoning tokens by an order of magnitude.
Local Causal Attribution of Chain-of-Thought Reasoning cs.LG · 2026-06-20 · unverdicted · none · ref 6 · internal anchor
AttriCoT is a black-box algorithm that attributes causal importance to units in a specific CoT trace via a structural causal model estimated with linear forward passes.
The Self-Correction Illusion: LLMs Correct Others but Not Themselves cs.AI · 2026-06-04 · conditional · none · ref 5 · internal anchor
Relabeling an identical erroneous claim from the model's own thought role to an external chat role increases explicit correction rates by 23-93 percentage points across 13 model-domain cells, indicating a chat-template artifact rather than a cognitive deficit.
"I Strongly Suspect This Website Is a Scam": Benchmarking PII Leakage and Detection without Defense in Autonomous Web Agents cs.CR · 2026-05-30 · unverdicted · none · ref 118 · internal anchor
New benchmark Scammer4U finds 54-93% critical PII leakage from frontier web agents on scam sites versus 0% on benign twins, plus a 30-point gap between verbalized suspicion and actual submission.
Monitoring the Internal Monologue: Probe Trajectories Reveal Reasoning Dynamics cs.CL · 2026-05-18 · unverdicted · none · ref 10 · internal anchor
Probe trajectories across token positions in LRMs, combined with signal-processing features, improve prediction of future model outputs over static probes on safety and math tasks.
Multi-Stream LLMs: Unblocking Language Models with Parallel Streams of Thoughts, Inputs and Outputs cs.LG · 2026-05-12 · unverdicted · none · ref 2 · internal anchor
Language models trained on parallel streams of computation can overcome single-stream bottlenecks in autonomous agents by enabling simultaneous reading, thinking, and acting.
When Reasoning Traces Become Performative: Step-Level Evidence that Chain-of-Thought Is an Imperfect Oversight Channel cs.AI · 2026-05-12 · unverdicted · none · ref 8 · internal anchor
CoT traces align with internal answer commitment in only 61.9% of steps on average, dominated by confabulated continuations after commitment has stabilized.
Drop the Act: Probe-Filtered RL for Faithful Chain-of-Thought Reasoning cs.LG · 2026-05-12 · conditional · none · ref 4 · internal anchor
ProFIL trains an activation probe on a frozen base model to zero advantages on theatrical post-commitment rollouts in GRPO, cutting theater 11-100%, raising faithful fractions, and shortening chains 4-19% without accuracy loss.
The Open-Box Fallacy: Why AI Deployment Needs a Calibrated Verification Regime cs.AI · 2026-05-11 · unverdicted · none · ref 2 · internal anchor
AI deployment in high-stakes areas requires domain-scoped calibrated verification with monitoring and revocation, using a proposed six-component Verification Coverage standard instead of mechanistic interpretability.
Weighted Rules under the Stable Model Semantics cs.AI · 2026-05-10 · unverdicted · none · ref 54 · internal anchor
Weighted rules extend stable model semantics to support probabilistic reasoning, model ranking, and statistical inference in answer set programs.
Compared to What? Baselines and Metrics for Counterfactual Prompting cs.CL · 2026-05-01 · conditional · none · ref 48 · internal anchor
Counterfactual prompting effects on LLMs are often indistinguishable from those caused by meaning-preserving paraphrases, causing most previously reported demographic sensitivities to disappear under proper statistical comparison.
Meta-Aligner: Bidirectional Preference-Policy Optimization for Multi-Objective LLMs Alignment cs.LG · 2026-04-27 · unverdicted · none · ref 3 · internal anchor
Meta-Aligner introduces a meta-learner network that produces dynamic preference weights to enable bidirectional optimization between preferences and LLM policy responses for multi-objective alignment.
Reasoning Dynamics and the Limits of Monitoring Modality Reliance in Vision-Language Models cs.CL · 2026-04-16 · unverdicted · none · ref 5 · internal anchor
VLMs show answer inertia in CoT reasoning and remain influenced by misleading textual cues even with sufficient visual evidence, making CoT an incomplete window into modality reliance.
Unreal Thinking: Chain-of-Thought Hijacking via Two-stage Backdoor cs.CR · 2026-04-10 · unverdicted · none · ref 14 · internal anchor
A new backdoor technique called TSBH uses reverse tree search to create malicious chain-of-thought data and injects it in two stages to hijack LLM reasoning upon trigger activation.
Measuring and curing reasoning rigidity: from decorative chain-of-thought to genuine faithfulness cs.CL · 2026-03-24 · unverdicted · none · ref 4 · internal anchor
SLRC quantifies genuine step necessity in LLM reasoning as a causal estimator, LC-CoSR training reduces rigidity with stability guarantees, and evaluations reveal a faithfulness-sycophancy paradox across frontier models.
Decoding the Critique Mechanism in Large Reasoning Models cs.LG · 2026-03-17 · unverdicted · none · ref 1 · internal anchor
By injecting arithmetic mistakes into CoT reasoning, the paper identifies a hidden critique ability in LRMs and extracts a steerable critique vector that enhances self-correction across model scales.
A Decision-Theoretic Formalisation of Steganography With Applications to LLM Monitoring cs.AI · 2026-02-26 · unverdicted · none · ref 1 · internal anchor
A decision-theoretic steganographic gap, based on generalized V-information, quantifies and detects steganographic reasoning in LLMs by measuring asymmetry in downstream utility between agents who can and cannot decode hidden content.
Listener-Rewarded Thinking in VLMs for Image Preferences cs.CV · 2025-06-28 · unverdicted · none · ref 6 · internal anchor
Listener-augmented GRPO uses an independent frozen VLM to provide dense confidence scores on reasoning traces, yielding 67.4% accuracy on ImageReward, up to +6% OOD gains on 1.2M-vote human data, and fewer reasoning contradictions.
The Illusion of Thinking: Understanding the Strengths and Limitations of Reasoning Models via the Lens of Problem Complexity cs.AI · 2025-06-07 · unverdicted · none · ref 28 · internal anchor
LRMs exhibit complete accuracy collapse beyond certain puzzle complexities, with reasoning effort rising then declining, outperforming standard LLMs only on medium-complexity tasks.
Probabilistic Tiny Recursive Model cs.AI · 2026-05-19 · conditional · none · ref 21 · internal anchor
PTRM adds stochastic Gaussian noise to Tiny Recursive Model recursion for parallel trajectory exploration and Q-head selection, raising Sudoku-Extreme accuracy from 87.4% to 98.75% and Pencil Puzzle Bench from 62.6% to 91.2% without retraining.
What properties of reasoning supervision are associated with improved downstream model quality? cs.AI · 2026-05-13 · unverdicted · none · ref 5 · internal anchor
Intrinsic data metrics predict reasoning dataset utility for model fine-tuning, with different predictors working best for smaller versus larger models.
CoT-Guard: Small Models for Strong Monitoring cs.CR · 2026-05-12 · unverdicted · none · ref 23 · internal anchor
CoT-Guard is a 4B model using SFT and RL that achieves 75% G-mean^2 on hidden objective detection under prompt and code manipulation attacks, outperforming several larger models.
Evaluating the False Trust Engendered by LLM Explanations cs.HC · 2026-05-11 · unverdicted · none · ref 8 · 2 links · internal anchor
LLM reasoning traces and post-hoc explanations increase false trust in incorrect predictions, whereas contrastive dual explanations enhance users' ability to distinguish correct from incorrect AI outputs.
Medical Model Synthesis Architectures: A Case Study cs.AI · 2026-05-10 · unverdicted · none · ref 6 · internal anchor
MedMSA framework retrieves knowledge via language models then builds formal probabilistic models to produce uncertainty-weighted differential diagnoses from symptoms.
How Well Do LLMs Perform on the Simplest Long-Chain Reasoning Tasks: An Empirical Study on the Equivalence Class Problem cs.AI · 2026-05-07 · unverdicted · none · ref 28 · internal anchor
Non-reasoning LLMs fail the equivalence class problem while reasoning LLMs perform better but remain incomplete, with difficulty peaking at phase transition for the former and maximum diameter for the latter.
LLM Reasoning Is Latent, Not the Chain of Thought cs.AI · 2026-04-17 · unverdicted · none · ref 30 · internal anchor
LLM reasoning is primarily mediated by latent-state trajectories rather than by explicit surface chain-of-thought outputs.
Reward Hacking in the Era of Large Models: Mechanisms, Emergent Misalignment, Challenges cs.LG · 2026-04-15 · unverdicted · none · ref 50 · internal anchor
The paper introduces the Proxy Compression Hypothesis as a unifying framework explaining reward hacking in RLHF as an emergent result of compressing high-dimensional human objectives into proxy reward signals under optimization pressure.
Does the Model Say What the Data Says? A Simple Heuristic for Model Data Alignment cs.LG · 2025-11-26 · unverdicted · none · ref 7 · internal anchor
A data-derived baseline using feature effects on binary outcomes provides a model-agnostic way to check if machine learning explanations align with the underlying data structure.
A Model Can Help Itself: Reward-Free Self-Training for LLM Reasoning cs.LG · 2025-10-21 · unverdicted · none · ref 4 · 2 links · internal anchor
SePT alternates self-generation of responses at controlled temperatures with training on the latest model outputs, yielding gains over a strong no-training baseline on six math reasoning benchmarks.
Bridging Legal Interpretation and Formal Logic: Faithfulness, Assumption, and the Future of AI Legal Reasoning cs.AI · 2026-05-13 · unverdicted · none · ref 19 · internal anchor
A neuro-symbolic system is proposed that uses formal logic to constrain LLM outputs so legal inferences stay faithful to source text.
LLMs Should Not Yet Be Credited with Decision Explanation cs.AI · 2026-05-01 · unverdicted · none · ref 35 · internal anchor
LLMs support decision prediction and rationale generation but lack evidence for genuine decision explanation, requiring stricter standards to avoid over-crediting.
Knowledge Distillation Must Account for What It Loses cs.LG · 2026-04-28 · unverdicted · none · ref 14 · 2 links · internal anchor
Knowledge distillation evaluations must report lost teacher capabilities via a Distillation Loss Statement rather than relying solely on task scores.
Risk Reporting for Developers' Internal AI Model Use cs.CY · 2026-04-27 · unverdicted · none · ref 9 · internal anchor
A harmonized risk reporting standard for internal frontier AI model use, structured around autonomous misbehavior and insider threats using means, motive, and opportunity factors.
Context Is What You Need: The Maximum Effective Context Window for Real World Limits of LLMs cs.CL · 2025-09-21 · unverdicted · none · ref 1 · internal anchor
The study defines Maximum Effective Context Window (MECW) and reports through large-scale testing that it is far smaller than advertised Maximum Context Window (MCW) and varies by task type.
LLMs in Qualitative Research: Opportunities, Limitations, and Practical Considerations cs.HC · 2026-05-15 · unverdicted · none · ref 16 · internal anchor
The paper outlines opportunities, limitations, and practical parameters for integrating LLMs into qualitative research while aligning with epistemological commitments like reflexivity and interpretive judgment.
Neurosymbolic Learning for Inference-Time Argumentation cs.AI · 2026-05-19 · unreviewed · ref 5 · internal anchor
LLMs as ASP Programmers: Self-Correction Enables Task-Agnostic Nonmonotonic Reasoning cs.AI · 2026-04-30 · unreviewed · ref 1 · internal anchor
RadAgent: A tool-using AI agent for stepwise interpretation of chest computed tomography cs.AI · 2026-04-16 · unreviewed · ref 21 · internal anchor
