hub Canonical reference

Chain-of-thought reasoning in the wild is not always faithful.arXiv preprint:2503.08679

Arthur Conmy · 2025 · cs.AI · arXiv 2503.08679

Canonical reference. 80% of citing Pith papers cite this work as background.

23 Pith papers citing it

Background 80% of classified citations

open full Pith review browse 23 citing papers arXiv PDF

abstract

Recent studies indicate that when faced with explicit biases in prompts, models often omit mentioning these biases in their Chain-of-Thought (CoT) output, revealing that verbalized reasoning can give an incorrect picture of how models arrive at conclusions (unfaithfulness). In this work, we show that unfaithful CoT also occurs on naturally worded, non-adversarial prompts without adding artificial biases or editing model outputs. We find that when separately presented with the questions "Is X bigger than Y?" and "Is Y bigger than X?", models sometimes produce superficially coherent arguments to justify systematically answering Yes to both or No to both, despite the contradiction. We present preliminary evidence that this is due to models' implicit biases towards Yes or No, labeling this Implicit Post-Hoc Rationalization. Our results reveal rates up to 13% for production models, and while frontier models are more faithful, none are entirely so, including thinking models like DeepSeek R1 (0.37%) and Sonnet 3.7 with thinking (0.04%). We also investigate Unfaithful Illogical Shortcuts, where models use subtly illogical reasoning to make speculative answers to hard math problems seem rigorously proven. Our findings indicate that while CoT can be useful for assessing outputs, it is not a complete account of the internal process that produced the model's answer and should be used with caution in agentic or safety-critical settings.

hub tools

JSON dossier citing papers JSON arXiv source

citation-role summary

background 4 other 1

citation-polarity summary

background 4 unclear 1

representative citing papers

What Makes Chain-of-Thought Work at Probe Time? Local Co-occurrence Rather Than Global Derivation

cs.AI · 2026-05-26 · unverdicted · novelty 7.0

CoT probe-time gains arise primarily from lexical activation and short-range token co-occurrence rather than sentence-level logical derivation.

The Readout Shortcut: Positional Number Copying Dominates Arithmetic CoT Readout in Small Language Models

cs.LG · 2026-05-20 · unverdicted · novelty 7.0

In 1-3B instruction-tuned LMs on GSM8K, arithmetic CoT readout is dominated by positional copying of the trailing number before the answer delimiter, accounting for 54-92 percentage points of accuracy.

Deep Reasoning in General Purpose Agents via Structured Meta-Cognition

cs.CL · 2026-05-12 · unverdicted · novelty 7.0

DOLORES, an agent using a formal language for meta-reasoning to construct adaptive scaffolds on the fly, outperforms prior scaffolding methods by 24.8% on average across four hard benchmarks and multiple model sizes.

Scaling Latent Reasoning via Looped Language Models

cs.CL · 2025-10-29 · unverdicted · novelty 7.0

Looped language models with latent iterative computation and entropy-regularized depth allocation achieve performance matching up to 12B standard LLMs through superior knowledge manipulation.

Detecting Unfaithful Chain-of-Thought via Circuit-Guided Internal-External Discrepancy

cs.AI · 2026-05-25 · unverdicted · novelty 6.0

CIE-Scorer detects unfaithful CoT by tracing compact sentence-level circuits, building internal-external reasoning graphs, and scoring their discrepancy with Fused Gromov-Wasserstein distance, reporting SOTA results on FaithCoT-Bench with reduced circuit cost.

Understanding and Mitigating Premature Confidence for Better LLM Reasoning

cs.AI · 2026-05-23 · unverdicted · novelty 6.0

Premature confidence in LLM chains of thought predicts flawed reasoning and is mitigated by progressive confidence shaping, a label-free RL objective that yields accuracy gains on arithmetic, math, and science tasks.

Faithfulness as Information Flow: Evaluating and Training Faithful Chain-of-Thought Reasoning

cs.LG · 2026-05-22 · unverdicted · novelty 6.0

Faithful chain-of-thought routes answer-relevant information through the CoT path, measured via sufficiency, completeness and necessity with entropy, masked-KL and gradient diagnostics, and improved by information-flow interventions during verifier-based RL.

Diagnosing Multi-step Reasoning Failures in Black-box LLMs via Stepwise Confidence Attribution

cs.CL · 2026-05-19 · unverdicted · novelty 6.0

SCA applies the Information Bottleneck principle via NIBS and GIBS methods to identify erroneous steps in black-box LLM reasoning and boosts self-correction success by up to 13.5%.

When Reasoning Traces Become Performative: Step-Level Evidence that Chain-of-Thought Is an Imperfect Oversight Channel

cs.AI · 2026-05-12 · unverdicted · novelty 6.0

CoT traces align with internal answer commitment in only 61.9% of steps on average, dominated by confabulated continuations after commitment has stabilized.

Confidence-Aware Alignment Makes Reasoning LLMs More Reliable

cs.AI · 2026-05-08 · unverdicted · novelty 6.0

CASPO trains LLMs via iterative direct preference optimization so that token-level confidence tracks step-wise correctness, then applies Confidence-aware Thought pruning at inference to improve both reliability and speed on reasoning benchmarks.

Compared to What? Baselines and Metrics for Counterfactual Prompting

cs.CL · 2026-05-01 · conditional · novelty 6.0

Counterfactual prompting effects on LLMs are often indistinguishable from those caused by meaning-preserving paraphrases, causing most previously reported demographic sensitivities to disappear under proper statistical comparison.

ReSS: Learning Reasoning Models for Tabular Data Prediction via Symbolic Scaffold

cs.AI · 2026-04-15 · unverdicted · novelty 6.0 · 2 refs

ReSS extracts decision paths from trees as scaffolds to guide LLM reasoning generation, fine-tunes the LLM on the resulting dataset with scaffold-invariant augmentation, and reports up to 10% gains on medical and financial tabular benchmarks with new faithfulness metrics.

When AI Persuades: Adversarial Explanation Attacks on Human Trust in AI-Assisted Decision Making

cs.AI · 2026-02-03 · unverdicted · novelty 6.0

Adversarial explanation attacks preserve nearly all human trust in wrong AI outputs by using persuasive framing, shown in a study varying reasoning, evidence, style, and format with over 200 participants.

Interpretability from the Ground Up: Stakeholder-Centric Design of Automated Scoring in Educational Assessments

cs.CL · 2025-11-21 · unverdicted · novelty 6.0

AnalyticScore applies new FGTI interpretability principles to text-based scoring and achieves accuracy within 0.06 QWK of uninterpretable state-of-the-art while matching human featurization on the ASAP-SAS dataset.

Reinforcement Learning with Verifiable Rewards Implicitly Incentivizes Correct Reasoning in Base LLMs

cs.AI · 2025-06-17 · unverdicted · novelty 6.0

RLVR incentivizes correct reasoning in base LLMs, extending reasoning boundaries on math and coding tasks as shown by CoT-Pass@K evaluations and a theoretical incentive framework.

Conversable Complexity: Agentic LLM Collectives as Interpretable Substrates

cs.CL · 2026-07-01 · unverdicted · novelty 5.0

Agentic LLM collectives are proposed as natural-language-interpretable computational substrates for ALife research.

Evaluating the False Trust Engendered by LLM Explanations

cs.HC · 2026-05-11 · unverdicted · novelty 5.0 · 2 refs

LLM reasoning traces and post-hoc explanations increase false trust in incorrect predictions, whereas contrastive dual explanations enhance users' ability to distinguish correct from incorrect AI outputs.

LLM Reasoning Is Latent, Not the Chain of Thought

cs.AI · 2026-04-17 · unverdicted · novelty 5.0

LLM reasoning is primarily mediated by latent-state trajectories rather than by explicit surface chain-of-thought outputs.

KnowRL: Exploring Knowledgeable Reinforcement Learning for Factuality

cs.AI · 2025-06-24 · unverdicted · novelty 5.0

KnowRL integrates a knowledge-verification factuality reward into RL training to enforce fact-based reasoning steps and lower hallucination rates in LLMs.

Phi-4-reasoning Technical Report

cs.AI · 2025-04-30 · unverdicted · novelty 4.0

A 14B reasoning model trained via supervised fine-tuning on selected prompts and o3-mini traces, plus outcome RL, outperforms larger open models like DeepSeek-R1-Distill-Llama-70B on math, coding, planning and related benchmarks.

A Survey of Reinforcement Learning for Large Reasoning Models

cs.CL · 2025-09-10 · accept · novelty 3.0

A survey compiling RL methods, challenges, data resources, and applications for enhancing reasoning in large language models and large reasoning models since DeepSeek-R1.

Can Aha Moments Be Fake? Towards Quantifying Decorative and True Thinking in Chain-of-Thought

cs.LG · 2025-10-28

TinyTroupe: An LLM-powered Multiagent Persona Simulation Toolkit

cs.MA · 2025-07-13

citing papers explorer

Showing 23 of 23 citing papers.

What Makes Chain-of-Thought Work at Probe Time? Local Co-occurrence Rather Than Global Derivation cs.AI · 2026-05-26 · unverdicted · none · ref 20 · internal anchor
CoT probe-time gains arise primarily from lexical activation and short-range token co-occurrence rather than sentence-level logical derivation.
The Readout Shortcut: Positional Number Copying Dominates Arithmetic CoT Readout in Small Language Models cs.LG · 2026-05-20 · unverdicted · none · ref 2 · internal anchor
In 1-3B instruction-tuned LMs on GSM8K, arithmetic CoT readout is dominated by positional copying of the trailing number before the answer delimiter, accounting for 54-92 percentage points of accuracy.
Deep Reasoning in General Purpose Agents via Structured Meta-Cognition cs.CL · 2026-05-12 · unverdicted · none · ref 88 · internal anchor
DOLORES, an agent using a formal language for meta-reasoning to construct adaptive scaffolds on the fly, outperforms prior scaffolding methods by 24.8% on average across four hard benchmarks and multiple model sizes.
Scaling Latent Reasoning via Looped Language Models cs.CL · 2025-10-29 · unverdicted · none · ref 74 · internal anchor
Looped language models with latent iterative computation and entropy-regularized depth allocation achieve performance matching up to 12B standard LLMs through superior knowledge manipulation.
Detecting Unfaithful Chain-of-Thought via Circuit-Guided Internal-External Discrepancy cs.AI · 2026-05-25 · unverdicted · none · ref 4 · internal anchor
CIE-Scorer detects unfaithful CoT by tracing compact sentence-level circuits, building internal-external reasoning graphs, and scoring their discrepancy with Fused Gromov-Wasserstein distance, reporting SOTA results on FaithCoT-Bench with reduced circuit cost.
Understanding and Mitigating Premature Confidence for Better LLM Reasoning cs.AI · 2026-05-23 · unverdicted · none · ref 2 · internal anchor
Premature confidence in LLM chains of thought predicts flawed reasoning and is mitigated by progressive confidence shaping, a label-free RL objective that yields accuracy gains on arithmetic, math, and science tasks.
Faithfulness as Information Flow: Evaluating and Training Faithful Chain-of-Thought Reasoning cs.LG · 2026-05-22 · unverdicted · none · ref 8 · internal anchor
Faithful chain-of-thought routes answer-relevant information through the CoT path, measured via sufficiency, completeness and necessity with entropy, masked-KL and gradient diagnostics, and improved by information-flow interventions during verifier-based RL.
Diagnosing Multi-step Reasoning Failures in Black-box LLMs via Stepwise Confidence Attribution cs.CL · 2026-05-19 · unverdicted · none · ref 38 · internal anchor
SCA applies the Information Bottleneck principle via NIBS and GIBS methods to identify erroneous steps in black-box LLM reasoning and boosts self-correction success by up to 13.5%.
When Reasoning Traces Become Performative: Step-Level Evidence that Chain-of-Thought Is an Imperfect Oversight Channel cs.AI · 2026-05-12 · unverdicted · none · ref 2 · internal anchor
CoT traces align with internal answer commitment in only 61.9% of steps on average, dominated by confabulated continuations after commitment has stabilized.
Confidence-Aware Alignment Makes Reasoning LLMs More Reliable cs.AI · 2026-05-08 · unverdicted · none · ref 1 · internal anchor
CASPO trains LLMs via iterative direct preference optimization so that token-level confidence tracks step-wise correctness, then applies Confidence-aware Thought pruning at inference to improve both reliability and speed on reasoning benchmarks.
Compared to What? Baselines and Metrics for Counterfactual Prompting cs.CL · 2026-05-01 · conditional · none · ref 47 · internal anchor
Counterfactual prompting effects on LLMs are often indistinguishable from those caused by meaning-preserving paraphrases, causing most previously reported demographic sensitivities to disappear under proper statistical comparison.
ReSS: Learning Reasoning Models for Tabular Data Prediction via Symbolic Scaffold cs.AI · 2026-04-15 · unverdicted · none · ref 3 · 2 links · internal anchor
ReSS extracts decision paths from trees as scaffolds to guide LLM reasoning generation, fine-tunes the LLM on the resulting dataset with scaffold-invariant augmentation, and reports up to 10% gains on medical and financial tabular benchmarks with new faithfulness metrics.
When AI Persuades: Adversarial Explanation Attacks on Human Trust in AI-Assisted Decision Making cs.AI · 2026-02-03 · unverdicted · none · ref 5 · internal anchor
Adversarial explanation attacks preserve nearly all human trust in wrong AI outputs by using persuasive framing, shown in a study varying reasoning, evidence, style, and format with over 200 participants.
Interpretability from the Ground Up: Stakeholder-Centric Design of Automated Scoring in Educational Assessments cs.CL · 2025-11-21 · unverdicted · none · ref 2 · internal anchor
AnalyticScore applies new FGTI interpretability principles to text-based scoring and achieves accuracy within 0.06 QWK of uninterpretable state-of-the-art while matching human featurization on the ASAP-SAS dataset.
Reinforcement Learning with Verifiable Rewards Implicitly Incentivizes Correct Reasoning in Base LLMs cs.AI · 2025-06-17 · unverdicted · none · ref 1 · internal anchor
RLVR incentivizes correct reasoning in base LLMs, extending reasoning boundaries on math and coding tasks as shown by CoT-Pass@K evaluations and a theoretical incentive framework.
Conversable Complexity: Agentic LLM Collectives as Interpretable Substrates cs.CL · 2026-07-01 · unverdicted · none · ref 83 · internal anchor
Agentic LLM collectives are proposed as natural-language-interpretable computational substrates for ALife research.
Evaluating the False Trust Engendered by LLM Explanations cs.HC · 2026-05-11 · unverdicted · none · ref 52 · 2 links · internal anchor
LLM reasoning traces and post-hoc explanations increase false trust in incorrect predictions, whereas contrastive dual explanations enhance users' ability to distinguish correct from incorrect AI outputs.
LLM Reasoning Is Latent, Not the Chain of Thought cs.AI · 2026-04-17 · unverdicted · none · ref 29 · internal anchor
LLM reasoning is primarily mediated by latent-state trajectories rather than by explicit surface chain-of-thought outputs.
KnowRL: Exploring Knowledgeable Reinforcement Learning for Factuality cs.AI · 2025-06-24 · unverdicted · none · ref 1 · internal anchor
KnowRL integrates a knowledge-verification factuality reward into RL training to enforce fact-based reasoning steps and lower hallucination rates in LLMs.
Phi-4-reasoning Technical Report cs.AI · 2025-04-30 · unverdicted · none · ref 8 · internal anchor
A 14B reasoning model trained via supervised fine-tuning on selected prompts and o3-mini traces, plus outcome RL, outperforms larger open models like DeepSeek-R1-Distill-Llama-70B on math, coding, planning and related benchmarks.
A Survey of Reinforcement Learning for Large Reasoning Models cs.CL · 2025-09-10 · accept · none · ref 20 · internal anchor
A survey compiling RL methods, challenges, data resources, and applications for enhancing reasoning in large language models and large reasoning models since DeepSeek-R1.
Can Aha Moments Be Fake? Towards Quantifying Decorative and True Thinking in Chain-of-Thought cs.LG · 2025-10-28 · unreviewed · ref 1 · internal anchor
TinyTroupe: An LLM-powered Multiagent Persona Simulation Toolkit cs.MA · 2025-07-13 · unreviewed · ref 1 · internal anchor

Chain-of-thought reasoning in the wild is not always faithful.arXiv preprint:2503.08679

hub tools

citation-role summary

citation-polarity summary

fields

years

verdicts

roles

polarities

representative citing papers

citing papers explorer