citation dossier

Reasoning models don’t always say what they think

Yanda Chen, Joe Benton, Ansh Radhakrishnan, Jonathan Uesato, Carson Denison, John Schulman, Arushi Somani, Peter Hase, Misha Wagner, Fabien Roger, Vlad Mikulik, Samuel R Bowman, Jan Leike, Jared Kaplan, and Ethan Perez · 2025 · arXiv 2505.05410

19 Pith papers citing it

20 reference links

cs.AI top field · 9 papers

UNVERDICTED top verdict bucket · 17 papers

This arXiv-backed work is queued for full Pith review once it crosses the high-inbound sweep. That review runs these stages: reader · skeptic · desk-editor · referee · rebuttal · circularity · lean confirmation · RS check · pith extraction.

why this work matters in Pith

Pith has found this work in 19 reviewed papers. Its strongest current cluster is cs.AI (9 papers). The largest review-status bucket among citing papers is UNVERDICTED (17 papers). For highly cited works, this page shows a dossier first and a bounded explorer second; it never tries to render every citing paper at once.

representative citing papers

LLMs as ASP Programmers: Self-Correction Enables Task-Agnostic Nonmonotonic Reasoning

cs.AI · 2026-04-30 · unverdicted · novelty 7.0

LLM+ASP framework enables task-agnostic nonmonotonic reasoning by having LLMs generate and self-correct ASP programs using solver feedback, outperforming SMT alternatives on diverse benchmarks.

Multi-Stream LLMs: Unblocking Language Models with Parallel Streams of Thoughts, Inputs and Outputs

cs.LG · 2026-05-12 · unverdicted · novelty 6.0

Language models trained on parallel streams of computation can overcome single-stream bottlenecks in autonomous agents by enabling simultaneous reading, thinking, and acting.

When Reasoning Traces Become Performative: Step-Level Evidence that Chain-of-Thought Is an Imperfect Oversight Channel

cs.AI · 2026-05-12 · unverdicted · novelty 6.0

CoT traces align with internal answer commitment in only 61.9% of steps on average, dominated by confabulated continuations after commitment has stabilized.

Drop the Act: Probe-Filtered RL for Faithful Chain-of-Thought Reasoning

cs.LG · 2026-05-12 · conditional · novelty 6.0

ProFIL trains an activation probe on a frozen base model to zero advantages on theatrical post-commitment rollouts in GRPO, cutting theater 11-100%, raising faithful fractions, and shortening chains 4-19% without accuracy loss.

Evaluating the False Trust engendered by LLM Explanations

cs.HC · 2026-05-11 · unverdicted · novelty 6.0

A user study finds that LLM reasoning traces and post-hoc explanations create false trust by increasing acceptance of incorrect answers, whereas contrastive dual explanations improve users' ability to detect errors.

The Open-Box Fallacy: Why AI Deployment Needs a Calibrated Verification Regime

cs.AI · 2026-05-11 · unverdicted · novelty 6.0

AI deployment in high-stakes areas requires domain-scoped calibrated verification with monitoring and revocation, using a proposed six-component Verification Coverage standard instead of mechanistic interpretability.

Weighted Rules under the Stable Model Semantics

cs.AI · 2026-05-10 · unverdicted · novelty 6.0

Weighted rules extend stable model semantics to support probabilistic reasoning, model ranking, and statistical inference in answer set programs.

Compared to What? Baselines and Metrics for Counterfactual Prompting

cs.CL · 2026-05-01 · conditional · novelty 6.0

Counterfactual prompting effects on LLMs are often indistinguishable from those caused by meaning-preserving paraphrases, causing most previously reported demographic sensitivities to disappear under proper statistical comparison.

Meta-Aligner: Bidirectional Preference-Policy Optimization for Multi-Objective LLMs Alignment

cs.LG · 2026-04-27 · unverdicted · novelty 6.0

Meta-Aligner introduces a meta-learner network that produces dynamic preference weights to enable bidirectional optimization between preferences and LLM policy responses for multi-objective alignment.

RadAgent: A tool-using AI agent for stepwise interpretation of chest computed tomography

cs.AI · 2026-04-16 · unverdicted · novelty 6.0

RadAgent generates stepwise, tool-augmented chest CT reports with traceable decisions, improving accuracy, robustness, and adding a 37% faithfulness score absent in standard 3D VLMs.

Reasoning Dynamics and the Limits of Monitoring Modality Reliance in Vision-Language Models

cs.CL · 2026-04-16 · unverdicted · novelty 6.0

VLMs show answer inertia in CoT reasoning and remain influenced by misleading textual cues even with sufficient visual evidence, making CoT an incomplete window into modality reliance.

Unreal Thinking: Chain-of-Thought Hijacking via Two-stage Backdoor

cs.CR · 2026-04-10 · unverdicted · novelty 6.0

A new backdoor technique called TSBH uses reverse tree search to create malicious chain-of-thought data and injects it in two stages to hijack LLM reasoning upon trigger activation.

Medical Model Synthesis Architectures: A Case Study

cs.AI · 2026-05-10 · unverdicted · novelty 5.0

MedMSA framework retrieves knowledge via language models then builds formal probabilistic models to produce uncertainty-weighted differential diagnoses from symptoms.

How Well Do LLMs Perform on the Simplest Long-Chain Reasoning Tasks: An Empirical Study on the Equivalence Class Problem

cs.AI · 2026-05-07 · unverdicted · novelty 5.0

Non-reasoning LLMs fail the equivalence class problem while reasoning LLMs perform better but remain incomplete, with difficulty peaking at phase transition for the former and maximum diameter for the latter.

LLM Reasoning Is Latent, Not the Chain of Thought

cs.AI · 2026-04-17 · unverdicted · novelty 5.0

LLM reasoning is primarily mediated by latent-state trajectories rather than by explicit surface chain-of-thought outputs.

Reward Hacking in the Era of Large Models: Mechanisms, Emergent Misalignment, Challenges

cs.LG · 2026-04-15 · unverdicted · novelty 5.0

The paper introduces the Proxy Compression Hypothesis as a unifying framework explaining reward hacking in RLHF as an emergent result of compressing high-dimensional human objectives into proxy reward signals under optimization pressure.

LLMs Should Not Yet Be Credited with Decision Explanation

cs.AI · 2026-05-01 · unverdicted · novelty 4.0

LLMs support decision prediction and rationale generation but lack evidence for genuine decision explanation, requiring stricter standards to avoid over-crediting.

Knowledge Distillation Must Account for What It Loses

cs.LG · 2026-04-28 · unverdicted · novelty 4.0 · 2 refs

Knowledge distillation evaluations must report lost teacher capabilities via a Distillation Loss Statement rather than relying solely on task scores.

Risk Reporting for Developers' Internal AI Model Use

cs.CY · 2026-04-27 · unverdicted · novelty 4.0

A harmonized risk reporting standard for internal frontier AI model use, structured around autonomous misbehavior and insider threats using means, motive, and opportunity factors.

citing papers explorer

Showing 19 of 19 citing papers.

LLMs as ASP Programmers: Self-Correction Enables Task-Agnostic Nonmonotonic Reasoning · cs.AI · 2026-04-30 · unverdicted · none · ref 1
LLM+ASP framework enables task-agnostic nonmonotonic reasoning by having LLMs generate and self-correct ASP programs using solver feedback, outperforming SMT alternatives on diverse benchmarks.
Multi-Stream LLMs: Unblocking Language Models with Parallel Streams of Thoughts, Inputs and Outputs · cs.LG · 2026-05-12 · unverdicted · none · ref 2
Language models trained on parallel streams of computation can overcome single-stream bottlenecks in autonomous agents by enabling simultaneous reading, thinking, and acting.
When Reasoning Traces Become Performative: Step-Level Evidence that Chain-of-Thought Is an Imperfect Oversight Channel · cs.AI · 2026-05-12 · unverdicted · none · ref 8
CoT traces align with internal answer commitment in only 61.9% of steps on average, dominated by confabulated continuations after commitment has stabilized.
Drop the Act: Probe-Filtered RL for Faithful Chain-of-Thought Reasoning · cs.LG · 2026-05-12 · conditional · none · ref 4
ProFIL trains an activation probe on a frozen base model to zero advantages on theatrical post-commitment rollouts in GRPO, cutting theater 11-100%, raising faithful fractions, and shortening chains 4-19% without accuracy loss.
Evaluating the False Trust engendered by LLM Explanations · cs.HC · 2026-05-11 · unverdicted · none · ref 8
A user study finds that LLM reasoning traces and post-hoc explanations create false trust by increasing acceptance of incorrect answers, whereas contrastive dual explanations improve users' ability to detect errors.
The Open-Box Fallacy: Why AI Deployment Needs a Calibrated Verification Regime · cs.AI · 2026-05-11 · unverdicted · none · ref 2
AI deployment in high-stakes areas requires domain-scoped calibrated verification with monitoring and revocation, using a proposed six-component Verification Coverage standard instead of mechanistic interpretability.
Weighted Rules under the Stable Model Semantics · cs.AI · 2026-05-10 · unverdicted · none · ref 54
Weighted rules extend stable model semantics to support probabilistic reasoning, model ranking, and statistical inference in answer set programs.
Compared to What? Baselines and Metrics for Counterfactual Prompting · cs.CL · 2026-05-01 · conditional · none · ref 48
Counterfactual prompting effects on LLMs are often indistinguishable from those caused by meaning-preserving paraphrases, causing most previously reported demographic sensitivities to disappear under proper statistical comparison.
Meta-Aligner: Bidirectional Preference-Policy Optimization for Multi-Objective LLMs Alignment · cs.LG · 2026-04-27 · unverdicted · none · ref 3
Meta-Aligner introduces a meta-learner network that produces dynamic preference weights to enable bidirectional optimization between preferences and LLM policy responses for multi-objective alignment.
RadAgent: A tool-using AI agent for stepwise interpretation of chest computed tomography · cs.AI · 2026-04-16 · unverdicted · none · ref 21
RadAgent generates stepwise, tool-augmented chest CT reports with traceable decisions, improving accuracy, robustness, and adding a 37% faithfulness score absent in standard 3D VLMs.
Reasoning Dynamics and the Limits of Monitoring Modality Reliance in Vision-Language Models · cs.CL · 2026-04-16 · unverdicted · none · ref 5
VLMs show answer inertia in CoT reasoning and remain influenced by misleading textual cues even with sufficient visual evidence, making CoT an incomplete window into modality reliance.
Unreal Thinking: Chain-of-Thought Hijacking via Two-stage Backdoor · cs.CR · 2026-04-10 · unverdicted · none · ref 14
A new backdoor technique called TSBH uses reverse tree search to create malicious chain-of-thought data and injects it in two stages to hijack LLM reasoning upon trigger activation.
Medical Model Synthesis Architectures: A Case Study · cs.AI · 2026-05-10 · unverdicted · none · ref 6
MedMSA framework retrieves knowledge via language models then builds formal probabilistic models to produce uncertainty-weighted differential diagnoses from symptoms.
How Well Do LLMs Perform on the Simplest Long-Chain Reasoning Tasks: An Empirical Study on the Equivalence Class Problem · cs.AI · 2026-05-07 · unverdicted · none · ref 28
Non-reasoning LLMs fail the equivalence class problem while reasoning LLMs perform better but remain incomplete, with difficulty peaking at phase transition for the former and maximum diameter for the latter.
LLM Reasoning Is Latent, Not the Chain of Thought · cs.AI · 2026-04-17 · unverdicted · none · ref 30
LLM reasoning is primarily mediated by latent-state trajectories rather than by explicit surface chain-of-thought outputs.
Reward Hacking in the Era of Large Models: Mechanisms, Emergent Misalignment, Challenges · cs.LG · 2026-04-15 · unverdicted · none · ref 50
The paper introduces the Proxy Compression Hypothesis as a unifying framework explaining reward hacking in RLHF as an emergent result of compressing high-dimensional human objectives into proxy reward signals under optimization pressure.
LLMs Should Not Yet Be Credited with Decision Explanation · cs.AI · 2026-05-01 · unverdicted · none · ref 35
LLMs support decision prediction and rationale generation but lack evidence for genuine decision explanation, requiring stricter standards to avoid over-crediting.
Knowledge Distillation Must Account for What It Loses · cs.LG · 2026-04-28 · unverdicted · none · ref 14 · 2 links
Knowledge distillation evaluations must report lost teacher capabilities via a Distillation Loss Statement rather than relying solely on task scores.
Risk Reporting for Developers' Internal AI Model Use · cs.CY · 2026-04-27 · unverdicted · none · ref 9
A harmonized risk reporting standard for internal frontier AI model use, structured around autonomous misbehavior and insider threats using means, motive, and opportunity factors.
