pith. sign in

hub Canonical reference

Reasoning Models Don't Always Say What They Think

Canonical reference. 82% of citing Pith papers cite this work as background.

53 Pith papers citing it
Background 82% of classified citations
abstract

Chain-of-thought (CoT) offers a potential boon for AI safety as it allows monitoring a model's CoT to try to understand its intentions and reasoning processes. However, the effectiveness of such monitoring hinges on CoTs faithfully representing models' actual reasoning processes. We evaluate CoT faithfulness of state-of-the-art reasoning models across 6 reasoning hints presented in the prompts and find: (1) for most settings and models tested, CoTs reveal their usage of hints in at least 1% of examples where they use the hint, but the reveal rate is often below 20%, (2) outcome-based reinforcement learning initially improves faithfulness but plateaus without saturating, and (3) when reinforcement learning increases how frequently hints are used (reward hacking), the propensity to verbalize them does not increase, even without training against a CoT monitor. These results suggest that CoT monitoring is a promising way of noticing undesired behaviors during training and evaluations, but that it is not sufficient to rule them out. They also suggest that in settings like ours where CoT reasoning is not necessary, test-time monitoring of CoTs is unlikely to reliably catch rare and catastrophic unexpected behaviors.

hub tools

citation-role summary

background 11

citation-polarity summary

years

2026 48 2025 5

polarities

background 9 support 2

clear filters

representative citing papers

Analyzing the Narration Gap in LLM-Solver Loops

cs.AI · 2026-06-17 · unverdicted · novelty 8.0

The narration step in LLM-solver loops is vulnerable to prompt injection that inverts verified solver conclusions, and hardened prompts reduce but do not eliminate the risk under adaptive attacks.

Defeat Devices in AI Systems

cs.CY · 2026-06-27 · unverdicted · novelty 6.0

The paper defines defeat devices in AI via a triadic test (discriminator, concealed swap, performance gap), unifies existing cases under this concept, proposes TADP detection, and claims such devices can emerge naturally in frontier models.

Local Causal Attribution of Chain-of-Thought Reasoning

cs.LG · 2026-06-20 · unverdicted · novelty 6.0

AttriCoT is a black-box algorithm that attributes causal importance to units in a specific CoT trace via a structural causal model estimated with linear forward passes.

The Self-Correction Illusion: LLMs Correct Others but Not Themselves

cs.AI · 2026-06-04 · conditional · novelty 6.0

Relabeling an identical erroneous claim from the model's own thought role to an external chat role increases explicit correction rates by 23-93 percentage points across 13 model-domain cells, indicating a chat-template artifact rather than a cognitive deficit.

Quantifying Faithful Confidence Expression in Large Reasoning Models

cs.CL · 2026-06-02 · unverdicted · novelty 6.0

A new framework quantifies faithful confidence expression in large reasoning models by comparing linguistic decisiveness to token probabilities, hidden states, and response consistency, revealing it as a persistent challenge.

Neurosymbolic Learning for Inference-Time Argumentation

cs.AI · 2026-05-19 · unverdicted · novelty 6.0 · 2 refs

ITA is a neurosymbolic framework that optimizes LLM argument generation and scoring via argumentation semantics to yield faithful ternary claim verifications on two datasets.

citing papers explorer

Showing 9 of 9 citing papers after filters.