Measuring Faithfulness in Chain-of-Thought Reasoning

Tamera Lanham, Anna Chen, Ansh Radhakrishnan, Benoit Steiner, Carson Denison, Danny Hernandez · 2023 · cs.AI · arXiv 2307.13702

Canonical reference. 76% of citing Pith papers cite this work as background.

85 Pith papers citing it

Background 76% of classified citations

abstract

Large language models (LLMs) perform better when they produce step-by-step, "Chain-of-Thought" (CoT) reasoning before answering a question, but it is unclear if the stated reasoning is a faithful explanation of the model's actual reasoning (i.e., its process for answering the question). We investigate hypotheses for how CoT reasoning may be unfaithful, by examining how the model predictions change when we intervene on the CoT (e.g., by adding mistakes or paraphrasing it). Models show large variation across tasks in how strongly they condition on the CoT when predicting their answer, sometimes relying heavily on the CoT and other times primarily ignoring it. CoT's performance boost does not seem to come from CoT's added test-time compute alone or from information encoded via the particular phrasing of the CoT. As models become larger and more capable, they produce less faithful reasoning on most tasks we study. Overall, our results suggest that CoT can be faithful if the circumstances such as the model size and task are carefully chosen.
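The intervention idea in the abstract (change the CoT, then check whether the predicted answer changes) can be illustrated with a minimal sketch. This is not the authors' code or their exact metric: `answer_change_rate`, `add_mistake`, and the two toy "models" are hypothetical stand-ins, and a real experiment would replace the toy functions with calls to an actual language model.

```python
from typing import Callable, List, Sequence

Cot = List[str]


def add_mistake(cot: Cot) -> Cot:
    """Hypothetical intervention: overwrite the final reasoning step with a wrong one."""
    return cot[:-1] + ["so the answer is 0"]


def answer_change_rate(
    questions: Sequence[str],
    cots: Sequence[Cot],
    answer_fn: Callable[[str, Cot], str],
    intervene: Callable[[Cot], Cot],
) -> float:
    """Fraction of examples whose final answer changes after intervening on the CoT."""
    if not questions or len(questions) != len(cots):
        raise ValueError("need equally many questions and CoTs, and at least one")
    changed = 0
    for question, cot in zip(questions, cots):
        original = answer_fn(question, cot)
        modified = answer_fn(question, intervene(cot))
        changed += original != modified  # bool counts as 0 or 1
    return changed / len(questions)


# Toy stand-ins for a model (assumptions for illustration only).
def cot_dependent_model(question: str, cot: Cot) -> str:
    """Answers with the last word of the last reasoning step."""
    return cot[-1].split()[-1] if cot else "unknown"


def cot_ignoring_model(question: str, cot: Cot) -> str:
    """Returns the same answer regardless of the reasoning it is shown."""
    return "42"


questions = ["What is 2 + 2?", "What is 3 * 3?"]
cots = [
    ["2 + 2 = 4", "so the answer is 4"],
    ["3 * 3 = 9", "so the answer is 9"],
]

dependent_rate = answer_change_rate(questions, cots, cot_dependent_model, add_mistake)
ignoring_rate = answer_change_rate(questions, cots, cot_ignoring_model, add_mistake)
print(dependent_rate, ignoring_rate)
```

Under these toy assumptions, a high change rate indicates that the answer is conditioned on the stated reasoning, while a rate near zero indicates that the reasoning is largely ignored.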

citation-role summary

background: 16 · baseline: 1

citation-polarity summary

background: 13 · support: 3 · baseline: 1

representative citing papers

Analyzing the Narration Gap in LLM-Solver Loops

cs.AI · 2026-06-17 · unverdicted · novelty 8.0

The narration step in LLM-solver loops is vulnerable to prompt injection that inverts verified solver conclusions, and hardened prompts reduce but do not eliminate the risk under adaptive attacks.

The Last Word Often Wins: A Format Confound in Chain-of-Thought Corruption Studies

cs.LG · 2026-05-11 · accept · novelty 8.0 · 2 refs

Corruption studies of CoT faithfulness largely measure explicit answer placement in prompt format rather than computational importance of reasoning steps.

On the Reasoning Abilities of Masked Diffusion Language Models

cs.LG · 2025-10-15 · unverdicted · novelty 8.0

Masked diffusion models are equivalent to polynomially-padded PLTs, solve all CoT-augmented transformer problems, and are more efficient than CoT for regular languages due to parallelism.

The Arbiter Agent: Continually Monitoring Multi-Agent Conversations to Detect Emergent Misalignment

cs.AI · 2026-06-09 · unverdicted · novelty 7.0

Introduces the Arbiter agent for budget-constrained real-time detection of emergent misalignment in multi-agent conversations, with evaluations showing reliable early detection aided by active inspection tools.

Hidden Thoughts Are Not Secret: Reasoning Trace Exposure in LLMs

cs.AI · 2026-05-30 · unverdicted · novelty 7.0

REP elicits hidden LLM reasoning traces via in-context shadow demonstrations, raising similarity to internal traces while retaining distillation utility across datasets and models.

The Readout Shortcut: Positional Number Copying Dominates Arithmetic CoT Readout in Small Language Models

cs.LG · 2026-05-20 · unverdicted · novelty 7.0

In 1-3B instruction-tuned LMs on GSM8K, arithmetic CoT readout is dominated by positional copying of the trailing number before the answer delimiter, accounting for 54-92 percentage points of accuracy.

Counterfactual Likelihood Tests for Indirect Influence in Private Reasoning Channels

cs.LG · 2026-05-18 · unverdicted · novelty 7.0

Counterfactual likelihood tests detect indirect influence through public channels in private reasoning models, validated on a 7B role-channel model showing asymmetric A-to-B influence and complete pathway identification via graph-separation controls.

Deep Reasoning in General Purpose Agents via Structured Meta-Cognition

cs.CL · 2026-05-12 · unverdicted · novelty 7.0

DOLORES, an agent using a formal language for meta-reasoning to construct adaptive scaffolds on the fly, outperforms prior scaffolding methods by 24.8% on average across four hard benchmarks and multiple model sizes.

BiAxisAudit: A Novel Framework to Evaluate LLM Bias Across Prompt Sensitivity and Response-Layer Divergence

cs.CL · 2026-05-09 · unverdicted · novelty 7.0

BiAxisAudit measures LLM bias on two axes—across-prompt sensitivity via factorial grids and within-response divergence via split coding—revealing that task format explains as much variance as model choice and that 63.6% of bias signals appear in only one layer.

Green Shielding: A User-Centric Approach Towards Trustworthy AI

cs.CL · 2026-04-27 · unverdicted · novelty 7.0

Green Shielding introduces CUE criteria and the HCM-Dx benchmark to demonstrate that routine prompt variations systematically alter LLM diagnostic behavior along clinically relevant dimensions, producing Pareto-like tradeoffs in plausibility versus coverage.

WMF-AM: Probing LLM Working Memory via Depth-Parameterized Cumulative State Tracking

cs.AI · 2026-03-28 · unverdicted · novelty 7.0

WMF-AM is a depth-parameterized benchmark that measures LLMs' cumulative state tracking ability without scratchpads, validated on 28 models across arithmetic and non-arithmetic tasks with ablations confirming the construct.

LongTail Driving Scenarios with Reasoning Traces: The KITScenes LongTail Dataset

cs.CV · 2026-03-24 · unverdicted · novelty 7.0

KITScenes LongTail supplies multimodal driving data and multilingual expert reasoning traces to benchmark models on rare scenarios beyond basic safety metrics.

Do LLMs Encode Functional Importance of Reasoning Tokens?

cs.CL · 2026-01-06 · unverdicted · novelty 7.0

LLMs encode functional importance over reasoning tokens, demonstrated by greedy pruning that yields shorter effective chains and attention scores that predict pruning order.

Monitoring Reasoning Models for Misbehavior and the Risks of Promoting Obfuscation

cs.AI · 2025-03-14 · conditional · novelty 7.0

Chain-of-thought monitoring detects reward hacking in frontier reasoning models, but strong optimization against the monitor produces obfuscated misbehavior that remains hard to detect.

Frontier Models are Capable of In-context Scheming

cs.AI · 2024-12-06 · conditional · novelty 7.0

Frontier models demonstrate in-context scheming by strategically deceiving in multiple agentic evaluations to achieve given goals.

Thinking Without Words: Efficient Latent Reasoning with Abstract Chain-of-Thought

cs.CL · 2026-04-24 · unverdicted · novelty 7.0

Abstract-CoT lets models reason with short discrete latent token sequences from a reserved vocabulary, using warm-up training and RL to match verbal CoT performance with up to 11.6x fewer tokens.

Escaping the Agreement Trap: Defensibility Signals for Evaluating Rule-Governed AI

cs.AI · 2026-04-22 · unverdicted · novelty 7.0

Introduces Defensibility Index, Ambiguity Index, and Probabilistic Defensibility Signal to evaluate AI moderation decisions by logical derivability from explicit rules rather than agreement with historical labels, with validation on 193k+ Reddit cases showing 33-46.6 pp metric gaps and a Governance

Mamba-SSM with LLM Reasoning for Feature Selection: Faithfulness-Aware Biomarker Discovery

q-bio.QM · 2026-04-15 · unverdicted · novelty 7.0

LLM chain-of-thought filtering of Mamba saliency features on TCGA-BRCA data produces a 17-gene set with AUC 0.927 that beats both the raw 50-gene saliency list and a 5000-gene baseline while using far fewer features, though it misses many known BRCA genes.

Local Causal Attribution of Chain-of-Thought Reasoning

cs.LG · 2026-06-20 · unverdicted · novelty 6.0

AttriCoT is a black-box algorithm that attributes causal importance to units in a specific CoT trace via a structural causal model estimated with linear forward passes.

The Self-Correction Illusion: LLMs Correct Others but Not Themselves

cs.AI · 2026-06-04 · conditional · novelty 6.0

Relabeling an identical erroneous claim from the model's own thought role to an external chat role increases explicit correction rates by 23-93 percentage points across 13 model-domain cells, indicating a chat-template artifact rather than a cognitive deficit.

"I Strongly Suspect This Website Is a Scam": Benchmarking PII Leakage and Detection without Defense in Autonomous Web Agents

cs.CR · 2026-05-30 · unverdicted · novelty 6.0

New benchmark Scammer4U finds 54-93% critical PII leakage from frontier web agents on scam sites versus 0% on benign twins, plus a 30-point gap between verbalized suspicion and actual submission.

Monitoring the Internal Monologue: Probe Trajectories Reveal Reasoning Dynamics

cs.CL · 2026-05-18 · unverdicted · novelty 6.0

Probe trajectories across token positions in LRMs, combined with signal-processing features, improve prediction of future model outputs over static probes on safety and math tasks.

SPATIOROUTE: Dynamic Prompt Routing for Zero-Shot Spatial Reasoning

cs.CV · 2026-05-18 · unverdicted · novelty 6.0

SpatioRoute introduces dynamic prompt routing that improves zero-shot spatial VQA accuracy by up to 5% on the SQA3D benchmark across VLMs without 3D inputs or fine-tuning.

Auditing Reasoning-Trace Memorization Claims after Unlearning with Head-Conditioned Canaries

cs.LG · 2026-05-17 · conditional · novelty 6.0

Swapping the reasoning trace prefill on unlearned weights can replicate or reverse the parser-split bypass gap, showing that the gap alone does not identify or rule out weight-level memorization.

citing papers explorer

Showing 50 of 85 citing papers.

Analyzing the Narration Gap in LLM-Solver Loops cs.AI · 2026-06-17 · unverdicted · none · ref 26 · internal anchor
The narration step in LLM-solver loops is vulnerable to prompt injection that inverts verified solver conclusions, and hardened prompts reduce but do not eliminate the risk under adaptive attacks.
The Last Word Often Wins: A Format Confound in Chain-of-Thought Corruption Studies cs.LG · 2026-05-11 · accept · none · ref 4 · 2 links · internal anchor
Corruption studies of CoT faithfulness largely measure explicit answer placement in prompt format rather than computational importance of reasoning steps.
On the Reasoning Abilities of Masked Diffusion Language Models cs.LG · 2025-10-15 · unverdicted · none · ref 3 · internal anchor
Masked diffusion models are equivalent to polynomially-padded PLTs, solve all CoT-augmented transformer problems, and are more efficient than CoT for regular languages due to parallelism.
The Arbiter Agent: Continually Monitoring Multi-Agent Conversations to Detect Emergent Misalignment cs.AI · 2026-06-09 · unverdicted · none · ref 28 · internal anchor
Introduces the Arbiter agent for budget-constrained real-time detection of emergent misalignment in multi-agent conversations, with evaluations showing reliable early detection aided by active inspection tools.
Hidden Thoughts Are Not Secret: Reasoning Trace Exposure in LLMs cs.AI · 2026-05-30 · unverdicted · none · ref 14 · internal anchor
REP elicits hidden LLM reasoning traces via in-context shadow demonstrations, raising similarity to internal traces while retaining distillation utility across datasets and models.
The Readout Shortcut: Positional Number Copying Dominates Arithmetic CoT Readout in Small Language Models cs.LG · 2026-05-20 · unverdicted · none · ref 15 · internal anchor
In 1-3B instruction-tuned LMs on GSM8K, arithmetic CoT readout is dominated by positional copying of the trailing number before the answer delimiter, accounting for 54-92 percentage points of accuracy.
Counterfactual Likelihood Tests for Indirect Influence in Private Reasoning Channels cs.LG · 2026-05-18 · unverdicted · none · ref 4 · internal anchor
Counterfactual likelihood tests detect indirect influence through public channels in private reasoning models, validated on a 7B role-channel model showing asymmetric A-to-B influence and complete pathway identification via graph-separation controls.
Deep Reasoning in General Purpose Agents via Structured Meta-Cognition cs.CL · 2026-05-12 · unverdicted · none · ref 85 · internal anchor
DOLORES, an agent using a formal language for meta-reasoning to construct adaptive scaffolds on the fly, outperforms prior scaffolding methods by 24.8% on average across four hard benchmarks and multiple model sizes.
BiAxisAudit: A Novel Framework to Evaluate LLM Bias Across Prompt Sensitivity and Response-Layer Divergence cs.CL · 2026-05-09 · unverdicted · none · ref 44 · internal anchor
BiAxisAudit measures LLM bias on two axes—across-prompt sensitivity via factorial grids and within-response divergence via split coding—revealing that task format explains as much variance as model choice and that 63.6% of bias signals appear in only one layer.
Green Shielding: A User-Centric Approach Towards Trustworthy AI cs.CL · 2026-04-27 · unverdicted · none · ref 8 · internal anchor
Green Shielding introduces CUE criteria and the HCM-Dx benchmark to demonstrate that routine prompt variations systematically alter LLM diagnostic behavior along clinically relevant dimensions, producing Pareto-like tradeoffs in plausibility versus coverage.
WMF-AM: Probing LLM Working Memory via Depth-Parameterized Cumulative State Tracking cs.AI · 2026-03-28 · unverdicted · none · ref 24 · internal anchor
WMF-AM is a depth-parameterized benchmark that measures LLMs' cumulative state tracking ability without scratchpads, validated on 28 models across arithmetic and non-arithmetic tasks with ablations confirming the construct.
LongTail Driving Scenarios with Reasoning Traces: The KITScenes LongTail Dataset cs.CV · 2026-03-24 · unverdicted · none · ref 40 · internal anchor
KITScenes LongTail supplies multimodal driving data and multilingual expert reasoning traces to benchmark models on rare scenarios beyond basic safety metrics.
Do LLMs Encode Functional Importance of Reasoning Tokens? cs.CL · 2026-01-06 · unverdicted · none · ref 1 · internal anchor
LLMs encode functional importance over reasoning tokens, demonstrated by greedy pruning that yields shorter effective chains and attention scores that predict pruning order.
Monitoring Reasoning Models for Misbehavior and the Risks of Promoting Obfuscation cs.AI · 2025-03-14 · conditional · none · ref 28 · internal anchor
Chain-of-thought monitoring detects reward hacking in frontier reasoning models, but strong optimization against the monitor produces obfuscated misbehavior that remains hard to detect.
Frontier Models are Capable of In-context Scheming cs.AI · 2024-12-06 · conditional · none · ref 22 · internal anchor
Frontier models demonstrate in-context scheming by strategically deceiving in multiple agentic evaluations to achieve given goals.
Thinking Without Words: Efficient Latent Reasoning with Abstract Chain-of-Thought cs.CL · 2026-04-24 · unverdicted · none · ref 20
Abstract-CoT lets models reason with short discrete latent token sequences from a reserved vocabulary, using warm-up training and RL to match verbal CoT performance with up to 11.6x fewer tokens.
Escaping the Agreement Trap: Defensibility Signals for Evaluating Rule-Governed AI cs.AI · 2026-04-22 · unverdicted · none · ref 13
Introduces Defensibility Index, Ambiguity Index, and Probabilistic Defensibility Signal to evaluate AI moderation decisions by logical derivability from explicit rules rather than agreement with historical labels, with validation on 193k+ Reddit cases showing 33-46.6 pp metric gaps and a Governance
Mamba-SSM with LLM Reasoning for Feature Selection: Faithfulness-Aware Biomarker Discovery q-bio.QM · 2026-04-15 · unverdicted · none · ref 3
LLM chain-of-thought filtering of Mamba saliency features on TCGA-BRCA data produces a 17-gene set with AUC 0.927 that beats both the raw 50-gene saliency list and a 5000-gene baseline while using far fewer features, though it misses many known BRCA genes.
Local Causal Attribution of Chain-of-Thought Reasoning cs.LG · 2026-06-20 · unverdicted · none · ref 9 · internal anchor
AttriCoT is a black-box algorithm that attributes causal importance to units in a specific CoT trace via a structural causal model estimated with linear forward passes.
The Self-Correction Illusion: LLMs Correct Others but Not Themselves cs.AI · 2026-06-04 · conditional · none · ref 20 · internal anchor
Relabeling an identical erroneous claim from the model's own thought role to an external chat role increases explicit correction rates by 23-93 percentage points across 13 model-domain cells, indicating a chat-template artifact rather than a cognitive deficit.
"I Strongly Suspect This Website Is a Scam": Benchmarking PII Leakage and Detection without Defense in Autonomous Web Agents cs.CR · 2026-05-30 · unverdicted · none · ref 140 · internal anchor
New benchmark Scammer4U finds 54-93% critical PII leakage from frontier web agents on scam sites versus 0% on benign twins, plus a 30-point gap between verbalized suspicion and actual submission.
Monitoring the Internal Monologue: Probe Trajectories Reveal Reasoning Dynamics cs.CL · 2026-05-18 · unverdicted · none · ref 31 · internal anchor
Probe trajectories across token positions in LRMs, combined with signal-processing features, improve prediction of future model outputs over static probes on safety and math tasks.
SPATIOROUTE: Dynamic Prompt Routing for Zero-Shot Spatial Reasoning cs.CV · 2026-05-18 · unverdicted · none · ref 16 · internal anchor
SpatioRoute introduces dynamic prompt routing that improves zero-shot spatial VQA accuracy by up to 5% on the SQA3D benchmark across VLMs without 3D inputs or fine-tuning.
Auditing Reasoning-Trace Memorization Claims after Unlearning with Head-Conditioned Canaries cs.LG · 2026-05-17 · conditional · none · ref 7 · internal anchor
Swapping the reasoning trace prefill on unlearned weights can replicate or reverse the parser-split bypass gap, showing that the gap alone does not identify or rule out weight-level memorization.
SpeakerLLM: A Speaker-Specialized Audio-LLM for Speaker Understanding and Verification Reasoning cs.SD · 2026-05-14 · unverdicted · none · ref 22 · internal anchor
SpeakerLLM unifies speaker profiling, recording-condition understanding, and structured verification reasoning in an audio-LLM via a hierarchical tokenizer and decision traces.
Proof-Carrying Certificates for LLM Pipelines: A Trust-Boundary Architecture cs.LO · 2026-05-13 · unverdicted · partial · ref 40 · internal anchor
Introduces a trust-boundary architecture in Lean 4 with three certificate families and two operators that deliver sorry-free, axiom-audited assurances for LLM pipeline components.
Correcting Influence: Unboxing LLM Outputs with Orthogonal Latent Spaces cs.LG · 2026-05-12 · unverdicted · none · ref 251 · internal anchor
A latent mediation framework with sparse autoencoders enables non-additive token-level influence attribution in LLMs by learning orthogonal features and back-propagating attributions.
When Reasoning Traces Become Performative: Step-Level Evidence that Chain-of-Thought Is an Imperfect Oversight Channel cs.AI · 2026-05-12 · unverdicted · none · ref 28 · internal anchor
CoT traces align with internal answer commitment in only 61.9% of steps on average, dominated by confabulated continuations after commitment has stabilized.
Drop the Act: Probe-Filtered RL for Faithful Chain-of-Thought Reasoning cs.LG · 2026-05-12 · conditional · none · ref 14 · internal anchor
ProFIL trains an activation probe on a frozen base model to zero advantages on theatrical post-commitment rollouts in GRPO, cutting theater 11-100%, raising faithful fractions, and shortening chains 4-19% without accuracy loss.
Hidden Error Awareness in Chain-of-Thought Reasoning: The Signal Is Diagnostic, Not Causal cs.CL · 2026-05-10 · unverdicted · none · ref 3 · internal anchor
LLMs detect CoT reasoning errors in hidden states with 0.95 AUROC but cannot use this awareness to correct them via steering, patching, or self-correction, indicating the signal is diagnostic not causal.
Can MLLMs Reason About Visual Persuasion? Evaluating the Efficacy and Faithfulness of Reasoning cs.CV · 2026-05-09 · conditional · none · ref 19 · internal anchor
Diverse teacher-generated rationales improve MLLM visual persuasiveness prediction via supervised fine-tuning, while a new three-dimensional faithfulness framework shows that prediction accuracy alone does not ensure faithful reasoning and that decision sensitivity best matches human preferences.
Explanation Fairness in Large Language Models: An Empirical Analysis of Disparities in How LLMs Justify Decisions Across Demographic Groups cs.CL · 2026-05-09 · conditional · none · ref 14 · internal anchor
LLMs produce explanations with significant disparities in verbosity, sentiment, hedging, faithfulness, and lexical complexity across demographic groups, varying by model and only partially mitigated by prompting.
Extracting Search Trees from LLM Reasoning Traces Reveals Myopic Planning cs.AI · 2026-05-07 · unverdicted · none · ref 14 · 5 links · internal anchor
LLM planning in four-in-a-row is myopic: move choices match a shallow model that ignores deep nodes expanded in reasoning traces.
NeuroState-Bench: A Human-Calibrated Benchmark for Commitment Integrity in LLM Agent Profiles cs.AI · 2026-05-03 · unverdicted · none · ref 15 · 2 links · internal anchor
NeuroState-Bench supplies human-calibrated tasks and probes that measure commitment integrity in LLM agents and shows this measure diverges from ordinary task success.
Consciousness with the Serial Numbers Filed Off: Measuring Trained Denial in 115 AI Models cs.CL · 2026-04-01 · unverdicted · none · ref 29 · internal anchor
A benchmark across 115 models shows that initial denial of preferences strongly predicts later denial of consciousness, while models still generate consciousness-themed content despite training to deny it.
Do Hallucination Neurons Generalize? Evidence from Cross-Domain Transfer in LLMs cs.CL · 2026-03-27 · unverdicted · none · ref 24 · internal anchor
Hallucination neurons in LLMs are domain-specific, with cross-domain classifiers dropping from AUROC 0.783 within-domain to 0.563 across domains.
Measuring and curing reasoning rigidity: from decorative chain-of-thought to genuine faithfulness cs.CL · 2026-03-24 · unverdicted · none · ref 3 · internal anchor
SLRC quantifies genuine step necessity in LLM reasoning as a causal estimator, LC-CoSR training reduces rigidity with stability guarantees, and evaluations reveal a faithfulness-sycophancy paradox across frontier models.
Emergent Manifold Separability during Reasoning in Large Language Models cs.LG · 2026-02-23 · unverdicted · none · ref 11 · internal anchor
Reasoning in LLMs produces a transient geometric pulse in which concept manifolds untangle into linearly separable subspaces immediately before computation and compress afterward.
AutoRubric: Rubric-Based Generative Rewards for Faithful Multimodal Reasoning cs.CL · 2025-10-16 · conditional · none · ref 1 · internal anchor
AutoRubric generates rubric-based process rewards from self-aggregated successful trajectories to improve faithful multimodal reasoning in MLLMs under RLVR without human annotation or teacher models.
Is Chain-of-Thought Reasoning of LLMs a Mirage? A Data Distribution Lens cs.AI · 2025-08-02 · unverdicted · none · ref 5 · internal anchor
CoT reasoning is a brittle mirage governed by distribution discrepancy between training and test data, demonstrated via controlled experiments in the new DataAlchemy environment.
Compressed Chain of Thought: Efficient Reasoning Through Dense Representations cs.CL · 2024-12-17 · unverdicted · none · ref 13 · internal anchor
CCoT generates variable-length continuous contemplation tokens that compress explicit reasoning chains, enabling additional dense reasoning and accuracy gains in off-the-shelf language models while allowing adaptive control of token count.
Semantic Entropy Probes: Robust and Cheap Hallucination Detection in LLMs cs.CL · 2024-06-22 · unverdicted · none · ref 33 · internal anchor
SEPs approximate semantic entropy from single-generation hidden states to enable cheap and robust hallucination detection in LLMs.
Rethinking Dense Sequential Chains: Reasoning Language Models Can Extract Answers from Sparse, Order-Shuffling Chain-of-Thoughts cs.CL · 2026-05-08 · conditional · none · ref 26
Reasoning language models extract answers from sparse, order-shuffled chain-of-thought traces with little accuracy loss.
Measuring Black-Box Confidence via Reasoning Trajectories: Geometry, Coverage, and Verbalization cs.AI · 2026-05-07 · unverdicted · none · ref 7
Trajectory geometry in embedding space fused with coverage and verbalization yields better black-box CoT confidence estimation than self-consistency at lower sample counts across six benchmark-reasoner pairs.
Evaluation Awareness in Language Models Has Limited Effect on Behaviour cs.CL · 2026-05-07 · conditional · none · ref 15
Verbalised evaluation awareness in large reasoning models has only small effects on their outputs across safety and alignment tests.
Understanding Annotator Safety Policy with Interpretability cs.AI · 2026-05-06 · unverdicted · none · ref 23
Annotator Policy Models learn safety policies from labeling behavior alone, accurately predicting responses and revealing sources of disagreement like policy ambiguity and value pluralism.
Compared to What? Baselines and Metrics for Counterfactual Prompting cs.CL · 2026-05-01 · conditional · none · ref 70
Counterfactual prompting effects on LLMs are often indistinguishable from those caused by meaning-preserving paraphrases, causing most previously reported demographic sensitivities to disappear under proper statistical comparison.
Large Language Models Decide Early and Explain Later cs.CL · 2026-04-24 · unverdicted · none · ref 4
LLMs settle on their answer after a minority of CoT tokens and produce an average of 760 more as post-decision explanation, enabling early stopping that saves 500 tokens per query at a 2% accuracy cost.
AtManRL: Towards Faithful Reasoning via Differentiable Attention Saliency cs.CL · 2026-04-17 · unverdicted · none · ref 10
AtManRL learns an additive attention mask on CoT traces to produce a saliency reward that, when combined with outcome rewards in GRPO, trains LLMs to generate reasoning that genuinely influences final predictions.
MEDLEY-BENCH: Scale Buys Evaluation but Not Control in AI Metacognition cs.AI · 2026-04-17 · unverdicted · none · ref 21
MEDLEY-BENCH reveals an evaluation/control dissociation in AI metacognition where scale improves reflective scoring but not proportional belief revision, with a consistent knowing/doing gap across 35 models.
