Pith · machine review for the scientific record

arXiv: 2512.23032 · v2 · submitted 2025-12-28 · 💻 cs.CL · cs.AI · cs.LG

Recognition: unknown

Is Chain-of-Thought Really Not Explainability? Chain-of-Thought Can Be Faithful without Hint Verbalization

Authors on Pith: no claims yet
classification 💻 cs.CL · cs.AI · cs.LG
keywords: faithful · hint · metric · unfaithfulness · biasing · causal · chain-of-thought · CoTs
original abstract

Recent work, using the Biasing Features metric, labels a CoT as unfaithful if it omits a prompt-injected hint that affected the prediction. We argue this metric adopts a narrow notion of faithfulness and confuses unfaithfulness with incompleteness, the lossy compression needed to turn distributed transformer computation into a linear natural language narrative. On multi-hop reasoning tasks with instruct-tuned and reasoning models, many CoTs flagged as unfaithful by Biasing Features are judged faithful by other metrics, exceeding 50% in some models. With a new faithful@k metric, we show that larger inference-time budgets greatly increase hint verbalization (up to 90% in some settings), suggesting much apparent unfaithfulness is due to tight token limits. Using Causal Mediation Analysis, we further show that even non-verbalized hints can causally mediate prediction changes through the CoT. We therefore caution against relying solely on hint-based evaluations and advocate a broader interpretability toolkit, including causal mediation and corruption-based metrics. We do not claim all CoTs are faithful, only that the absence of hint words alone does not prove unfaithfulness.

This paper has not been read by Pith yet.

discussion (0)


Forward citations

Cited by 3 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Refunded but Rewarded: The Double Dip Attack on Cashback Reward Engines

    cs.CR 2026-04 accept novelty 7.0

    Cashback reward engines allow double-dipping on rewards after refunds due to missing adjustments or timing gaps, as demonstrated by experiments on six real issuers.

  2. Compared to What? Baselines and Metrics for Counterfactual Prompting

    cs.CL 2026-05 conditional novelty 6.0

    Counterfactual prompting effects on LLMs are often indistinguishable from those caused by meaning-preserving paraphrases, causing most previously reported demographic sensitivities to disappear under proper statistica...

  3. LLM Reasoning Is Latent, Not the Chain of Thought

    cs.AI 2026-04 unverdicted novelty 5.0

    LLM reasoning is primarily mediated by latent-state trajectories rather than by explicit surface chain-of-thought outputs.