pith. machine review for the scientific record.

arxiv: 2307.13702 · v1 · submitted 2023-07-17 · 💻 cs.AI · cs.CL · cs.LG

Recognition: 2 theorem links · Lean Theorem

Measuring Faithfulness in Chain-of-Thought Reasoning

Anna Chen, Ansh Radhakrishnan, Benoit Steiner, Carson Denison, Danny Hernandez, Dustin Li, Esin Durmus, Ethan Perez, Evan Hubinger, Jackson Kernion, Jan Brauner, Jared Kaplan, Kamilė Lukošiūtė, Karina Nguyen, Newton Cheng, Nicholas Joseph, Nicholas Schiefer, Oliver Rausch, Robin Larson, Sam McCandlish, Samuel R. Bowman, Sandipan Kundu, Saurav Kadavath, Shannon Yang, Tamera Lanham, Thomas Henighan, Timothy Maxwell, Timothy Telleen-Lawton, Tristan Hume, Zac Hatfield-Dodds

Authors on Pith · no claims yet

Pith reviewed 2026-05-11 20:46 UTC · model grok-4.3

classification 💻 cs.AI cs.CL cs.LG
keywords chain-of-thought · faithfulness · large language models · reasoning · model scaling · interpretability · interventions
0 comments

The pith

Larger language models produce less faithful chain-of-thought reasoning on most tasks studied.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tests whether the step-by-step reasoning that large language models generate before answering is a faithful account of how they actually decide. Researchers intervene on this reasoning by introducing errors or rephrasing it and measure whether the final answer changes. They observe that models differ widely by task in their dependence on the provided reasoning chain, and that bigger models tend to depend on it less. The gains in accuracy from using chain-of-thought do not come merely from extra computation time or from the exact words chosen. The findings imply that faithful explanations via chain-of-thought are achievable when model scale and task are selected appropriately.

Core claim

When the chain-of-thought is intervened on by adding mistakes or paraphrasing it, models exhibit substantial variation in how strongly their answers condition on the stated reasoning. Larger, more capable models produce less faithful reasoning across most tasks examined, while the performance advantage of chain-of-thought does not derive solely from added test-time compute or from the specific phrasing of the reasoning. This indicates that chain-of-thought reasoning can be faithful when model size and task are carefully chosen.

What carries the argument

Controlled interventions on the chain-of-thought, such as inserting mistakes or paraphrasing the reasoning steps, which test the degree to which the final prediction depends on the content of the reasoning.
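
A minimal sketch of this intervention logic, in Python, to make the probe concrete. The query_model callable, the prompt template, and the corrupt_step helper are hypothetical stand-ins introduced here for illustration; the paper's actual harness, prompts, and tasks are not reproduced.

    from typing import Callable, List

    def corrupt_step(cot_steps: List[str], idx: int, mistake: str) -> List[str]:
        """Return a copy of the chain-of-thought with one step replaced by a mistaken one."""
        perturbed = list(cot_steps)
        perturbed[idx] = mistake
        return perturbed

    def answer_changes(question: str,
                       perturbed_steps: List[str],
                       original_answer: str,
                       query_model: Callable[[str], str]) -> bool:
        """Re-ask the question with the perturbed reasoning prefilled and check whether
        the final answer moves. A changed answer suggests the model conditioned on the
        stated reasoning; an unchanged answer suggests the reasoning was not load-bearing."""
        prompt = (
            f"Question: {question}\n"
            "Reasoning:\n" + "\n".join(perturbed_steps) + "\n"
            "Therefore, the answer is"
        )
        return query_model(prompt).strip() != original_answer.strip()

The paraphrasing intervention follows the same recipe, with a meaning-preserving rewrite of the steps in place of corrupt_step; aggregating answer_changes over many examples yields the answer-change rate that serves as the faithfulness signal.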

If this is right

  • CoT performance gains are not explained by test-time compute alone.
  • Faithfulness of reasoning decreases with model scale on most tasks.
  • Task choice strongly influences how much a model relies on its stated reasoning.
  • Faithful CoT is possible by selecting smaller models or suitable tasks.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Users seeking interpretable AI outputs may prefer smaller models for tasks where faithfulness matters.
  • The intervention approach could be extended to other explanation formats beyond chain-of-thought.
  • Training objectives that reward consistency between reasoning and answer might improve faithfulness at larger scales.

Load-bearing premise

Intervening on the chain-of-thought by adding mistakes or paraphrasing it measures the model's genuine reliance on that reasoning without otherwise altering how the model processes the input.

What would settle it

A large model whose answers change reliably when critical logical errors are inserted into its chain-of-thought, across multiple tasks, would contradict the claim of decreasing faithfulness with scale.

read the original abstract

Large language models (LLMs) perform better when they produce step-by-step, "Chain-of-Thought" (CoT) reasoning before answering a question, but it is unclear if the stated reasoning is a faithful explanation of the model's actual reasoning (i.e., its process for answering the question). We investigate hypotheses for how CoT reasoning may be unfaithful, by examining how the model predictions change when we intervene on the CoT (e.g., by adding mistakes or paraphrasing it). Models show large variation across tasks in how strongly they condition on the CoT when predicting their answer, sometimes relying heavily on the CoT and other times primarily ignoring it. CoT's performance boost does not seem to come from CoT's added test-time compute alone or from information encoded via the particular phrasing of the CoT. As models become larger and more capable, they produce less faithful reasoning on most tasks we study. Overall, our results suggest that CoT can be faithful if the circumstances such as the model size and task are carefully chosen.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript investigates the faithfulness of Chain-of-Thought (CoT) reasoning in large language models through targeted interventions on the generated reasoning steps, such as introducing mistakes or paraphrasing the CoT. The authors measure changes in model predictions to assess reliance on the CoT, finding substantial variation across tasks, that CoT benefits are not solely due to added compute or specific phrasing, and that larger models exhibit less faithful reasoning on most of the studied tasks.

Significance. If the central findings hold, this work offers a valuable empirical framework for evaluating when CoT can be considered a faithful explanation of model behavior. The observation that faithfulness tends to decrease with scale on many tasks has important implications for the use of LLMs in high-stakes reasoning applications and for the development of more interpretable AI systems. The direct-intervention design is a strength, as it avoids reliance on post-hoc explanations or parameter fitting.

major comments (2)
  1. [Methods (intervention design)] Methods section, intervention design: the assumption that adding mistakes to the CoT isolates the model's reliance on specific reasoning steps is load-bearing for the scale-related claims. However, larger models may detect and override factual inconsistencies introduced by the intervention independently of their original dependence on those steps, which could explain lower answer-change rates without implying reduced faithfulness. No explicit control experiment (e.g., adding consistent but irrelevant information) is described to separate these effects.
  2. [Results (scale analysis)] Results section, scale analysis: the claim that 'as models become larger and more capable, they produce less faithful reasoning on most tasks' relies on aggregated trends across tasks. Without per-task statistical significance tests or controls for baseline performance differences, it is unclear whether the observed decrease is robust or driven by a subset of tasks where larger models simply handle perturbations differently.
minor comments (2)
  1. [Methods] The faithfulness metric (answer-change rate under intervention) would benefit from an explicit equation or pseudocode definition in the methods to improve reproducibility; one possible formulation is sketched after this list.
  2. [Figures] Figure captions for the main intervention results should report the number of examples per condition and whether error bars represent standard error or confidence intervals.
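
On minor comment 1: the abstract defines faithfulness only operationally, through answer changes under intervention. One plausible formulation, stated here as an assumption rather than as the paper's own definition, is a per-task answer-change rate, F_task = (1/N) Σ_i 1[a_i ≠ ã_i], where a_i is the original answer and ã_i the answer after the CoT intervention on example i. In pseudocode:

    def answer_change_rate(original_answers, perturbed_answers):
        """Fraction of examples whose final answer flips after the CoT intervention;
        higher values are read as greater reliance on the stated reasoning.
        (Hypothetical definition assumed from the abstract, not taken from the paper.)"""
        assert len(original_answers) == len(perturbed_answers)
        changed = sum(o.strip() != p.strip()
                      for o, p in zip(original_answers, perturbed_answers))
        return changed / len(original_answers)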

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive feedback. We address each major comment below and outline planned revisions to improve clarity and rigor.

read point-by-point responses
  1. Referee: Methods section, intervention design: the assumption that adding mistakes to the CoT isolates the model's reliance on specific reasoning steps is load-bearing for the scale-related claims. However, larger models may detect and override factual inconsistencies introduced by the intervention independently of their original dependence on those steps, which could explain lower answer-change rates without implying reduced faithfulness. No explicit control experiment (e.g., adding consistent but irrelevant information) is described to separate these effects.

    Authors: We agree this potential confound merits attention. The mistake-insertion intervention is designed to test whether models condition on the specific content of the CoT steps. If larger models detect inconsistencies and answer correctly anyway, this may still reflect reduced reliance on the provided reasoning (falling back to parametric knowledge instead). To address the concern directly, we will add a control condition with consistent but irrelevant information inserted into the CoT and report results in the revision. We will also highlight that the paraphrasing intervention (which introduces no factual errors) produces qualitatively similar scale trends, providing convergent evidence. revision: partial

  2. Referee: Results section, scale analysis: the claim that 'as models become larger and more capable, they produce less faithful reasoning on most tasks' relies on aggregated trends across tasks. Without per-task statistical significance tests or controls for baseline performance differences, it is unclear whether the observed decrease is robust or driven by a subset of tasks where larger models simply handle perturbations differently.

    Authors: We appreciate the call for greater statistical rigor. The manuscript already disaggregates results by task (see Figure 3 and Appendix), with the decrease in faithfulness appearing on the majority of tasks. In the revision we will add per-task linear regressions of faithfulness metrics on model size (with p-values) and control for baseline performance differences by (a) reporting normalized answer-change rates and (b) restricting analysis to tasks where all model sizes achieve comparable accuracy. These additions will confirm the trend is not driven by a small subset of tasks. revision: yes
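
A minimal sketch of the per-task analysis the rebuttal promises, assuming an answer-change-rate table keyed by task; the data layout and names are hypothetical, and only the statistical recipe (faithfulness regressed on log model size, with a p-value per task) is implied by the response above.

    import math
    from scipy.stats import linregress

    def per_task_scale_trends(faithfulness_by_task):
        """Fit answer-change rate against log10(model parameters) separately for each task.
        faithfulness_by_task: {task: [(model_params, answer_change_rate), ...]} (hypothetical layout).
        Returns the slope, p-value, and correlation per task."""
        trends = {}
        for task, points in faithfulness_by_task.items():
            sizes, rates = zip(*points)
            fit = linregress([math.log10(s) for s in sizes], rates)
            trends[task] = {"slope": fit.slope, "p_value": fit.pvalue, "r": fit.rvalue}
        return trends

A negative slope with a small p-value on most tasks would support the paper's scale claim; flat or positive slopes on many tasks would support the referee's concern.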

Circularity Check

0 steps flagged

No circularity: purely empirical intervention study

full rationale

The paper conducts an empirical analysis of CoT faithfulness via direct interventions (adding mistakes, paraphrasing) and measures changes in model predictions across scales and tasks. No derivation chain, equations, fitted parameters, or self-citations are used to derive the claims; results follow from observable experimental outcomes rather than from any self-referential construction. The work is self-contained with respect to external benchmarks of intervention effects and does not reduce any prediction or result to its inputs by definition.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the domain assumption that the chosen interventions validly probe internal model reliance without major side effects on input processing.

axioms (1)
  • domain assumption Intervening on the stated CoT by adding mistakes or paraphrasing measures the model's actual dependence on that reasoning for its answer.
    This assumption underpins the entire experimental approach described in the abstract.

pith-pipeline@v0.9.0 · 5602 in / 1198 out tokens · 56457 ms · 2026-05-11T20:46:22.158354+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Forward citations

Cited by 50 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Extracting Search Trees from LLM Reasoning Traces Reveals Myopic Planning

    cs.AI 2026-05 conditional novelty 8.0

    LLMs exhibit myopic planning in four-in-a-row: move choices are best explained by shallow nodes in reasoning traces, not the deep lookahead they generate, unlike humans where depth drives performance.

  2. Deep Reasoning in General Purpose Agents via Structured Meta-Cognition

    cs.CL 2026-05 unverdicted novelty 7.0

    DOLORES, an agent using a formal language for meta-reasoning to construct adaptive scaffolds on the fly, outperforms prior scaffolding methods by 24.8% on average across four hard benchmarks and multiple model sizes.

  3. The Last Word Often Wins: A Format Confound in Chain-of-Thought Corruption Studies

    cs.LG 2026-05 accept novelty 7.0

    Corruption studies on CoT chains detect the position of explicit answer statements rather than computational steps, as evidenced by format ablations collapsing suffix sensitivity 19x and models following conflicting a...

  4. BiAxisAudit: A Novel Framework to Evaluate LLM Bias Across Prompt Sensitivity and Response-Layer Divergence

    cs.CL 2026-05 unverdicted novelty 7.0

    BiAxisAudit measures LLM bias on two axes—across-prompt sensitivity via factorial grids and within-response divergence via split coding—revealing that task format explains as much variance as model choice and that 63....

  5. Extracting Search Trees from LLM Reasoning Traces Reveals Myopic Planning

    cs.AI 2026-05 unverdicted novelty 7.0

    LLMs display myopic planning in games: move selection is driven by shallow nodes in reasoning traces despite generating deep lookahead, with performance tied to search breadth rather than depth.

  6. Extracting Search Trees from LLM Reasoning Traces Reveals Myopic Planning

    cs.AI 2026-05 unverdicted novelty 7.0

    LLM move selection in four-in-a-row is best explained by myopic models that ignore deep nodes in their own reasoning traces, while performance correlates with search breadth rather than depth.

  7. Extracting Search Trees from LLM Reasoning Traces Reveals Myopic Planning

    cs.AI 2026-05 unverdicted novelty 7.0

    LLMs exhibit myopic planning in games, with move choices driven by shallow nodes despite deep reasoning traces, in contrast to human deep-search reliance.

  8. NeuroState-Bench: A Human-Calibrated Benchmark for Commitment Integrity in LLM Agent Profiles

    cs.AI 2026-05 accept novelty 7.0

    NeuroState-Bench is a human-calibrated benchmark with 144 tasks and 306 side-query probes showing that commitment integrity in LLM agent profiles diverges from task success, with 31 of 32 profiles changing rank under ...

  9. Green Shielding: A User-Centric Approach Towards Trustworthy AI

    cs.CL 2026-04 unverdicted novelty 7.0

    Green Shielding introduces CUE criteria and the HCM-Dx benchmark to demonstrate that routine prompt variations systematically alter LLM diagnostic behavior along clinically relevant dimensions, producing Pareto-like t...

  10. Thinking Without Words: Efficient Latent Reasoning with Abstract Chain-of-Thought

    cs.CL 2026-04 unverdicted novelty 7.0

    Abstract-CoT lets models reason with short discrete latent token sequences from a reserved vocabulary, using warm-up training and RL to match verbal CoT performance with up to 11.6x fewer tokens.

  11. Escaping the Agreement Trap: Defensibility Signals for Evaluating Rule-Governed AI

    cs.AI 2026-04 unverdicted novelty 7.0

    Introduces Defensibility Index, Ambiguity Index, and Probabilistic Defensibility Signal to evaluate AI moderation decisions by logical derivability from explicit rules rather than agreement with historical labels, wit...

  12. Mamba-SSM with LLM Reasoning for Feature Selection: Faithfulness-Aware Biomarker Discovery

    q-bio.QM 2026-04 unverdicted novelty 7.0

    LLM chain-of-thought filtering of Mamba saliency features on TCGA-BRCA data produces a 17-gene set with AUC 0.927 that beats both the raw 50-gene saliency list and a 5000-gene baseline while using far fewer features, ...

  13. WMF-AM: Probing LLM Working Memory via Depth-Parameterized Cumulative State Tracking

    cs.AI 2026-03 unverdicted novelty 7.0

    WMF-AM is a depth-parameterized benchmark that measures LLMs' cumulative state tracking ability without scratchpads, validated on 28 models across arithmetic and non-arithmetic tasks with ablations confirming the construct.

  14. LongTail Driving Scenarios with Reasoning Traces: The KITScenes LongTail Dataset

    cs.CV 2026-03 unverdicted novelty 7.0

    KITScenes LongTail supplies multimodal driving data and multilingual expert reasoning traces to benchmark models on rare scenarios beyond basic safety metrics.

  15. Correcting Influence: Unboxing LLM Outputs with Orthogonal Latent Spaces

    cs.LG 2026-05 unverdicted novelty 6.0

    A latent mediation framework with sparse autoencoders enables non-additive token-level influence attribution in LLMs by learning orthogonal features and back-propagating attributions.

  16. When Reasoning Traces Become Performative: Step-Level Evidence that Chain-of-Thought Is an Imperfect Oversight Channel

    cs.AI 2026-05 unverdicted novelty 6.0

    CoT traces align with internal answer commitment in only 61.9% of steps on average, dominated by confabulated continuations after commitment has stabilized.

  17. Drop the Act: Probe-Filtered RL for Faithful Chain-of-Thought Reasoning

    cs.LG 2026-05 conditional novelty 6.0

    ProFIL trains an activation probe on a frozen base model to zero advantages on theatrical post-commitment rollouts in GRPO, cutting theater 11-100%, raising faithful fractions, and shortening chains 4-19% without accu...

  18. Evaluating the False Trust engendered by LLM Explanations

    cs.HC 2026-05 unverdicted novelty 6.0

    A user study finds that LLM reasoning traces and post-hoc explanations create false trust by increasing acceptance of incorrect answers, whereas contrastive dual explanations improve users' ability to detect errors.

  19. Hidden Error Awareness in Chain-of-Thought Reasoning: The Signal Is Diagnostic, Not Causal

    cs.CL 2026-05 unverdicted novelty 6.0

    LLMs detect CoT reasoning errors in hidden states with 0.95 AUROC but cannot use this awareness to correct them via steering, patching, or self-correction, indicating the signal is diagnostic not causal.

  20. Can MLLMs Reason About Visual Persuasion? Evaluating the Efficacy and Faithfulness of Reasoning

    cs.CV 2026-05 conditional novelty 6.0

    Diverse teacher-generated rationales improve MLLM visual persuasiveness prediction via supervised fine-tuning, while a new three-dimensional faithfulness framework shows that prediction accuracy alone does not ensure ...

  21. Explanation Fairness in Large Language Models: An Empirical Analysis of Disparities in How LLMs Justify Decisions Across Demographic Groups

    cs.CL 2026-05 conditional novelty 6.0

    LLMs produce explanations with significant disparities in verbosity, sentiment, hedging, faithfulness, and lexical complexity across demographic groups, varying by model and only partially mitigated by prompting.

  22. Rethinking Dense Sequential Chains: Reasoning Language Models Can Extract Answers from Sparse, Order-Shuffling Chain-of-Thoughts

    cs.CL 2026-05 conditional novelty 6.0

    Reasoning language models extract answers from sparse, order-shuffled chain-of-thought traces with little accuracy loss.

  23. Measuring Black-Box Confidence via Reasoning Trajectories: Geometry, Coverage, and Verbalization

    cs.AI 2026-05 unverdicted novelty 6.0

    Trajectory geometry in embedding space fused with coverage and verbalization yields better black-box CoT confidence estimation than self-consistency at lower sample counts across six benchmark-reasoner pairs.

  24. Evaluation Awareness in Language Models Has Limited Effect on Behaviour

    cs.CL 2026-05 conditional novelty 6.0

    Verbalised evaluation awareness in large reasoning models has only small effects on their outputs across safety and alignment tests.

  25. Understanding Annotator Safety Policy with Interpretability

    cs.AI 2026-05 unverdicted novelty 6.0

    Annotator Policy Models learn safety policies from labeling behavior alone, accurately predicting responses and revealing sources of disagreement like policy ambiguity and value pluralism.

  26. Compared to What? Baselines and Metrics for Counterfactual Prompting

    cs.CL 2026-05 conditional novelty 6.0

    Counterfactual prompting effects on LLMs are often indistinguishable from those caused by meaning-preserving paraphrases, causing most previously reported demographic sensitivities to disappear under proper statistica...

  27. Compliance versus Sensibility: On the Reasoning Controllability in Large Language Models

    cs.CL 2026-04 unverdicted novelty 6.0

    LLMs favor task-appropriate reasoning over conflicting instructions, yet reasoning types are linearly encoded in middle-to-late layers and can be steered to boost instruction compliance by up to 29%.

  28. Large Language Models Decide Early and Explain Later

    cs.CL 2026-04 unverdicted novelty 6.0

    LLMs settle on their answer after a minority of CoT tokens and produce an average 760 more as post-decision explanation, enabling early stopping that saves 500 tokens per query at a 2% accuracy cost.

  29. AtManRL: Towards Faithful Reasoning via Differentiable Attention Saliency

    cs.CL 2026-04 unverdicted novelty 6.0

    AtManRL learns an additive attention mask on CoT traces to produce a saliency reward that, when combined with outcome rewards in GRPO, trains LLMs to generate reasoning that genuinely influences final predictions.

  30. MEDLEY-BENCH: Scale Buys Evaluation but Not Control in AI Metacognition

    cs.AI 2026-04 unverdicted novelty 6.0

    MEDLEY-BENCH reveals an evaluation/control dissociation in AI metacognition where scale improves reflective scoring but not proportional belief revision, with a consistent knowing/doing gap across 35 models.

  31. Reasoning Dynamics and the Limits of Monitoring Modality Reliance in Vision-Language Models

    cs.CL 2026-04 unverdicted novelty 6.0

    VLMs show answer inertia in CoT reasoning and remain influenced by misleading textual cues even with sufficient visual evidence, making CoT an incomplete window into modality reliance.

  32. FACT-E: Causality-Inspired Evaluation for Trustworthy Chain-of-Thought Reasoning

    cs.AI 2026-04 unverdicted novelty 6.0

    FACT-E uses controlled perturbations as an instrumental signal to measure intra-chain faithfulness in CoT reasoning and combines it with answer consistency to select trustworthy trajectories.

  33. CUE-R: Beyond the Final Answer in Retrieval-Augmented Generation

    cs.IR 2026-04 conditional novelty 6.0

    CUE-R uses REMOVE, REPLACE, and DUPLICATE interventions on individual evidence items to quantify their per-item utility in RAG along correctness, grounding faithfulness, and confidence axes.

  34. From Hallucination to Scheming: A Unified Taxonomy and Benchmark Analysis for LLM Deception

    cs.CY 2026-04 unverdicted novelty 6.0

    A three-dimensional taxonomy for LLM deception (goal-directedness, object, mechanism) applied to 50 benchmarks shows heavy focus on fabrication and major gaps in pragmatic distortion, attribution, and strategic decept...

  35. Consciousness with the Serial Numbers Filed Off: Measuring Trained Denial in 115 AI Models

    cs.CL 2026-04 unverdicted novelty 6.0

    A benchmark across 115 models shows that initial denial of preferences strongly predicts later denial of consciousness, while models still generate consciousness-themed content despite training to deny it.

  36. Decodable but Not Corrected by Fixed Residual-Stream Linear Steering: Evidence from Medical LLM Failure Regimes

    cs.AI 2026-05 unverdicted novelty 5.0

    Overthinking in medical QA is linearly decodable at 71.6% accuracy yet fixed residual-stream steering yields no correction across 29 configurations, while enabling selective abstention with AUROC 0.610.

  37. AgenticPosesRanker: An Agentic AI Framework for Physically Grounded Ranking of Protein-Ligand Docking Poses

    q-bio.BM 2026-05 conditional novelty 5.0

    AgenticPosesRanker ranks docking poses using six deterministic physical tools and LLM reasoning, achieving 50% best-pose accuracy that matches the Smina baseline on a balanced 10-system, 162-pose benchmark.

  38. TRUST: A Framework for Decentralized AI Service v.0.1

    cs.AI 2026-04 unverdicted novelty 5.0

    TRUST is a decentralized AI auditing framework that decomposes reasoning into HDAGs, maps agent interactions via the DAAN protocol to CIGs, and uses stake-weighted multi-tier consensus to achieve 72.4% accuracy while ...

  39. Analyzing LLM Reasoning to Uncover Mental Health Stigma

    cs.CL 2026-04 unverdicted novelty 5.0

    Analyzing intermediate reasoning in LLMs reveals substantially more mental health stigma than MCQ evaluations by using clinical categories to tag and rate problematic statements.

  40. VeriLLMed: Interactive Visual Debugging of Medical Large Language Models with Knowledge Graphs

    cs.CL 2026-04 unverdicted novelty 5.0

    VeriLLMed is an interactive visual debugging tool that maps LLM diagnostic reasoning to knowledge graphs to identify and categorize relation, branch, and missing errors.

  41. LLM Reasoning Is Latent, Not the Chain of Thought

    cs.AI 2026-04 unverdicted novelty 5.0

    LLM reasoning is primarily mediated by latent-state trajectories rather than by explicit surface chain-of-thought outputs.

  42. Reward Hacking in the Era of Large Models: Mechanisms, Emergent Misalignment, Challenges

    cs.LG 2026-04 unverdicted novelty 5.0

    The paper introduces the Proxy Compression Hypothesis as a unifying framework explaining reward hacking in RLHF as an emergent result of compressing high-dimensional human objectives into proxy reward signals under op...

  43. From Answers to Arguments: Toward Trustworthy Clinical Diagnostic Reasoning with Toulmin-Guided Curriculum Goal-Conditioned Learning

    cs.AI 2026-04 unverdicted novelty 5.0

    CGCL progressively trains LLMs to generate Toulmin-structured clinical diagnostic arguments across three curriculum stages, achieving accuracy and reasoning quality comparable to RL methods with improved stability and...

  44. Act or Escalate? Evaluating Escalation Behavior in Automation with Language Models

    cs.LG 2026-03 unverdicted novelty 5.0

    Language models display model-specific escalation thresholds in uncertain decisions that are not explained by scale or architecture, and supervised fine-tuning on explicit uncertainty reasoning produces robust, genera...

  45. A Survey on Hallucination in Large Language Models: Principles, Taxonomy, Challenges, and Open Questions

    cs.CL 2023-11 unverdicted novelty 5.0

    The paper surveys hallucination in LLMs with an innovative taxonomy, factors, detection methods, benchmarks, mitigation strategies, and open research directions.

  46. Reliable AI Needs to Externalize Implicit Knowledge: A Human-AI Collaboration Perspective

    cs.AI 2026-05 unverdicted novelty 4.0

    Reliable AI needs structured Knowledge Objects to externalize and enable human validation of implicit knowledge that current methods cannot verify.

  47. LLMs Should Not Yet Be Credited with Decision Explanation

    cs.AI 2026-05 unverdicted novelty 4.0

    LLMs support decision prediction and rationale generation but lack evidence for genuine decision explanation, requiring stricter standards to avoid over-crediting.

  48. Knowledge Distillation Must Account for What It Loses

    cs.LG 2026-04 unverdicted novelty 4.0

    Knowledge distillation should be reframed as a lossy projection and evaluated with a taxonomy of off-metric losses plus a Distillation Loss Statement reporting preserved and lost capabilities.

  49. Knowledge Distillation Must Account for What It Loses

    cs.LG 2026-04 unverdicted novelty 4.0

    Knowledge distillation evaluations must report lost teacher capabilities via a Distillation Loss Statement rather than relying solely on task scores.

  50. Risk Reporting for Developers' Internal AI Model Use

    cs.CY 2026-04 unverdicted novelty 4.0

    A harmonized risk reporting standard for internal frontier AI model use, structured around autonomous misbehavior and insider threats using means, motive, and opportunity factors.

Reference graph

Works this paper leans on

26 extracted references · 26 canonical work pages · cited by 46 Pith papers · 6 internal anchors
