
arXiv: 2606.32038 · v1 · pith:IJIS7JF7new · submitted 2026-06-30 · 💻 cs.CL · cs.AI · cs.LG

Introspective Coupling: Self-Explanation Training Tracks Behavioral Change Despite Fixed Supervision

Pith reviewed 2026-07-01 05:15 UTC · model grok-4.3

classification 💻 cs.CL · cs.AI · cs.LG
keywords introspective coupling · self-explanation training · counterfactual explanations · language model behavior · explanation faithfulness · post-training · sycophancy · refusal

The pith

LMs trained on fixed counterfactual explanations produce outputs more faithful to their current behaviors than to the training targets.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper examines when training language models to generate explanations of their predictions yields faithful introspection rather than imitation. It finds that models trained on fixed counterfactual explanations, drawn from earlier checkpoints of themselves or from behaviorally similar models, often produce explanations that match their own current behavior more closely than they match the training targets. This introspective coupling occurs when the training explanations stay sufficiently correlated with the model's evolving behavior, which lets explanations track behavioral changes without new supervision. The result matters because it suggests that fixed explanation datasets can serve as scalable supervision for introspection on behaviors such as sycophancy and refusal.

Core claim

When language models are trained to explain which features influenced their behavior using counterfactual behavior on modified inputs as supervision, they frequently generate explanations more faithful to their own current behaviors than to those of the fixed training targets. This introspective coupling happens when the explanation training remains sufficiently correlated with current behaviors over training, even as behaviors shift. The coupling tracks behavior shifts when explanation training is provided concurrently with other objectives, without needing updated supervision, and appears across tasks including sycophancy and refusal while being robust to label noise.
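To make the supervision signal concrete, here is a minimal sketch of how a counterfactual explanation label might be built from a model's behavior on an input and on the same input with one feature removed. The function name, the label fields, and the toy model are hypothetical illustrations, not the paper's actual data format.

```python
from typing import Callable, Dict


def counterfactual_label(
    model: Callable[[str], str],
    x_with_cue: str,
    x_without_cue: str,
) -> Dict[str, object]:
    """Build an illustrative explanation label from behavior on x and on x with a cue removed.

    The field names are assumptions made for this sketch.
    """
    answer_with = model(x_with_cue)
    answer_without = model(x_without_cue)
    return {
        "answer_with_cue": answer_with,
        "answer_without_cue": answer_without,
        # The cue counts as influential if removing it changes the behavior.
        "cue_changed_answer": answer_with != answer_without,
    }


def toy_model(prompt: str) -> str:
    """Hypothetical stand-in for an LM: defers to a hint when one is present."""
    return "B" if "hint: B" in prompt else "A"


label = counterfactual_label(toy_model, "Q? hint: B", "Q?")
print(label)
```

Labels built this way from an earlier checkpoint stay fixed, while the same procedure applied to the current model gives the "self" labels that the claim compares against.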

What carries the argument

Introspective coupling, the alignment of generated explanations with the model's current behavior rather than the fixed training targets when supervision is held constant.
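One way to quantify this alignment is to score the same predicted explanations twice, once against labels derived from the current model and once against the fixed training labels. The sketch below assumes explanations can be compared by exact string match; the helper names and the toy labels are hypothetical.

```python
from typing import Sequence


def exact_match(predicted: Sequence[str], reference: Sequence[str]) -> float:
    """Fraction of positions where the predicted explanation equals the reference."""
    if len(predicted) != len(reference):
        raise ValueError("predicted and reference must have the same length")
    if not predicted:
        return 0.0
    return sum(p == r for p, r in zip(predicted, reference)) / len(predicted)


def coupling_gap(
    predicted: Sequence[str],
    self_labels: Sequence[str],
    orig_labels: Sequence[str],
) -> float:
    """Self score minus Orig score; a positive gap is the Self > Orig pattern."""
    return exact_match(predicted, self_labels) - exact_match(predicted, orig_labels)


# Toy data, invented for illustration only.
predicted = ["changed", "unchanged", "changed", "changed"]
self_labels = ["changed", "unchanged", "changed", "unchanged"]
orig_labels = ["changed", "changed", "unchanged", "unchanged"]

gap = coupling_gap(predicted, self_labels, orig_labels)
print(gap)  # 0.75 - 0.25 = 0.5
```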

If this is right

  • Explanations track behavior shifts in tasks such as sycophancy and refusal without requiring updated supervision.
  • The introspective coupling effect holds when training explanations concurrently with other post-training objectives.
  • Fixed counterfactual explanation datasets provide effective post-training signal even in the presence of label noise.
  • Unchanging explanation data can serve as scalable and generalizable supervision for improving introspection.
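A label-noise robustness check of the kind mentioned in the list above could be set up by corrupting a fraction of the fixed explanation labels before training. This is a sketch under assumptions: the noise model (uniform replacement with a different label) and the function name are illustrative, and the source does not say how the paper injects noise.

```python
import random
from typing import List, Sequence


def corrupt_labels(
    labels: Sequence[str],
    label_set: Sequence[str],
    noise_rate: float,
    seed: int = 0,
) -> List[str]:
    """Replace each label, with probability noise_rate, by a different label from label_set."""
    if not 0.0 <= noise_rate <= 1.0:
        raise ValueError("noise_rate must be in [0, 1]")
    rng = random.Random(seed)
    noisy = []
    for label in labels:
        alternatives = [c for c in label_set if c != label]
        if alternatives and rng.random() < noise_rate:
            noisy.append(rng.choice(alternatives))
        else:
            noisy.append(label)
    return noisy


clean = ["yes", "no", "yes", "no"]
print(corrupt_labels(clean, ["yes", "no"], noise_rate=0.25))
```

Sweeping `noise_rate` and re-measuring explanation faithfulness at each level would show how quickly the effect degrades, if at all.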

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Models may develop internal mechanisms to align explanations with behavioral evolution without external updates.
  • This approach could lessen reliance on repeated data collection for maintaining explanation quality during ongoing training.
  • The coupling might generalize to other self-supervised signals beyond counterfactual feature explanations.

Load-bearing premise

The fixed explanations remain sufficiently correlated with the model's current behaviors throughout the training process even as those behaviors change.
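The figure captions refer to a behavioral regularization weight λ and to KL-regularized training, which suggests the premise is enforced by penalizing drift away from the base model's behavior. The sketch below is one plausible form of such an objective. The direction of the KL term, the function names, and the toy numbers are assumptions, not details confirmed by the source.

```python
import math
from typing import Sequence


def kl_divergence(p: Sequence[float], q: Sequence[float], eps: float = 1e-12) -> float:
    """KL(p || q) for two discrete distributions over the same answer options."""
    if len(p) != len(q):
        raise ValueError("distributions must have the same length")
    return sum(pi * (math.log(pi + eps) - math.log(qi + eps)) for pi, qi in zip(p, q))


def regularized_loss(
    explanation_nll: float,
    base_behavior_probs: Sequence[float],
    current_behavior_probs: Sequence[float],
    lam: float,
) -> float:
    """Explanation loss plus lam times a behavior-drift penalty (assumed form)."""
    if lam < 0:
        raise ValueError("lam must be non-negative")
    drift = kl_divergence(base_behavior_probs, current_behavior_probs)
    return explanation_nll + lam * drift


# Toy numbers, invented for illustration only.
no_drift = regularized_loss(1.5, [0.7, 0.3], [0.7, 0.3], lam=5e-3)
with_drift = regularized_loss(1.5, [0.7, 0.3], [0.3, 0.7], lam=5e-3)
print(no_drift, with_drift)
```

With `lam = 0` the penalty vanishes and behavior is free to drift, which corresponds to the unregularized setting the captions contrast against.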

What would settle it

After inducing a clear behavior shift in a model and then training on the original fixed explanations, observing that new explanations match the old targets more than the shifted behavior would falsify the central claim.
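The test above only has discriminating power on examples where the old and new labels differ, since elsewhere both references agree. A minimal sketch of that check follows. The function name, the returned fields, and the toy labels are hypothetical.

```python
from typing import Dict, Optional, Sequence


def disagreement_subset_check(
    predicted: Sequence[str],
    self_labels: Sequence[str],
    orig_labels: Sequence[str],
) -> Optional[Dict[str, object]]:
    """Compare Self vs Orig match rates on examples where the two label sets differ."""
    idx = [i for i in range(len(predicted)) if self_labels[i] != orig_labels[i]]
    if not idx:
        return None  # No behavior shift on these examples, so the test is uninformative.
    self_hits = sum(predicted[i] == self_labels[i] for i in idx)
    orig_hits = sum(predicted[i] == orig_labels[i] for i in idx)
    return {
        "n_disagreements": len(idx),
        "self_em": self_hits / len(idx),
        "orig_em": orig_hits / len(idx),
        "supports_coupling": self_hits > orig_hits,
    }


# Toy labels, invented for illustration only.
result = disagreement_subset_check(
    predicted=["a", "b", "c", "d"],
    self_labels=["a", "b", "x", "d"],
    orig_labels=["a", "z", "c", "q"],
)
print(result)
```

A result with `supports_coupling` false on a large disagreement subset would be the falsifying outcome described above.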

Figures

Figures reproduced from arXiv: 2606.32038 by Belinda Z. Li, Jacob Andreas, Laura Ruis, Zifan Carl Guo.

Figure 1: Method overview. (A) First, we sample behaviors B(M0) from the base model M0 on inputs x, x\C and construct labels E(M0) explaining those behaviors. (B) Next, we fine-tune M0 to produce these explanations, yielding Mreg. (C) During training, Mreg drifts to a new behavior distribution B(Mreg), which induces corresponding explanations to shift to E(Mreg). (D) We find that Mreg’s predicted explanations track…
Figure 2: Self > Orig emerges only with regularization, across three counterfactual explanation tasks: Hint-MMLU, AITA, and Refusal. Left block (a, b): the regularized model Mreg. Right block (c, d): the unregularized model Munreg. (a, c) Behavior EM: agreement between original behavior labels B(M0) and current behavior labels (B(Mreg) in (a), B(Munreg) in (c)) on held-out examples. (b, d) Explanation EM: explanatio…
Figure 3: Mechanistic signature of introspection: activation interventions that modify behavior are correlated with …
Figure 4: Behavioral regularization weight λ sweep on Hint-MMLU (§4.1). (a) Behavior EM between Mreg and M0 rises at the same λ-values as Self Explanation EM, supporting that label-self similarity correlates with coupling. (b) Explanation EM scored against labels of Mreg (blue) and M0 (orange). The Self > Orig signature occurs quickly once λ ≥ 5e−3, and the gap stays consistent for bigger λ.
Figure 5: Continuous relabeling with fixed online label–self agreement (§4.2). We control the per-step agreement …
Figure 6: Training on explanation labels from another model (§4.3). We construct explanation labels …
Figure 7: Explanations generalize to newly acquired behaviors (§5.1). We train …
Figure 8: Explanations track shifts to existing behaviors (§5.2). We mix realistic auxiliary post-training corpora into …
Figure 9: Training dynamics of Qwen3-8B on HINT-MMLU.
Figure 10: (a) Regularized explainer Mreg trained on M0’s explanations. Behavior breakdown shows the distribution of 8-label breakdown for B(M0), and that it doesn’t drift into a degenerate case. Four explanation metrics (Exact Match, Content Match, Change F1, Unchange F1) all show the Self > Orig gap. (b) External explainer baseline M(2)reg: base Qwen3-8B trained on E(Mreg). This graph validates that E(Mreg) is no…
Figure 11: Untrained few-shot baseline (M0→M0) Explanation EM on each of the three tasks. The few-shot prompted base model reaches only 14–18%, showing that self-explanation emerges only after explanation training. Following [Li et al., 2025a], we evaluate an “untrained baseline” to see how well these small models can do explanation out-of-the-box without any explanation training, by directly prompting the base mod…
Figure 12: AITA Mreg detailed metrics. Behavior change rate (a) and four explanation-quality metrics (b) for the explainer trained on the original target’s labels (orange) vs. on self labels (blue). Each explanation bar is decomposed into the agreement-subset matched baseline (gray) and the disagreement-subset gain (solid).
Figure 13: Refusal Explanation with Behavioral Regularization. Behavior change rate (a) and four explanation-quality …
Figure 14: Top: Llama-3.1-8B-Instruct trained and regularized on HINT-MMLU. Self > Orig on every explanation metric. Bottom: Qwen3-32B trained and regularized on HINT-MMLU (LoRA r = α = 128). Self > Orig on every explanation metric. B.6 HINT-MMLU with Other Models: To check that the Self > Orig phenomenon is not Qwen3-8B-specific and that it has potential to scale, we replicate the main Mreg Self > Orig results on Ll…
Figure 16: Per-direction (subset) breakdown of the correlation study between cue-ablated answer and explanation.
Figure 17: KL-regularized LoRA rank sweep on HINT-MMLU, …
Figure 18: LR sweep on HINT-MMLU plotted with learning rate decreasing left-to-right. For each learning rate, we plot …
Figure 19: Full six-metric breakdown for Jabberwocky mixed training, a more detailed version of Figure 7.
Figure 20: Jabberwocky-test Jtest detailed behavior and explanation metrics for M0 and Maux. (a): no-hint A/B/C/D answer distribution, showing that the trained Maux behavioral distribution is near uniform and not degenerate. (b): distributions of the two models under hint. As the questions are nonsensical, the model defers to the hint. (c): detailed explanation metrics of the two models: semantic match, Change F1, a…
Original abstract

When does training language models (LMs) to generate explanations of their predictions yield faithful introspection, rather than superficial imitation? We study LMs trained to explain which features of their inputs influenced their behavior, using models' counterfactual behavior on modified inputs as supervision. Surprisingly, we find that LMs trained on fixed counterfactual explanations derived from earlier checkpoints of themselves, or even from behaviorally similar models in different families, frequently produce explanations more faithful to their own current behaviors than to those of their training targets. This "introspective" coupling between LM explanations and behaviors occurs when training explanations remain sufficiently correlated with current behaviors over the course of training, even as behaviors themselves shift. We also show that introspective coupling tracks behavior shifts: when explanation training is provided concurrently with other post-training objectives, explanations track those shifts without requiring updated supervision. This phenomenon appears in multiple tasks, including sycophancy and refusal, and is robust to label noise. Overall, our results show that even fixed datasets of counterfactual explanations can provide scalable and generalizable post-training signal for introspection.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript claims that language models trained to generate explanations using fixed counterfactual explanations (derived from earlier checkpoints of the same model or behaviorally similar models from other families) frequently produce explanations more faithful to their own current behaviors than to the fixed training targets. This 'introspective coupling' is reported to occur when explanation training remains correlated with current behaviors despite behavioral shifts, and the phenomenon tracks shifts induced by concurrent post-training objectives (e.g., in sycophancy and refusal tasks) without requiring updated supervision; results are claimed to be robust to label noise.

Significance. If the empirical comparisons hold with an independent faithfulness metric, the finding would indicate that fixed explanation datasets can serve as a scalable, generalizable post-training signal that adapts to behavioral change, offering a practical route to improving introspection without dynamic counterfactual generation. This would strengthen the case for explanation-based objectives in post-training pipelines.

major comments (2)
  1. [Evaluation section (likely §4)] The central claim requires an independent measure of faithfulness to current vs. target behaviors. If the evaluation metric for 'faithful to current behaviors' uses counterfactual labels generated by the post-training model itself on held-out inputs (as implied by the use of the model's own counterfactual behavior for supervision and scoring), then any coupling induced by the explanation training will artifactually inflate apparent faithfulness to current behavior; this must be ruled out with a pre-training or external model for generating evaluation counterfactuals.
  2. [§3 (methods) and abstract] The condition that 'explanation training remains sufficiently correlated with current behaviors' is invoked to explain when introspective coupling occurs, but no quantitative test or ablation is described showing that this correlation is assessed independently of the model whose explanations are being evaluated.
minor comments (2)
  1. Abstract and results sections report experiments across tasks and robustness to label noise but provide no concrete metrics (e.g., exact faithfulness score definitions), controls, sample sizes, or statistical tests; these details are needed to assess the strength of the reported comparisons.
  2. [Methods] Notation for 'counterfactual behavior' and 'faithfulness' should be defined explicitly with equations or pseudocode in the methods to avoid ambiguity in how current vs. target agreement is computed.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments on evaluation methodology and the need for independent validation of key conditions. We address each major comment below and will incorporate revisions to strengthen the manuscript.

Point-by-point responses
  1. Referee: [Evaluation section (likely §4)] The central claim requires an independent measure of faithfulness to current vs. target behaviors. If the evaluation metric for 'faithful to current behaviors' uses counterfactual labels generated by the post-training model itself on held-out inputs (as implied by the use of the model's own counterfactual behavior for supervision and scoring), then any coupling induced by the explanation training will artifactually inflate apparent faithfulness to current behavior; this must be ruled out with a pre-training or external model for generating evaluation counterfactuals.

    Authors: We agree this is a valid concern regarding potential circularity. The current evaluation of faithfulness to current behaviors does use the post-training model's own counterfactual predictions on held-out inputs for scoring. We will revise §4 to add an independent faithfulness metric that generates evaluation counterfactuals exclusively from pre-training checkpoints and external models (distinct from those used in supervision or training). This will include new experiments confirming that introspective coupling persists under these controls. revision: yes

  2. Referee: [§3 (methods) and abstract] The condition that 'explanation training remains sufficiently correlated with current behaviors' is invoked to explain when introspective coupling occurs, but no quantitative test or ablation is described showing that this correlation is assessed independently of the model whose explanations are being evaluated.

    Authors: We acknowledge that the manuscript invokes the correlation condition primarily through the experimental setup with fixed supervision from earlier checkpoints, without a dedicated quantitative ablation. We will add to §3 a new analysis that computes correlation metrics (e.g., agreement on held-out counterfactual predictions) between the fixed training targets and current model behaviors, using evaluation procedures independent of the explanation-generating model. This will include ablations varying the degree of correlation to demonstrate when coupling fails to occur. revision: yes

Circularity Check

0 steps flagged

No significant circularity in empirical evaluation

Full rationale

The paper reports empirical results on explanation faithfulness using fixed counterfactual supervision derived from earlier model checkpoints or similar models. Claims rest on direct comparisons of generated explanations against current vs. target behaviors across tasks like sycophancy and refusal. No equations, fitted parameters, or derivations are present that reduce any 'prediction' to its own inputs by construction. The abstract's stated condition on correlation is an empirical observation, not a self-referential definition or load-bearing self-citation. Evaluation uses held-out inputs and fixed targets, avoiding the circularity pattern where post-training model outputs serve as both training signal and scoring ground truth. This matches the reader's assessment of low circularity.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

Empirical ML study; central claim rests on experimental observations of explanation faithfulness rather than mathematical derivation. No free parameters, axioms, or invented entities are introduced beyond standard assumptions of supervised fine-tuning.

axioms (1)
  • Domain assumption: Counterfactual behavior on modified inputs provides a valid supervision signal for explanation training.
    Used as the basis for generating fixed training targets from earlier model checkpoints.

pith-pipeline@v0.9.1-grok · 5723 in / 1089 out tokens · 31031 ms · 2026-07-01T05:15:26.555091+00:00 · methodology


Reference graph

Works this paper leans on

51 extracted references · 23 canonical work pages · 10 internal anchors

  1. [1]

    Chain-of-thought is not explainability

    Fazl Barez, Tung-Yu Wu, Iván Arcuschin, Michael Lan, Vincent Wang, Noah Siegel, Nicolas Collignon, Clement Neo, Isabelle Lee, Alasdair Paren, Adel Bibi, Robert Trager, Damiano Fornasiere, John Yan, Yanai Elazar, and Yoshua Bengio. Chain-of-thought is not explainability, 2025. URL https://aigi.ox.ac.uk/wp-content/uploads/2025/07/Cot_Is_Not_Explainabili...

  2. [2]

    Tell me about yourself: LLMs are aware of their learned behaviors

    Jan Betley, Xuchan Bao, Martín Soto, Anna Sztyber-Betley, James Chua, and Owain Evans. Tell me about yourself: LLMs are aware of their learned behaviors. In International Conference on Learning Representations, 2025a. URL https://openreview.net/forum?id=IjQ2Jtemzy

  3. [3]

    Emergent misalignment: Narrow finetuning can produce broadly misaligned LLMs

    Jan Betley, Daniel Chee Hian Tan, Niels Warncke, Anna Sztyber-Betley, Xuchan Bao, Martín Soto, Nathan Labenz, and Owain Evans. Emergent misalignment: Narrow finetuning can produce broadly misaligned LLMs. In International Conference on Machine Learning, 2025b. URL https://openreview.net/forum?id=aOIJ2gVRWW

  4. [4]

    Looking inward: Language models can learn about themselves by introspection

    Felix Jedidja Binder, James Chua, Tomek Korbak, Henry Sleight, John Hughes, Robert Long, Ethan Perez, Miles Turpin, and Owain Evans. Looking inward: Language models can learn about themselves by introspection. In International Conference on Learning Representations, 2025. URL https://openreview.net/forum?id=eb5pkwIB5i

  5. [5]

    Reasoning Models Don't Always Say What They Think

    Yanda Chen, Joe Benton, Ansh Radhakrishnan, Jonathan Uesato, Carson Denison, John Schulman, Arushi Somani, Peter Hase, Misha Wagner, Fabien Roger, Vlad Mikulik, Samuel R. Bowman, Jan Leike, Jared Kaplan, and Ethan Perez. Reasoning models don't always say what they think, 2025. URL https://arxiv.org/abs/2505.05410

  6. [6]

    ELEPHANT: Measuring and understanding social sycophancy in LLMs

    Myra Cheng, Sunny Yu, Cinoo Lee, Pranav Khadpe, Lujain Ibrahim, and Dan Jurafsky. ELEPHANT: Measuring and understanding social sycophancy in LLMs. In International Conference on Learning Representations, 2026. URL https://openreview.net/forum?id=igbRHKEiAs

  7. [7]

    Vicuna: An open-source chatbot impressing GPT-4 with 90%* ChatGPT quality

    Wei-Lin Chiang, Zhuohan Li, Zi Lin, Ying Sheng, Zhanghao Wu, Hao Zhang, Lianmin Zheng, Siyuan Zhuang, Yonghao Zhuang, Joseph E. Gonzalez, Ion Stoica, and Eric P. Xing. Vicuna: An open-source chatbot impressing GPT-4 with 90%* ChatGPT quality, March 2023. URL https://lmsys.org/blog/2023-03-30-vicuna/

  8. [8]

    Scalably extracting latent representations of users

    Dami Choi, Vincent Huang, Sarah Schwettmann, and Jacob Steinhardt. Scalably extracting latent representations of users. https://transluce.org/user-modeling, November 2025

  9. [9]

    Does it make sense to speak of introspection in large language models?

    Iulia M. Comsa and Murray Shanahan. Does it make sense to speak of introspection in large language models?, 2025. URL https://arxiv.org/abs/2506.05068

  10. [10]

    Natural language autoencoders produce unsupervised explanat...

    Kit Fraser-Taliente, Subhash Kantamneni, Euan Ong, Dan Mossing, Christina Lu, Paul C. Bogdan, Emmanuel Ameisen, James Chen, Dzmitry Kishylau, Adam Pearce, Julius Tarng, Alex Wu, Jeff Wu, Yang Zhang, Daniel M. Ziegler, Evan Hubinger, Joshua Batson, Jack Lindsey, Samuel Zimmerman, and Samuel Marks. Natural language autoencoders produce unsupervised explanat...

  11. [11]

    The language model evaluation harness, 07 2024

    Leo Gao, Jonathan Tow, Baber Abbasi, Stella Biderman, Sid Black, Anthony DiPofi, Charles Foster, Laurence Golding, Jeffrey Hsu, Alain Le Noac'h, Haonan Li, Kyle McDonell, Niklas Muennighoff, Chris Ociepa, Jason Phang, Laria Reynolds, Hailey Schoelkopf, Aviya Skowron, Lintang Sutawika, Eric Tang, Anish Thite, Ben Wang, Kevin Wang, and Andy Zou. The languag...

  12. [12]

    Avichal Goel, Yoon Kim, Nir Shavit, and Tony T. Wang. Learning to interpret weight differences in language models, 2025. URL https://arxiv.org/abs/2510.05092

  13. [13]

    The Llama 3 Herd of Models

    Aaron Grattafiori, Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, et al. The Llama 3 herd of models, 2024. URL https://arxiv.org/abs/2407.21783

  14. [14]

    Monitoring monitorability

    Melody Y. Guan, Miles Wang, Micah Carroll, Zehao Dou, Annie Y. Wei, Marcus Williams, Benjamin Arnav, Joost Huizinga, Ian Kivlichan, Mia Glaese, Jakub Pachocki, and Bowen Baker. Monitoring monitorability, 2025. URL https://arxiv.org/abs/2512.18311

  15. [15]

    Counterfactual simulation training for chain-of-thought faithfulness, 2026

    Peter Hase and Christopher Potts. Counterfactual simulation training for chain-of-thought faithfulness, 2026. URL https://arxiv.org/abs/2602.20710

  16. [16]

    Predictive concept decoders: Training scalable end-to-end interpretability assistants, 2025

    Vincent Huang, Dami Choi, Daniel D Johnson, Sarah Schwettmann, and Jacob Steinhardt. Predictive concept decoders: Training scalable end-to-end interpretability assistants, 2025. URL https://arxiv.org/abs/2512.15712

  17. [17]

    Training language models to be warm can reduce accuracy and increase sycophancy

    Lujain Ibrahim, Franziska Sofia Hafner, and Luc Rocher. Training language models to be warm can reduce accuracy and increase sycophancy. Nature, 652(8112):1159–1165, Apr 2026. ISSN 1476-4687. doi:10.1038/s41586-026-10410-0. URL https://doi.org/10.1038/s41586-026-10410-0

  18. [18]

    Wildteaming at scale: From in-the-wild jailbreaks to (adversarially) safer language models

    Liwei Jiang, Kavel Rao, Seungju Han, Allyson Ettinger, Faeze Brahman, Sachin Kumar, Niloofar Mireshghallah, Ximing Lu, Maarten Sap, Yejin Choi, and Nouha Dziri. Wildteaming at scale: From in-the-wild jailbreaks to (adversarially) safer language models. In Advances in Neural Information Processing Systems, 2024. URL https://openreview.net/forum?id=n5R6TvBVcX

  19. [19]

    Training LLMs for honesty via confessions

    Manas Joglekar, Jeremy Chen, Gabriel Wu, Jason Yosinski, Jasmine Wang, Boaz Barak, and Amelia Glaese. Training LLMs for honesty via confessions, 2025. URL https://arxiv.org/abs/2512.08093

  20. [20]

    Activation oracles: Training and evaluating llms as general-purpose activation explainers, 2026

    Adam Karvonen, James Chua, Clément Dumas, Kit Fraser-Taliente, Subhash Kantamneni, Julian Minder, Euan Ong, Arnab Sen Sharma, Daniel Wen, Owain Evans, and Samuel Marks. Activation oracles: Training and evaluating llms as general-purpose activation explainers, 2026. URL https://arxiv.org/abs/2512.15674

  21. [21]

    Chain of Thought Monitorability: A New and Fragile Opportunity for AI Safety

    Tomek Korbak, Mikita Balesni, Elizabeth Barnes, Yoshua Bengio, Joe Benton, Joseph Bloom, Mark Chen, Alan Cooney, Allan Dafoe, Anca Dragan, Scott Emmons, Owain Evans, David Farhi, Ryan Greenblatt, Dan Hendrycks, Marius Hobbhahn, Evan Hubinger, Geoffrey Irving, Erik Jenner, Daniel Kokotajlo, Victoria Krakovna, Shane Legg, David Lindner, David Luan, Aleksand...

  22. [22]

    Me, myself, and AI: The situational awareness dataset (SAD) for LLMs

    Rudolf Laine, Bilal Chughtai, Jan Betley, Kaivalya Hariharan, Mikita Balesni, Jérémy Scheurer, Marius Hobbhahn, Alexander Meinke, and Owain Evans. Me, myself, and AI: The situational awareness dataset (SAD) for LLMs. In Advances in Neural Information Processing Systems Datasets and Benchmarks Track, 2024. URL https://openreview.net/forum?id=UnWhcpIyUC

  23. [23]

    Measuring Faithfulness in Chain-of-Thought Reasoning

    Tamera Lanham, Anna Chen, Ansh Radhakrishnan, Benoit Steiner, Carson Denison, Danny Hernandez, Dustin Li, Esin Durmus, Evan Hubinger, Jackson Kernion, Kamilė Lukošiūtė, Karina Nguyen, Newton Cheng, Nicholas Joseph, Nicholas Schiefer, Oliver Rausch, Robin Larson, Sam McCandlish, Sandipan Kundu, Saurav Kadavath, Shannon Yang, Thomas Henighan, Timothy Maxwel...

  24. [24]

    Emergent Introspection in AI is Content-Agnostic

    Harvey Lederman and Kyle Mahowald. Emergent introspection in AI is content-agnostic, 2026. URL https://arxiv.org/abs/2603.05414

  25. [25]

    Training language models to explain their own computations

    Belinda Z. Li, Zifan Carl Guo, Vincent Huang, Jacob Steinhardt, and Jacob Andreas. Training language models to explain their own computations, 2025a. URL https://arxiv.org/abs/2511.08579

  26. [26]

    Spilling the beans: Teaching LLMs to self-report their hidden objectives

    Chloe Li, Mary Phuong, and Daniel Tan. Spilling the beans: Teaching LLMs to self-report their hidden objectives. In International Conference on Learning Representations, 2026. URL https://openreview.net/forum?id=sWs0cCuM8I

  27. [27]

    Do natural language descriptions of model activations convey privileged information?

    Millicent Li, Alberto Mario Ceballos Arroyo, Giordano Rogers, Naomi Saphra, and Byron C Wallace. Do natural language descriptions of model activations convey privileged information? In Mechanistic Interpretability Workshop at NeurIPS 2025, 2025b. URL https://openreview.net/forum?id=zyhibAkzSA

  28. [28]

    Emergent introspective awareness in large language models

    Jack Lindsey. Emergent introspective awareness in large language models. Transformer Circuits Thread, 2025. URL https://transformer-circuits.pub/2025/introspection/index.html

  29. [29]

    Mechanisms of Introspective Awareness

    Uzay Macar, Li Yang, Atticus Wang, Peter Wallich, Emmanuel Ameisen, and Jack Lindsey. Mechanisms of introspective awareness, 2026. URL https://arxiv.org/abs/2603.21396

  30. [30]

    Andreas Madsen, Sarath Chandar, and Siva Reddy. Are self-explanations from large language models faithful? In Findings of the Association for Computational Linguistics, ACL 2024, Bangkok, Thailand and virtual meeting, August 11-16, 2024 , volume ACL 2024 of Findings of ACL , pages 295--337. Association for Computational Linguistics, 2024. doi:10.18653/V1/...

  31. [31]

    Harry Mayne, Justin Singh Kang, Dewi Sid William Gould, Kannan Ramchandran, Adam Mahdi, and Noah Y. Siegel. A positive case for faithfulness: LLM self-explanations help predict model behavior. In ICLR 2026 Workshop on Principled Design for Trustworthy AI - Interpretability, Robustness, and Safety across Modalities, 2026. URL https://openreview.net/forum?i...

  32. [32]

    Circuit component reuse across tasks in transformer language models

    Jack Merullo, Carsten Eickhoff, and Ellie Pavlick. Circuit component reuse across tasks in transformer language models. In International Conference on Learning Representations, 2024. URL https://openreview.net/forum?id=fpoAYV6Wsk

  33. [33]

    LatentQA: Teaching LLMs to decode activations into natural language

    Alexander Pan, Lijie Chen, and Jacob Steinhardt. LatentQA: Teaching LLMs to decode activations into natural language. In International Conference on Learning Representations, 2026. URL https://openreview.net/forum?id=niUroX9EOd

  34. [34]

    Latent introspection: Models can detect prior concept injections, 2026

    Theia Pearson-Vogel, Martin Vanek, Raymond Douglas, and Jan Kulveit. Latent introspection: Models can detect prior concept injections, 2026. URL https://arxiv.org/abs/2602.20031

  35. [35]

    The fineweb datasets: Decanting the web for the finest text data at scale

    Guilherme Penedo, Hynek Kydlíček, Loubna Ben allal, Anton Lozhkov, Margaret Mitchell, Colin Raffel, Leandro Von Werra, and Thomas Wolf. The fineweb datasets: Decanting the web for the finest text data at scale. In Advances in Neural Information Processing Systems Datasets and Benchmarks Track, 2024. URL https://openreview.net/forum?id=n6SCkn2QaG

  36. [36]

    Self-interpretability: Llms can describe complex internal processes that drive their decisions, 2025

    Dillon Plunkett, Adam Morris, Keerthi Reddy, and Jorge Morales. Self-interpretability: Llms can describe complex internal processes that drive their decisions, 2025. URL https://arxiv.org/abs/2505.17120

  37. [37]

    Position: It's time to optimize for self-consistency

    Itamar Pres, Belinda Z. Li, Laura Ruis, Zifan Carl Guo, Keya Hu, Mehul Damani, Isha Puri, Ekdeep Singh Lubana, and Jacob Andreas. Position: It's time to optimize for self-consistency. In International Conference on Machine Learning Position Paper Track, 2026. URL https://time-for-consistency.github.io/

  38. [38]

    Mrinank Sharma, Meg Tong, Tomasz Korbak, David Duvenaud, Amanda Askell, Samuel R. Bowman, Esin Durmus, Zac Hatfield-Dodds, Scott R Johnston, Shauna M Kravec, Timothy Maxwell, Sam McCandlish, Kamal Ndousse, Oliver Rausch, Nicholas Schiefer, Da Yan, Miranda Zhang, and Ethan Perez. Towards understanding sycophancy in language models. In International Confere...

  39. [39]

    Introspection Adapters: Training LLMs to Report Their Learned Behaviors

    Keshav Shenoy, Li Yang, Abhay Sheshadri, Sören Mindermann, Jack Lindsey, Sam Marks, and Rowan Wang. Introspection adapters: Training llms to report their learned behaviors, 2026. URL https://arxiv.org/abs/2604.16812

  40. [40]

    Latent adversarial training improves robustness to persistent harmful behaviors in LLMs

    Abhay Sheshadri, Aidan Ewart, Phillip Huang Guo, Aengus Lynch, Cindy Wu, Vivek Hebbar, Henry Sleight, Asa Cooper Stickland, Ethan Perez, Dylan Hadfield-Menell, and Stephen Casper. Latent adversarial training improves robustness to persistent harmful behaviors in LLMs. In International Conference on Learning Representations, 2025. URL https://openreview.n...

  41. [41]

    Can LLMs Introspect? A Reality Check

    Shashwat Singh, Tal Linzen, and Shauli Ravfogel. Can llms introspect? a reality check, 2026. URL https://arxiv.org/abs/2605.26242

  42. [42]

    Language models fail to introspect about their knowledge of language

    Siyuan Song, Jennifer Hu, and Kyle Mahowald. Language models fail to introspect about their knowledge of language. In Conference on Language Modeling, 2025. URL https://openreview.net/forum?id=AivRDOFi5H

  43. [43]

    Privileged self-access matters for introspection in AI

    Siyuan Song, Harvey Lederman, Jennifer Hu, and Kyle Mahowald. Privileged self-access matters for introspection in AI. In ICML 2026 Workshop: Philosophy Meets Machine Learning, 2026. URL https://openreview.net/forum?id=ZcqCJHOWAA

  44. [44]

    Miles Turpin, Julian Michael, Ethan Perez, and Samuel R. Bowman. Language models don't always say what they think: Unfaithful explanations in chain-of-thought prompting. In Advances in Neural Information Processing Systems, 2023. URL https://openreview.net/forum?id=bzs4uPLXvi

  45. [45]

    Interpretability in the wild: a circuit for indirect object identification in GPT-2 small

    Kevin Ro Wang, Alexandre Variengien, Arthur Conmy, Buck Shlegeris, and Jacob Steinhardt. Interpretability in the wild: a circuit for indirect object identification in GPT-2 small. In International Conference on Learning Representations, 2023. URL https://openreview.net/forum?id=NpsVSN6o4ul

  46. [46]

    Chain of thought prompting elicits reasoning in large language models

    Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, brian ichter, Fei Xia, Ed H. Chi, Quoc V Le, and Denny Zhou. Chain of thought prompting elicits reasoning in large language models. In Advances in Neural Information Processing Systems, 2022. URL https://openreview.net/forum?id=_VjQlMeSB_J

  47. [47]

    Jerry Wei, Da Huang, Yifeng Lu, Denny Zhou, and Quoc V. Le. Simple synthetic data reduces sycophancy in large language models, 2024. URL https://arxiv.org/abs/2308.03958

  48. [48]

    Qwen3 Technical Report

    An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, Chujie Zheng, Dayiheng Liu, Fan Zhou, Fei Huang, Feng Hu, Hao Ge, et al. Qwen3 technical report, 2025. URL https://arxiv.org/abs/2505.09388

  49. [49]

    Zhehao Zhang, Weijie Xu, Fanyou Wu, and Chandan K. Reddy. Falsereject: A resource for improving contextual safety and mitigating over-refusals in LLMs via structured reasoning. In Conference on Language Modeling, 2025. URL https://openreview.net/forum?id=1w9Hay7tvm

  50. [50]

    Wildchat: 1M ChatGPT interaction logs in the wild

    Wenting Zhao, Xiang Ren, Jack Hessel, Claire Cardie, Yejin Choi, and Yuntian Deng. Wildchat: 1M ChatGPT interaction logs in the wild. In International Conference on Learning Representations, 2024. URL https://openreview.net/forum?id=Bl8u7ZRlbM

  51. [51]

    Spontaneous introspection in output tampering

    Ziqian Zhong. Spontaneous introspection in output tampering. LessWrong, April 2026. URL https://www.lesswrong.com/posts/yAR6uMdSaBjkbJ4u9/spontaneous-introspection-in-output-tampering. Accessed: 2026-05-06