Forecasting Future Behavior as a Learning Task

Asa Cooper Stickland; Mosh Levy; Yoav Goldberg

arxiv: 2606.11445 · v1 · pith:WAY2MQTRnew · submitted 2026-06-09 · 💻 cs.AI

Forecasting Future Behavior as a Learning Task

Mosh Levy , Yoav Goldberg , Asa Cooper Stickland This is my paper

Pith reviewed 2026-06-27 12:53 UTC · model grok-4.3

classification 💻 cs.AI

keywords behavior forecastinglarge reasoning modelsreasoning trajectoriesfine-tuningmodel behavior predictionAI system trusttrajectory analysis

0 comments

The pith

Behavior Forecasters trained on LRM trajectories predict future behavior more accurately than frontier models reading the same trajectories.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper proposes to treat forecasting an LRM's future behavior from its reasoning trajectory as a supervised learning task instead of relying on human-readable explanations. Training data is generated automatically by querying the target LRM multiple times on the same inputs, allowing a smaller model to learn predictions such as answer repetition on re-runs or changes after input edits. Across three reasoning datasets, these trained forecasters outperform GPT-5.4 and Claude Opus-4.6 used as naive readers of the same trajectories while requiring only a single forward pass at much lower cost. Strong results depend on initializing the forecaster from the target LRM and performing end-to-end fine-tuning. The findings indicate that trajectories encode signals about future behavior that direct reading does not capture.

Core claim

We propose an alternative that bypasses the explanation step: treat behavior forecasting as a learnable task and train Behavior Forecasters that operates on a single reasoning trajectory to make the same forecasts one would typically seek from an explanation. The forecaster's training data is obtained by querying the LRM with no human annotation, and its inference is done in a single forward pass. We instantiate this approach on two tasks: how likely the LRM is to repeat its answer on re-runs, and how removing parts of the input changes its answer. We evaluate this approach on both tasks across three diverse reasoning datasets and find that trained Behavior Forecasters are more accurate than

What carries the argument

Behavior Forecaster: a smaller model initialized from the target LRM, fine-tuned end-to-end to output behavior predictions such as repetition likelihood from a single reasoning trajectory in one forward pass.

If this is right

Trained Behavior Forecasters achieve higher accuracy than GPT-5.4 or Claude Opus-4.6 reading the same trajectories as naive readers.
Forecasting requires only a single forward pass at a small fraction of the inference cost of frontier models.
Fine-tuning the backbone end-to-end and initializing it from the target LRM are each necessary for the performance gains.
The reasoning trajectory encodes information about the LRM's future behavior beyond what naive reading conveys.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The method could be extended to forecast additional behaviors such as hallucination rates or output length without new human labels.
Behavior Forecasters could function as lightweight monitors that run alongside deployed LRMs to anticipate problems before they appear.
Training forecasters on open-weight models would allow third parties to build behavior predictors without access to the original proprietary systems.

Load-bearing premise

The assumption that fine-tuning the backbone end-to-end and initializing it from the target LRM are each necessary for strong performance and that the trajectory carries information about future behavior beyond what naive reading conveys.

What would settle it

An experiment in which a forecaster not initialized from the target LRM or trained without end-to-end fine-tuning matches the accuracy of the proposed forecasters on the repetition-probability and input-sensitivity tasks.

Figures

Figures reproduced from arXiv: 2606.11445 by Asa Cooper Stickland, Mosh Levy, Yoav Goldberg.

**Figure 2.** Figure 2: A trained Behavior Forecaster is more accurate than two frontier naive readers and the [PITH_FULL_IMAGE:figures/full_fig_p006_2.png] view at source ↗

read the original abstract

Trust in an AI system is often anchored by explanations of how it works, which one then uses to forecast its behavior on new inputs. For large reasoning models (LRMs), this conventional route is particularly difficult to follow: explanation methods for single token generations do not naturally generalize to long trajectories, and the trajectories themselves are often not faithful when read as natural language. We propose an alternative that bypasses the explanation step: treat behavior forecasting as a learnable task and train Behavior Forecasters that operates on a single reasoning trajectory to make the same forecasts one would typically seek from an explanation. The forecaster's training data is obtained by querying the LRM with no human annotation, and its inference is done in a single forward pass. We instantiate this approach on two tasks: how likely the LRM is to repeat its answer on re-runs, and how removing parts of the input changes its answer. We evaluate this approach on both tasks across three diverse reasoning datasets and find that trained Behavior Forecasters are more accurate than GPT-5.4 and Claude Opus-4.6 reading the same trajectories as naive readers, at a small fraction of their inference cost. We find that fine-tuning the backbone end-to-end and initializing it from the target LRM are each necessary for strong performance. These results show that the reasoning trajectory carries information about the LRM's future behavior that goes beyond what naive reading conveys.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

The paper shows you can train a forecaster from the target LRM on its own trajectories to predict repeatability and input sensitivity better than prompting GPT-5.4 or Claude Opus-4.6, with init and end-to-end tuning both required.

read the letter

The core move here is to skip explanations entirely and treat forecasting an LRM's future behavior as a supervised task. Labels come from extra queries to the same model, the forecaster runs in one pass on a single trajectory, and it is initialized from the target model then fine-tuned end-to-end. They test this on repeatability under re-runs and sensitivity to input edits, across three reasoning datasets.

The necessity claims are useful: both the initialization from the target LRM and the end-to-end fine-tuning step are reported as required for the performance gain. That directly tests whether the trajectory itself carries extra signal beyond just having access to the model weights.

The abstract gives no numbers, no dataset sizes, no prompting details for the baseline models, and no ablation tables beyond the necessity result. Without those, it is hard to judge whether the reported outperformance is large enough to matter or whether the experimental controls are tight. The evaluation claims rest on an unspecified design.

The framing is clean and avoids the faithfulness problems that come with reading trajectories as explanations. It is aimed at people who need cheap, repeatable forecasts of model behavior for reliability work. The paper deserves a serious referee because the setup is coherent, the necessity experiments address a key alternative explanation, and the full experimental details would let readers assess whether the gains hold up.

Referee Report

0 major / 1 minor

Summary. The paper claims that behavior forecasting for large reasoning models can be framed as a supervised learning task where labels are generated by querying the target model itself. A Behavior Forecaster is trained on reasoning trajectories to predict future behaviors such as answer repetition and sensitivity to input changes. The trained forecaster outperforms larger models like GPT-5.4 and Claude Opus-4.6 on the same task when given the trajectories, at lower cost, and both initialization from the target model and end-to-end fine-tuning are necessary for good performance. This is demonstrated on three reasoning datasets for two tasks.

Significance. If the results hold, this work provides evidence that reasoning trajectories contain information about future model behavior that is not easily accessible through naive reading by even larger models. It offers a practical, scalable method for forecasting LRM behavior without human annotation or complex explanation techniques. The emphasis on self-supervised label generation and the necessity of model-specific training are positive aspects that strengthen the contribution.

minor comments (1)

[Abstract] The abstract states the outperformance claim but does not include any specific metrics, dataset details, or experimental controls, which would help readers assess the results quickly.

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for the supportive summary and recommendation of minor revision. No specific major comments were raised in the report, so we have no points requiring detailed rebuttal at this stage. We will address any minor issues identified during the revision process.

Circularity Check

0 steps flagged

No significant circularity

full rationale

The paper describes a supervised learning setup in which labels for future LRM behavior are generated by separate queries to the target model (no human annotation), after which a distinct forecaster model—initialized from the LRM and fine-tuned end-to-end—is trained to predict those labels from a single trajectory. This process does not reduce any claimed prediction to its own inputs by construction, nor does it rely on self-citation chains, uniqueness theorems, or smuggled ansatzes. The evaluation compares the trained forecaster against larger models prompted on the same trajectories, and the paper states that both initialization and end-to-end fine-tuning were found necessary. No load-bearing step matches any of the enumerated circularity patterns.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 1 invented entities

The central claim rests on the domain assumption that trajectories encode usable signals about future behavior and on the empirical necessity of specific fine-tuning choices.

axioms (1)

domain assumption Reasoning trajectories contain information about the LRM's future behavior that exceeds what naive reading extracts
Invoked to justify training forecasters instead of using explanations

invented entities (1)

Behavior Forecaster no independent evidence
purpose: Model trained to predict LRM behavior from a single trajectory
New model class introduced in the paper

pith-pipeline@v0.9.1-grok · 5778 in / 1173 out tokens · 19089 ms · 2026-06-27T12:53:39.893160+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

168 extracted references · 7 canonical work pages

[1]

arXiv preprint arXiv:2512.12777 , year=

State over Tokens: Characterizing the Role of Reasoning Tokens , author=. arXiv preprint arXiv:2512.12777 , year=

arXiv
[2]

Nature , volume=

Machine behaviour , author=. Nature , volume=. 2019 , publisher=

2019
[3]

Journal of Social Computing , year=

Generative Models as a Complex Systems Science: How Can We Make Sense of Large Language Model Behavior? , author=. Journal of Social Computing , year=
[4]

Beyond Accuracy: Behavioral Testing of

Ribeiro, Marco Tulio and Wu, Tongshuang and Guestrin, Carlos and Singh, Sameer , booktitle=. Beyond Accuracy: Behavioral Testing of. 2020 , doi=

2020
[5]

ICML Workshop on Human Interpretability in Machine Learning , year=

On the Robustness of Interpretability Methods , author=. ICML Workshop on Human Interpretability in Machine Learning , year=
[6]

International Conference on Learning Representations , year=

Self-Consistency Improves Chain of Thought Reasoning in Language Models , author=. International Conference on Learning Representations , year=
[7]

Evaluating Explainable

Hase, Peter and Bansal, Mohit , booktitle=. Evaluating Explainable. 2020 , doi=

2020
[8]

2019 , howpublished=

The Bitter Lesson , author=. 2019 , howpublished=

2019
[9]

arXiv preprint arXiv:1702.08608 , year=

Towards a rigorous science of interpretable machine learning , author=. arXiv preprint arXiv:1702.08608 , year=

Pith/arXiv arXiv
[10]

Artificial Intelligence , volume=

Explanation in Artificial Intelligence: Insights from the Social Sciences , author=. Artificial Intelligence , volume=. 2019 , doi=

2019
[11]

arXiv preprint arXiv:2512.07810 , year=

Auditing Games for Sandbagging , author=. arXiv preprint arXiv:2512.07810 , year=

arXiv
[12]

arXiv preprint arXiv:2512.18311 , year=

Monitoring Monitorability , author=. arXiv preprint arXiv:2512.18311 , year=

arXiv
[13]

2025 , eprint=

Monitoring Reasoning Models for Misbehavior and the Risks of Promoting Obfuscation , author=. 2025 , eprint=

2025
[14]

2509.13334 , archivePrefix=

Anand Swaroop and Akshat Nallani and Saksham Uboweja and Adiliia Uzdenova and Michael Nguyen and Kevin Zhu and Sunishchal Dev and Ashwinee Panda and Vasu Sharma and Maheep Chaudhary , year=. 2509.13334 , archivePrefix=

arXiv
[15]

CoRR , volume =

Joshua Goodman , title =. CoRR , volume =. 2001 , url =

2001
[16]

Patterns , volume=

Explainability pitfalls: Beyond dark patterns in explainable AI , author=. Patterns , volume=. 2024 , publisher=

2024
[17]

Jmir Ai , volume=

How explainable artificial intelligence can increase or decrease clinicians' trust in AI applications in health care: systematic review , author=. Jmir Ai , volume=. 2024 , publisher=

2024
[18]

2025 , note =

International AI Safety Report: First Key Update: Capabilities and Risk Implications , institution =. 2025 , note =

2025
[19]

Cognitive science , volume=

Finding structure in time , author=. Cognitive science , volume=. 1990 , publisher=

1990
[20]

2025 , eprint=

Large language models can learn and generalize steganographic chain-of-thought under process supervision , author=. 2025 , eprint=

2025
[21]

1980 , publisher=

Metaphors we live by , author=. 1980 , publisher=

1980
[22]

Transactions of the Association for Computational Linguistics , volume =

Wang, Ling and Zhao, Qi and Li, Ming , title =. Transactions of the Association for Computational Linguistics , volume =. 2024 , doi =

2024
[23]

2024 , eprint=

Let's Think Dot by Dot: Hidden Computation in Transformer Language Models , author=. 2024 , eprint=

2024
[24]

arXiv preprint arXiv:2502.06807 , year =

Competitive Programming with Large Reasoning Models , author =. arXiv preprint arXiv:2502.06807 , year =

arXiv
[25]

2025 , note =

Gemini 2.5: Pushing the Frontier with Advanced Reasoning, Multimodality, Long Context, and Next Generation Agentic Capabilities , author =. 2025 , note =

2025
[26]

arXiv preprint arXiv:2506.22992 , year=

MARBLE: A Hard Benchmark for Multimodal Spatial Reasoning and Planning , author=. arXiv preprint arXiv:2506.22992 , year=

arXiv
[27]

Joshua T. Goodman. A bit of progress in language modeling. Computer Speech & Language. 2001. doi:10.1006/csla.2001.0174

work page doi:10.1006/csla.2001.0174 2001
[28]

CoRR , volume =

Rebecca Hwa , title =. CoRR , volume =. 1999 , url =

1999
[29]

Supervised Grammar Induction using Training Data with Limited Constituent Information

Hwa, Rebecca. Supervised Grammar Induction using Training Data with Limited Constituent Information. Proceedings of the 37th Annual Meeting of the Association for Computational Linguistics. 1999

1999
[30]

, title =

Jurafsky, Daniel and Martin, James H. , title =
[31]

, author=

The Use of Participles and Gerunds. , author=. Online Submission , year=
[32]

2024 , url=

OpenAI o1 System Card , author=. 2024 , url=

2024
[33]

arXiv preprint arXiv:2411.14405 , year=

Marco-o1: Towards Open Reasoning Models for Open-Ended Solutions , author=. arXiv preprint arXiv:2411.14405 , year=

arXiv
[34]

arXiv preprint arXiv:2501.12948 , year=

DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning , author=. arXiv preprint arXiv:2501.12948 , year=

Pith/arXiv arXiv
[35]

Qwen3.5: Accelerating Productivity with Native Multimodal Agents , url =

Qwen Team , month =. Qwen3.5: Accelerating Productivity with Native Multimodal Agents , url =
[36]

International Conference on Learning Representations , year=

From Sparse Dependence to Sparse Attention: Unveiling How Chain-of-Thought Enhances Transformer Sample Efficiency , author=. International Conference on Learning Representations , year=
[37]

arXiv preprint arXiv:2410.21333 , year=

Mind Your Step (by Step): Chain-of-Thought can Reduce Performance on Tasks where Thinking Makes Humans Worse , author=. arXiv preprint arXiv:2410.21333 , year=

arXiv
[38]

Lindsey, Jack and Gurnee, Wes and Ameisen, Emmanuel and Chen, Brian and Pearce, Adam and Turner, Nicholas L. and Citro, Craig and Abrahams, David and Carter, Shan and Hosmer, Basil and Marcus, Jonathan and Sklar, Michael and Templeton, Adly and Bricken, Trenton and McDougall, Callum and Cunningham, Hoagy and Henighan, Thomas and Jermyn, Adam and Jones, An...
[39]

arXiv preprint arXiv:2411.15124 , year=

Tulu 3: Pushing frontiers in open language model post-training , author=. arXiv preprint arXiv:2411.15124 , year=

Pith/arXiv arXiv
[40]

2025 , eprint=

Towards Better Chain-of-Thought: A Reflection on Effectiveness and Faithfulness , author=. 2025 , eprint=

2025
[41]

arXiv preprint arXiv:2508.00943 , year=

Llms can covertly sandbag on capability evaluations against chain-of-thought monitoring , author=. arXiv preprint arXiv:2508.00943 , year=

arXiv
[42]

2024 , eprint=

Making Reasoning Matter: Measuring and Improving Faithfulness of Chain-of-Thought Reasoning , author=. 2024 , eprint=

2024
[43]

2025 , eprint=

Typed Chain-of-Thought: A Curry-Howard Framework for Verifying LLM Reasoning , author=. 2025 , eprint=

2025
[44]

2025 , eprint=

How does Chain of Thought Think? Mechanistic Interpretability of Chain-of-Thought Reasoning with Sparse Autoencoding , author=. 2025 , eprint=

2025
[45]

arXiv preprint arXiv:2507.06261 , year=

Gemini 2.5: Pushing the frontier with advanced reasoning, multimodality, long context, and next generation agentic capabilities , author=. arXiv preprint arXiv:2507.06261 , year=

Pith/arXiv arXiv
[46]

Philosophy of Science , year=

Studies in the Logic of Explanation , author=. Philosophy of Science , year=. doi:10.1086/286983 , url=

work page doi:10.1086/286983
[47]

Philosophy of Science , year=

Reintroducing Prediction to Explanation , author=. Philosophy of Science , year=. doi:10.1086/648111 , url=

work page doi:10.1086/648111
[48]

Proceedings of the 2021 ACM conference on fairness, accountability, and transparency , pages=

Formalizing trust in artificial intelligence: Prerequisites, causes and goals of human trust in AI , author=. Proceedings of the 2021 ACM conference on fairness, accountability, and transparency , pages=

2021
[49]

Proceedings of the 2022 ACM Conference on Fairness, Accountability, and Transparency , year=

How Explainability Contributes to Trust in AI , author=. Proceedings of the 2022 ACM Conference on Fairness, Accountability, and Transparency , year=

2022
[50]

Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics , year=

Towards Faithfully Interpretable NLP Systems: How Should We Define and Evaluate Faithfulness? , author=. Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics , year=
[51]

Advances in Neural Information Processing Systems , volume=

Language Models Don't Always Say What They Think: Unfaithful Explanations in Chain-of-Thought Prompting , author=. Advances in Neural Information Processing Systems , volume=. 2023 , url=

2023
[52]

2025 , eprint=

Unspoken Hints: Accuracy Without Acknowledgement in LLM Reasoning , author=. 2025 , eprint=

2025
[53]

arXiv preprint arXiv:2503.08679 , year=

Chain-of-thought reasoning in the wild is not always faithful , author=. arXiv preprint arXiv:2503.08679 , year=

Pith/arXiv arXiv
[54]

plausibility: On the (un) reliability of explanations from large language models , author=

Faithfulness vs. plausibility: On the (un) reliability of explanations from large language models , author=. arXiv preprint arXiv:2402.04614 , year=

arXiv
[55]

Journal of Machine Learning Research , volume=

Causal abstraction: A theoretical foundation for mechanistic interpretability , author=. Journal of Machine Learning Research , volume=
[56]

arXiv preprint arXiv:2307.13702 , year=

Measuring Faithfulness in Chain-of-Thought Reasoning , author=. arXiv preprint arXiv:2307.13702 , year=

Pith/arXiv arXiv
[57]

2025 , eprint=

s1: Simple test-time scaling , author=. 2025 , eprint=

2025
[58]

arXiv preprint arXiv:2310.02226 , year=

Think before you speak: Training Language Models With Pause Tokens , author=. arXiv preprint arXiv:2310.02226 , year=

arXiv
[59]

arXiv preprint arXiv:2212.03827 , year=

Discovering Latent Knowledge in Language Models Without Supervision , author=. arXiv preprint arXiv:2212.03827 , year=

Pith/arXiv arXiv
[60]

Nature Machine Intelligence , volume=

Stop explaining black box machine learning models for high stakes decisions and use interpretable models instead , author=. Nature Machine Intelligence , volume=. 2019 , publisher=

2019
[61]

1996 , publisher=

Using Language , author=. 1996 , publisher=

1996
[62]

International Journal of Human-Computer Studies , volume =

Lumsden, Karen and Moran, Alan , title =. International Journal of Human-Computer Studies , volume =. 2005 , doi =

2005
[63]

Chi, Michelene T. H. , title =. Educational Psychologist , volume =. 2001 , doi =

2001
[64]

Fonseca, Bea and Chi, Michelene T. H. , title =. Trends in Cognitive Sciences , volume =. 2009 , doi =

2009
[65]

Advances in neural information processing systems , volume=

Chain-of-thought prompting elicits reasoning in large language models , author=. Advances in neural information processing systems , volume=
[66]

2025 , eprint=

Are DeepSeek R1 And Other Reasoning Models More Faithful? , author=. 2025 , eprint=

2025
[67]

Chain-of-Thought Is Not Explainability , publisher =

Barez, Fazl and Wu, Tung-Yu and Arcuschin, Iv\'. Chain-of-Thought Is Not Explainability , publisher =
[68]

2023 , eprint=

A Survey on Evaluation of Large Language Models , author=. 2023 , eprint=

2023
[69]

ArXiv , year=

Can LLMs Explain Themselves Counterfactually? , author=. ArXiv , year=
[70]

ArXiv , year=

How Well Can Reasoning Models Identify and Recover from Unhelpful Thoughts? , author=. ArXiv , year=
[71]

arXiv preprint arXiv:2406.10625 , year=

On the hardness of faithful chain-of-thought reasoning in large language models , author=. arXiv preprint arXiv:2406.10625 , year=

arXiv
[72]

arXiv preprint arXiv:2405.04382 , year=

Large language models cannot explain themselves , author=. arXiv preprint arXiv:2405.04382 , year=

arXiv
[73]

ArXiv , year=

MME-CoT: Benchmarking Chain-of-Thought in Large Multimodal Models for Reasoning Quality, Robustness, and Efficiency , author=. ArXiv , year=
[74]

arXiv preprint arXiv:2405.15092 , year=

Dissociation of faithful and unfaithful reasoning in llms , author=. arXiv preprint arXiv:2405.15092 , year=

arXiv
[75]

Advances in neural information processing systems , volume=

Large language models are zero-shot reasoners , author=. Advances in neural information processing systems , volume=
[76]

2025 , eprint=

A Pragmatic Way to Measure Chain-of-Thought Monitorability , author=. 2025 , eprint=

2025
[77]

2025 , eprint=

The Geometry of Reasoning: Flowing Logics in Representation Space , author=. 2025 , eprint=

2025
[78]

ArXiv , year=

Evaluating Step-by-step Reasoning Traces: A Survey , author=. ArXiv , year=
[79]

ICLR , year=

Emergent world representations: Exploring a sequence model trained on a synthetic task , author=. ICLR , year=
[80]

Understanding intermediate layers using linear classifier probes , author=

Showing first 80 references.

[1] [1]

arXiv preprint arXiv:2512.12777 , year=

State over Tokens: Characterizing the Role of Reasoning Tokens , author=. arXiv preprint arXiv:2512.12777 , year=

arXiv

[2] [2]

Nature , volume=

Machine behaviour , author=. Nature , volume=. 2019 , publisher=

2019

[3] [3]

Journal of Social Computing , year=

Generative Models as a Complex Systems Science: How Can We Make Sense of Large Language Model Behavior? , author=. Journal of Social Computing , year=

[4] [4]

Beyond Accuracy: Behavioral Testing of

Ribeiro, Marco Tulio and Wu, Tongshuang and Guestrin, Carlos and Singh, Sameer , booktitle=. Beyond Accuracy: Behavioral Testing of. 2020 , doi=

2020

[5] [5]

ICML Workshop on Human Interpretability in Machine Learning , year=

On the Robustness of Interpretability Methods , author=. ICML Workshop on Human Interpretability in Machine Learning , year=

[6] [6]

International Conference on Learning Representations , year=

Self-Consistency Improves Chain of Thought Reasoning in Language Models , author=. International Conference on Learning Representations , year=

[7] [7]

Evaluating Explainable

Hase, Peter and Bansal, Mohit , booktitle=. Evaluating Explainable. 2020 , doi=

2020

[8] [8]

2019 , howpublished=

The Bitter Lesson , author=. 2019 , howpublished=

2019

[9] [9]

arXiv preprint arXiv:1702.08608 , year=

Towards a rigorous science of interpretable machine learning , author=. arXiv preprint arXiv:1702.08608 , year=

Pith/arXiv arXiv

[10] [10]

Artificial Intelligence , volume=

Explanation in Artificial Intelligence: Insights from the Social Sciences , author=. Artificial Intelligence , volume=. 2019 , doi=

2019

[11] [11]

arXiv preprint arXiv:2512.07810 , year=

Auditing Games for Sandbagging , author=. arXiv preprint arXiv:2512.07810 , year=

arXiv

[12] [12]

arXiv preprint arXiv:2512.18311 , year=

Monitoring Monitorability , author=. arXiv preprint arXiv:2512.18311 , year=

arXiv

[13] [13]

2025 , eprint=

Monitoring Reasoning Models for Misbehavior and the Risks of Promoting Obfuscation , author=. 2025 , eprint=

2025

[14] [14]

2509.13334 , archivePrefix=

Anand Swaroop and Akshat Nallani and Saksham Uboweja and Adiliia Uzdenova and Michael Nguyen and Kevin Zhu and Sunishchal Dev and Ashwinee Panda and Vasu Sharma and Maheep Chaudhary , year=. 2509.13334 , archivePrefix=

arXiv

[15] [15]

CoRR , volume =

Joshua Goodman , title =. CoRR , volume =. 2001 , url =

2001

[16] [16]

Patterns , volume=

Explainability pitfalls: Beyond dark patterns in explainable AI , author=. Patterns , volume=. 2024 , publisher=

2024

[17] [17]

Jmir Ai , volume=

How explainable artificial intelligence can increase or decrease clinicians' trust in AI applications in health care: systematic review , author=. Jmir Ai , volume=. 2024 , publisher=

2024

[18] [18]

2025 , note =

International AI Safety Report: First Key Update: Capabilities and Risk Implications , institution =. 2025 , note =

2025

[19] [19]

Cognitive science , volume=

Finding structure in time , author=. Cognitive science , volume=. 1990 , publisher=

1990

[20] [20]

2025 , eprint=

Large language models can learn and generalize steganographic chain-of-thought under process supervision , author=. 2025 , eprint=

2025

[21] [21]

1980 , publisher=

Metaphors we live by , author=. 1980 , publisher=

1980

[22] [22]

Transactions of the Association for Computational Linguistics , volume =

Wang, Ling and Zhao, Qi and Li, Ming , title =. Transactions of the Association for Computational Linguistics , volume =. 2024 , doi =

2024

[23] [23]

2024 , eprint=

Let's Think Dot by Dot: Hidden Computation in Transformer Language Models , author=. 2024 , eprint=

2024

[24] [24]

arXiv preprint arXiv:2502.06807 , year =

Competitive Programming with Large Reasoning Models , author =. arXiv preprint arXiv:2502.06807 , year =

arXiv

[25] [25]

2025 , note =

Gemini 2.5: Pushing the Frontier with Advanced Reasoning, Multimodality, Long Context, and Next Generation Agentic Capabilities , author =. 2025 , note =

2025

[26] [26]

arXiv preprint arXiv:2506.22992 , year=

MARBLE: A Hard Benchmark for Multimodal Spatial Reasoning and Planning , author=. arXiv preprint arXiv:2506.22992 , year=

arXiv

[27] [27]

Joshua T. Goodman. A bit of progress in language modeling. Computer Speech & Language. 2001. doi:10.1006/csla.2001.0174

work page doi:10.1006/csla.2001.0174 2001

[28] [28]

CoRR , volume =

Rebecca Hwa , title =. CoRR , volume =. 1999 , url =

1999

[29] [29]

Supervised Grammar Induction using Training Data with Limited Constituent Information

Hwa, Rebecca. Supervised Grammar Induction using Training Data with Limited Constituent Information. Proceedings of the 37th Annual Meeting of the Association for Computational Linguistics. 1999

1999

[30] [30]

, title =

Jurafsky, Daniel and Martin, James H. , title =

[31] [31]

, author=

The Use of Participles and Gerunds. , author=. Online Submission , year=

[32] [32]

2024 , url=

OpenAI o1 System Card , author=. 2024 , url=

2024

[33] [33]

arXiv preprint arXiv:2411.14405 , year=

Marco-o1: Towards Open Reasoning Models for Open-Ended Solutions , author=. arXiv preprint arXiv:2411.14405 , year=

arXiv

[34] [34]

arXiv preprint arXiv:2501.12948 , year=

DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning , author=. arXiv preprint arXiv:2501.12948 , year=

Pith/arXiv arXiv

[35] [35]

Qwen3.5: Accelerating Productivity with Native Multimodal Agents , url =

Qwen Team , month =. Qwen3.5: Accelerating Productivity with Native Multimodal Agents , url =

[36] [36]

International Conference on Learning Representations , year=

From Sparse Dependence to Sparse Attention: Unveiling How Chain-of-Thought Enhances Transformer Sample Efficiency , author=. International Conference on Learning Representations , year=

[37] [37]

arXiv preprint arXiv:2410.21333 , year=

Mind Your Step (by Step): Chain-of-Thought can Reduce Performance on Tasks where Thinking Makes Humans Worse , author=. arXiv preprint arXiv:2410.21333 , year=

arXiv

[38] [38]

Lindsey, Jack and Gurnee, Wes and Ameisen, Emmanuel and Chen, Brian and Pearce, Adam and Turner, Nicholas L. and Citro, Craig and Abrahams, David and Carter, Shan and Hosmer, Basil and Marcus, Jonathan and Sklar, Michael and Templeton, Adly and Bricken, Trenton and McDougall, Callum and Cunningham, Hoagy and Henighan, Thomas and Jermyn, Adam and Jones, An...

[39] [39]

arXiv preprint arXiv:2411.15124 , year=

Tulu 3: Pushing frontiers in open language model post-training , author=. arXiv preprint arXiv:2411.15124 , year=

Pith/arXiv arXiv

[40] [40]

2025 , eprint=

Towards Better Chain-of-Thought: A Reflection on Effectiveness and Faithfulness , author=. 2025 , eprint=

2025

[41] [41]

arXiv preprint arXiv:2508.00943 , year=

Llms can covertly sandbag on capability evaluations against chain-of-thought monitoring , author=. arXiv preprint arXiv:2508.00943 , year=

arXiv

[42] [42]

2024 , eprint=

Making Reasoning Matter: Measuring and Improving Faithfulness of Chain-of-Thought Reasoning , author=. 2024 , eprint=

2024

[43] [43]

2025 , eprint=

Typed Chain-of-Thought: A Curry-Howard Framework for Verifying LLM Reasoning , author=. 2025 , eprint=

2025

[44] [44]

2025 , eprint=

How does Chain of Thought Think? Mechanistic Interpretability of Chain-of-Thought Reasoning with Sparse Autoencoding , author=. 2025 , eprint=

2025

[45] [45]

arXiv preprint arXiv:2507.06261 , year=

Gemini 2.5: Pushing the frontier with advanced reasoning, multimodality, long context, and next generation agentic capabilities , author=. arXiv preprint arXiv:2507.06261 , year=

Pith/arXiv arXiv

[46] [46]

Philosophy of Science , year=

Studies in the Logic of Explanation , author=. Philosophy of Science , year=. doi:10.1086/286983 , url=

work page doi:10.1086/286983

[47] [47]

Philosophy of Science , year=

Reintroducing Prediction to Explanation , author=. Philosophy of Science , year=. doi:10.1086/648111 , url=

work page doi:10.1086/648111

[48] [48]

Proceedings of the 2021 ACM conference on fairness, accountability, and transparency , pages=

Formalizing trust in artificial intelligence: Prerequisites, causes and goals of human trust in AI , author=. Proceedings of the 2021 ACM conference on fairness, accountability, and transparency , pages=

2021

[49] [49]

Proceedings of the 2022 ACM Conference on Fairness, Accountability, and Transparency , year=

How Explainability Contributes to Trust in AI , author=. Proceedings of the 2022 ACM Conference on Fairness, Accountability, and Transparency , year=

2022

[50] [50]

Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics , year=

Towards Faithfully Interpretable NLP Systems: How Should We Define and Evaluate Faithfulness? , author=. Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics , year=

[51] [51]

Advances in Neural Information Processing Systems , volume=

Language Models Don't Always Say What They Think: Unfaithful Explanations in Chain-of-Thought Prompting , author=. Advances in Neural Information Processing Systems , volume=. 2023 , url=

2023

[52] [52]

2025 , eprint=

Unspoken Hints: Accuracy Without Acknowledgement in LLM Reasoning , author=. 2025 , eprint=

2025

[53] [53]

arXiv preprint arXiv:2503.08679 , year=

Chain-of-thought reasoning in the wild is not always faithful , author=. arXiv preprint arXiv:2503.08679 , year=

Pith/arXiv arXiv

[54] [54]

plausibility: On the (un) reliability of explanations from large language models , author=

Faithfulness vs. plausibility: On the (un) reliability of explanations from large language models , author=. arXiv preprint arXiv:2402.04614 , year=

arXiv

[55] [55]

Journal of Machine Learning Research , volume=

Causal abstraction: A theoretical foundation for mechanistic interpretability , author=. Journal of Machine Learning Research , volume=

[56] [56]

arXiv preprint arXiv:2307.13702 , year=

Measuring Faithfulness in Chain-of-Thought Reasoning , author=. arXiv preprint arXiv:2307.13702 , year=

Pith/arXiv arXiv

[57] [57]

2025 , eprint=

s1: Simple test-time scaling , author=. 2025 , eprint=

2025

[58] [58]

arXiv preprint arXiv:2310.02226 , year=

Think before you speak: Training Language Models With Pause Tokens , author=. arXiv preprint arXiv:2310.02226 , year=

arXiv

[59] [59]

arXiv preprint arXiv:2212.03827 , year=

Discovering Latent Knowledge in Language Models Without Supervision , author=. arXiv preprint arXiv:2212.03827 , year=

Pith/arXiv arXiv

[60] [60]

Nature Machine Intelligence , volume=

Stop explaining black box machine learning models for high stakes decisions and use interpretable models instead , author=. Nature Machine Intelligence , volume=. 2019 , publisher=

2019

[61] [61]

1996 , publisher=

Using Language , author=. 1996 , publisher=

1996

[62] [62]

International Journal of Human-Computer Studies , volume =

Lumsden, Karen and Moran, Alan , title =. International Journal of Human-Computer Studies , volume =. 2005 , doi =

2005

[63] [63]

Chi, Michelene T. H. , title =. Educational Psychologist , volume =. 2001 , doi =

2001

[64] [64]

Fonseca, Bea and Chi, Michelene T. H. , title =. Trends in Cognitive Sciences , volume =. 2009 , doi =

2009

[65] [65]

Advances in neural information processing systems , volume=

Chain-of-thought prompting elicits reasoning in large language models , author=. Advances in neural information processing systems , volume=

[66] [66]

2025 , eprint=

Are DeepSeek R1 And Other Reasoning Models More Faithful? , author=. 2025 , eprint=

2025

[67] [67]

Chain-of-Thought Is Not Explainability , publisher =

Barez, Fazl and Wu, Tung-Yu and Arcuschin, Iv\'. Chain-of-Thought Is Not Explainability , publisher =

[68] [68]

2023 , eprint=

A Survey on Evaluation of Large Language Models , author=. 2023 , eprint=

2023

[69] [69]

ArXiv , year=

Can LLMs Explain Themselves Counterfactually? , author=. ArXiv , year=

[70] [70]

ArXiv , year=

How Well Can Reasoning Models Identify and Recover from Unhelpful Thoughts? , author=. ArXiv , year=

[71] [71]

arXiv preprint arXiv:2406.10625 , year=

On the hardness of faithful chain-of-thought reasoning in large language models , author=. arXiv preprint arXiv:2406.10625 , year=

arXiv

[72] [72]

arXiv preprint arXiv:2405.04382 , year=

Large language models cannot explain themselves , author=. arXiv preprint arXiv:2405.04382 , year=

arXiv

[73] [73]

ArXiv , year=

MME-CoT: Benchmarking Chain-of-Thought in Large Multimodal Models for Reasoning Quality, Robustness, and Efficiency , author=. ArXiv , year=

[74] [74]

arXiv preprint arXiv:2405.15092 , year=

Dissociation of faithful and unfaithful reasoning in llms , author=. arXiv preprint arXiv:2405.15092 , year=

arXiv

[75] [75]

Advances in neural information processing systems , volume=

Large language models are zero-shot reasoners , author=. Advances in neural information processing systems , volume=

[76] [76]

2025 , eprint=

A Pragmatic Way to Measure Chain-of-Thought Monitorability , author=. 2025 , eprint=

2025

[77] [77]

2025 , eprint=

The Geometry of Reasoning: Flowing Logics in Representation Space , author=. 2025 , eprint=

2025

[78] [78]

ArXiv , year=

Evaluating Step-by-step Reasoning Traces: A Survey , author=. ArXiv , year=

[79] [79]

ICLR , year=

Emergent world representations: Exploring a sequence model trained on a synthetic task , author=. ICLR , year=

[80] [80]

Understanding intermediate layers using linear classifier probes , author=