arxiv: 2605.28600 · v1 · pith:EXOXARC3new · submitted 2026-05-27 · 💻 cs.LG

Transformers Provably Learn to Internalize Chain-of-Thought

Pith reviewed 2026-06-29 14:25 UTC · model grok-4.3

classification 💻 cs.LG
keywords: transformers · chain-of-thought · implicit CoT · parity learning · curriculum learning · sample complexity · multi-layer models · internalization

The pith

An L-layer transformer learns k-parity with polynomial samples by internalizing chain-of-thought via a logarithmic curriculum.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that transformers can absorb explicit intermediate reasoning steps into their hidden states during training, reaching the same polynomial sample efficiency as chain-of-thought prompting but without generating those steps at inference time. It introduces the Log-ICoT curriculum, which removes thinking tokens in geometric chunks rather than one at a time, so that an L-layer model with L equal to log base two of k succeeds after only log k training stages. This extends earlier single-layer parity results to deeper architectures while keeping the training sample count polynomial in the input length. A reader would care because the result points to models that reason efficiently in hidden states instead of paying the cost of long explicit outputs.

Core claim

An L-layer transformer trained under the proposed Log-ICoT curriculum learns k-parity with poly(n) samples and L = log_2 k training stages. This matches the sample efficiency of explicit CoT while eliminating its inference overhead, and extends prior one-layer parity guarantees to multi-layer architectures. Compared to standard ICoT, which removes thinking tokens one at a time, Log-ICoT removes them in geometric chunks, reducing the number of stages from linear in k to logarithmic.

What carries the argument

The Log-ICoT curriculum that removes thinking tokens in geometric chunks across log k stages, allowing progressive absorption of reasoning into deeper layers.
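To make the curriculum concrete, here is a minimal sketch of how one training sequence might change from stage to stage. It is an illustration under stated assumptions, not the authors' implementation: the function name `staged_example`, the adjacent-pair tree layout, and the rule "each stage drops the lowest remaining tree level" are hypothetical readings of "geometric chunks", and the secret set is assumed to have power-of-two size.

```python
def staged_example(bits, secret, stage):
    """Build one training sequence for a given curriculum stage.

    Assumptions (not taken from the paper): adjacent bits of the secret
    set are paired at each tree level, stage 1 keeps every intermediate
    level, and each later stage drops the lowest remaining level, so the
    final stage keeps only the answer.
    """
    level = [bits[i] for i in secret]
    levels = []
    while len(level) > 1:
        # XOR adjacent pairs to form the next level of the parity tree.
        level = [level[i] ^ level[i + 1] for i in range(0, len(level), 2)]
        levels.append(level)
    if not 1 <= stage <= len(levels):
        raise ValueError("stage must be between 1 and log2(k)")
    kept = levels[stage - 1:]                # drop the lowest (stage - 1) levels
    thinking = [b for lvl in kept[:-1] for b in lvl]
    answer = kept[-1][0]
    return list(bits) + thinking + [answer]


bits = [1, 0, 1, 1, 0, 0, 1, 0]              # n = 8
secret = [1, 3, 4, 6]                        # hypothetical secret set, k = 4
print(staged_example(bits, secret, stage=1)) # input bits, 2 thinking tokens, answer
print(staged_example(bits, secret, stage=2)) # input bits, answer only
```

With k = 4 the sketch has two stages, matching L = log_2 k; in the last stage the model must produce the answer directly from the input bits.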

If this is right

  • Multi-layer transformers achieve the same polynomial sample complexity for k-parity as explicit CoT without incurring inference-time generation cost.
  • The number of training stages needed scales as log k rather than linearly with k.
  • Reasoning is progressively folded into deeper layers as the curriculum advances.
  • The result extends one-layer theoretical guarantees to architectures with multiple layers.
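The stage counts in the second bullet can be checked with a small counting sketch. The accounting below is an assumption chosen to reproduce the numbers quoted on this page (k − 1 stages for standard ICoT, log_2 k for Log-ICoT): stage 1 trains on the full trace, the root of the parity tree is the answer and is never removed, and Log-ICoT removes one whole tree level per later stage. All function names are hypothetical.

```python
def parity_tree_levels(k):
    """Sizes of the parity-tree levels above the k leaves: k/2, k/4, ..., 1."""
    if k < 2 or k & (k - 1):
        raise ValueError("k must be a power of two, at least 2")
    sizes = []
    width = k // 2
    while width >= 1:
        sizes.append(width)
        width //= 2
    return sizes


def icot_stage_count(k):
    """Stages when the k - 2 intermediate tokens are removed one at a time."""
    intermediate = sum(parity_tree_levels(k)) - 1   # exclude the root (answer)
    return 1 + intermediate                          # plus the full-trace stage


def log_icot_chunks(k):
    """Chunk sizes removed at stages 2, 3, ...: one whole tree level each."""
    return parity_tree_levels(k)[:-1]


def log_icot_stage_count(k):
    return 1 + len(log_icot_chunks(k))


k = 16
print(log_icot_chunks(k))                            # [8, 4, 2]
print(icot_stage_count(k), log_icot_stage_count(k))  # 15 4
```

For k = 16 this gives 15 stages against 4, consistent with the four training stages reported for the 4-layer experiment in Figure 2.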

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same geometric removal schedule might allow internalization on reasoning tasks other than parity.
  • Models trained this way could operate with shorter output sequences at deployment while retaining the benefit of intermediate computation.
  • Varying the chunk size schedule could yield further reductions in the number of training stages.

Load-bearing premise

The transformer and its training dynamics must allow reasoning steps to be progressively absorbed into deeper layers when thinking tokens are removed in geometric chunks.

What would settle it

Train an L = log_2 k layer transformer on k-parity under the Log-ICoT schedule and observe whether it achieves poly(n) sample efficiency; failure to do so would falsify the claim.

Figures

Figures reproduced from arXiv: 2605.28600 by Hanlin Zhu, Jiantao Jiao, Somayeh Sojoudi, Song Mei, Stuart Russell, Yixiao Huang, Zixuan Wang.

Figure 1: Comparison of training paradigms on the k-parity task. (a) Explicit CoT supervises every node of the parity tree, achieving sample-efficient learning at the cost of Ω(k) sequential reasoning tokens at inference. (b) Implicit CoT (ICoT) [Deng et al., 2024] internalizes reasoning into hidden states by removing intermediate tokens one at a time, eliminating the inference cost but requiring k − 1 training stag…
Figure 2: Attention mask and training dynamics. Left: illustration of the customized attention mask, where each intermediate state x_m at CoT positions (m > n) only depends on tokens x_j up to the previous level, i.e., j ≤ n_{h[m]−1}. Right: validation loss of a 4-layer transformer trained under the Log-ICoT curriculum on the k-parity task (n = 30, k = 16); dashed vertical lines mark the four training stages. Error propa…
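The masking rule in the Figure 2 caption can be sketched as follows. This is a hedged reading of the truncated caption, not the paper's code: it assumes 1-indexed positions, level boundaries n_0 = n and n_h = n_{h−1} + k/2^h, and that h[m] is the tree level containing CoT position m. Rows for the input positions (m ≤ n) are not specified in the caption and are left out. The names `level_boundaries` and `cot_mask` are hypothetical.

```python
def level_boundaries(n, k):
    """Cumulative boundaries n_0 = n, n_h = n_{h-1} + k / 2**h (1-indexed)."""
    bounds = [n]
    width = k // 2
    while width >= 1:
        bounds.append(bounds[-1] + width)
        width //= 2
    return bounds


def cot_mask(n, k):
    """Boolean mask rows for the CoT positions m = n + 1, ..., n_L.

    mask[m][j - 1] is True when query position m may attend to key
    position j, i.e. when j <= n_{h[m]-1} (assumed reading of the caption).
    """
    bounds = level_boundaries(n, k)
    total = bounds[-1]
    mask = {}
    for h in range(1, len(bounds)):
        for m in range(bounds[h - 1] + 1, bounds[h] + 1):
            mask[m] = [j <= bounds[h - 1] for j in range(1, total + 1)]
    return mask


mask = cot_mask(n=8, k=4)
for m in sorted(mask):
    print(m, "".join("1" if allowed else "0" for allowed in mask[m]))
```

For n = 8 and k = 4, positions 9 and 10 (level 1) see only the eight input bits, while position 11 (level 2) also sees positions 9 and 10.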
Figure 3: Layer-wise attention maps of the trained 4-layer transformer at the final stage. Each panel shows the softmax attention weights at one layer, with rows indexing query positions and columns indexing key positions. Dotted gridlines mark the parity-tree level boundaries n_1, n_2, n_3. Red ticks on the x-axis mark the indices in the secret set S ⊂ [n]. Each layer’s attention concentrates sharply on the two childr…
Figure 4: Illustration of parity learning task with input length n = 8 and secret set size k = 4. Left: The task can be decomposed into a hierarchical two-parity computation. Right: Comparison of the training curriculum of Chain-of-Thought (CoT) and Implicit CoT (ICoT). Both methods initially leverage complete thinking traces derived from the hierarchical decomposition. As the ICoT curriculum progresses, these inter…
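The hierarchical two-parity decomposition described in the Figure 4 caption can be illustrated with a short sketch. The secret set below and the choice to pair adjacent selected bits are assumptions made for illustration; the caption does not fix either. The function name `parity_trace` is hypothetical.

```python
import random
from functools import reduce
from operator import xor


def parity_trace(bits, secret):
    """Return the levels of the hierarchical two-parity computation.

    bits: list of 0/1 inputs; secret: indices of the secret set S, whose
    size is assumed to be a power of two. Level 0 holds the selected
    bits; each higher level XORs adjacent pairs, so the last level holds
    the single parity bit.
    """
    level = [bits[i] for i in secret]
    levels = [level]
    while len(level) > 1:
        level = [level[i] ^ level[i + 1] for i in range(0, len(level), 2)]
        levels.append(level)
    return levels


rng = random.Random(0)
n, secret = 8, [1, 3, 4, 6]          # hypothetical secret set with k = 4
bits = [rng.randint(0, 1) for _ in range(n)]
levels = parity_trace(bits, secret)

# The root of the tree equals the direct k-parity of the selected bits.
assert levels[-1][0] == reduce(xor, (bits[i] for i in secret))
print(bits, levels)
```

The intermediate levels play the role of the thinking trace: explicit CoT supervises all of them, while an ICoT-style curriculum gradually stops showing them.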
Original abstract

Chain-of-Thought (CoT) prompting substantially improves the sample efficiency of transformers, reducing the complexity of tasks like parity learning from exponential to polynomial in the input length. However, generating explicit reasoning steps at inference is computationally expensive. Implicit Chain-of-Thought (ICoT) has emerged as a promising empirical remedy that trains models to internalize intermediate steps within their hidden states, but its theoretical foundations remain poorly understood. We give the first theoretical analysis of ICoT, proving that an $L$-layer transformer trained under our proposed Log-ICoT curriculum learns $k$-parity with $\mathsf{poly}(n)$ samples and $L = \log_2 k$ training stages. This matches the sample efficiency of explicit CoT while eliminating its inference overhead, and extends prior one-layer parity guarantees to multi-layer architectures. Compared to standard ICoT, which removes thinking tokens one at a time, Log-ICoT removes them in geometric chunks, reducing the number of stages from linear in $k$ to logarithmic. Experiments on multi-layer transformers confirm the theory and visualize how reasoning is progressively absorbed into deeper layers.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper claims to give the first theoretical analysis of Implicit Chain-of-Thought (ICoT), proving that an L-layer transformer trained under the proposed Log-ICoT curriculum (removing thinking tokens in geometric chunks over log_2 k stages) learns k-parity with poly(n) samples using L = log_2 k training stages. This is asserted to match the sample efficiency of explicit CoT while eliminating inference overhead and to extend prior one-layer parity guarantees to multi-layer architectures. Experiments are said to confirm the result and visualize progressive absorption of reasoning into deeper layers.

Significance. If the central theorem is correct, the work would supply the first rigorous account of how transformers can internalize CoT steps, with the logarithmic-stage curriculum offering a concrete improvement over one-at-a-time removal. The poly(n) sample bound and the explicit link to multi-layer architectures would be notable strengths, as would any reproducible experimental visualization of layer-wise absorption.

major comments (2)
  1. [Main theorem and its proof (abstract and §3–4)] The central claim that gradient descent on an L-layer transformer under Log-ICoT produces progressive absorption of reasoning chunks into successively deeper layers (thereby closing the inductive step from the one-layer case) is load-bearing yet rests on uncharacterized training dynamics. No architectural assumptions (e.g., on residual connections, attention heads, or the optimizer) or inductive argument are supplied that guarantee absorption occurs once intermediate tokens are removed.
  2. [Theorem 1 (or equivalent main result statement)] The theorem statement asserts a poly(n) sample bound but supplies neither explicit assumptions, dependence on k, nor error bounds. Without these, it is impossible to verify whether the claimed complexity is robust or whether post-hoc choices in the curriculum or architecture affect the polynomial degree.
minor comments (1)
  1. The abstract refers to experiments confirming the theory and visualizing absorption, but the summary provides no numbered figures, tables, or specific metrics (e.g., accuracy vs. stage). Ensure all experimental claims are cross-referenced to concrete results.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the careful reading and constructive feedback. We address the two major comments below.

Point-by-point responses
  1. Referee: [Main theorem and its proof (abstract and §3–4)] The central claim that gradient descent on an L-layer transformer under Log-ICoT produces progressive absorption of reasoning chunks into successively deeper layers (thereby closing the inductive step from the one-layer case) is load-bearing yet rests on uncharacterized training dynamics. No architectural assumptions (e.g., on residual connections, attention heads, or the optimizer) or inductive argument are supplied that guarantee absorption occurs once intermediate tokens are removed.

    Authors: Sections 3 and 4 contain an inductive argument that extends the one-layer parity analysis to the multi-layer setting under the Log-ICoT curriculum. The architecture is the standard L-layer transformer with residual connections and multi-head attention as defined in Section 2; the optimizer is gradient descent with the learning rate schedule stated in the proof. The geometric chunk removal is shown to produce the required absorption via a layer-wise analysis of the attention and feed-forward updates. We agree that the assumptions and inductive step can be stated more explicitly and will add a dedicated paragraph listing them together with a highlighted inductive step in the revision. revision: yes

  2. Referee: [Theorem 1 (or equivalent main result statement)] The theorem statement asserts a poly(n) sample bound but supplies neither explicit assumptions, dependence on k, nor error bounds. Without these, it is impossible to verify whether the claimed complexity is robust or whether post-hoc choices in the curriculum or architecture affect the polynomial degree.

    Authors: Theorem 1 states that under the Log-ICoT curriculum with L = log_2 k stages the sample complexity is poly(n) with degree independent of k and failure probability 1/poly(n). The assumptions are those of the standard transformer architecture and the curriculum defined in Section 2. We will revise the theorem statement to list the assumptions, the explicit logarithmic dependence on k through the number of stages, and the error bound in a single displayed line for clarity. revision: yes

Circularity Check

0 steps flagged

No circularity; theoretical proof extends prior results without definitional reduction or load-bearing self-citation

full rationale

The provided abstract and description present the central result as a mathematical proof that an L-layer transformer under the Log-ICoT curriculum learns k-parity with poly(n) samples. No equations, fitted parameters, or self-citations are quoted that reduce any prediction or uniqueness claim to its own inputs by construction. The extension of one-layer parity guarantees is stated as an independent theoretical step rather than a renaming or ansatz imported from overlapping prior work. The derivation chain therefore remains self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

This abstract-only review identifies no explicit free parameters, axioms, or invented entities. The result is framed as a proof, so any unstated assumptions are presumably those standard in transformer learning theory (e.g., the architecture definition and gradient-based training).



Reference graph

Works this paper leans on

64 extracted references · 39 canonical work pages · 12 internal anchors

  3. [3]

    Birth of a transformer: A memory viewpoint

    Alberto Bietti, Vivien Cabannes, Diane Bouchacourt, Herve Jegou, and Leon Bottou. Birth of a transformer: A memory viewpoint. Advances in Neural Information Processing Systems, 36: 1560--1588, 2023

  4. [4]

    Transformers learn through gradual rank increase

    Enric Boix-Adsera, Etai Littwin, Emmanuel Abbe, Samy Bengio, and Joshua Susskind. Transformers learn through gradual rank increase. Advances in Neural Information Processing Systems, 36: 24519--24551, 2023

  5. [5]

    Distributional associations vs in-context reasoning: A study of feed-forward and attention layers

    Lei Chen, Joan Bruna, and Alberto Bietti. Distributional associations vs in-context reasoning: A study of feed-forward and attention layers. arXiv preprint arXiv:2406.03068, 2024

  6. [6]

    Implicit chain of thought reasoning via knowledge distillation

    Yuntian Deng, Kiran Prasad, Roland Fernandez, Paul Smolensky, Vishrav Chaudhary, and Stuart Shieber. Implicit chain of thought reasoning via knowledge distillation. arXiv preprint arXiv:2311.01460, 2023

  7. [7]

    From Explicit CoT to Implicit CoT: Learning to Internalize CoT Step by Step

    Yuntian Deng, Yejin Choi, and Stuart Shieber. From explicit cot to implicit cot: Learning to internalize cot step by step. arXiv preprint arXiv:2405.14838, 2024

  8. [8]

    A mathematical framework for transformer circuits

    Nelson Elhage, Neel Nanda, Catherine Olsson, Tom Henighan, Nicholas Joseph, Ben Mann, Amanda Askell, Yuntao Bai, Anna Chen, Tom Conerly, Nova DasSarma, Dawn Drain, Deep Ganguli, Zac Hatfield-Dodds, Danny Hernandez, Andy Jones, Jackson Kernion, Liane Lovitt, Kamal Ndousse, Dario Amodei, Tom Brown, Jack Clark, Jared Kaplan, Sam McCandlish, and Chris Olah. A...

  9. [9]

    Towards revealing the mystery behind chain of thought: a theoretical perspective

    Guhao Feng, Bohang Zhang, Yuntian Gu, Haotian Ye, Di He, and Liwei Wang. Towards revealing the mystery behind chain of thought: a theoretical perspective. Advances in Neural Information Processing Systems, 36: 70757--70798, 2023

  10. [10]

    What can a single attention layer learn? a study through the random features lens

    Hengyu Fu, Tianyu Guo, Yu Bai, and Song Mei. What can a single attention layer learn? a study through the random features lens. Advances in Neural Information Processing Systems, 36: 11912--11951, 2023

  11. [11]

    Think before you speak: Training language models with pause tokens

    Sachin Goyal, Ziwei Ji, Ankit Singh Rawat, Aditya Krishna Menon, Sanjiv Kumar, and Vaishnavh Nagarajan. Think before you speak: Training language models with pause tokens. arXiv preprint arXiv:2310.02226, 2023

  12. [12]

    Continuous chain of thought enables parallel exploration and reasoning

    Halil Alperen Gozeten, M Emrullah Ildiz, Xuechen Zhang, Hrayr Harutyunyan, Ankit Singh Rawat, and Samet Oymak. Continuous chain of thought enables parallel exploration and reasoning. arXiv preprint arXiv:2505.23648, 2025

  13. [13]

    Active-dormant attention heads: Mechanistically demystifying extreme-token phenomena in llms

    Tianyu Guo, Druv Pai, Yu Bai, Jiantao Jiao, Michael I Jordan, and Song Mei. Active-dormant attention heads: Mechanistically demystifying extreme-token phenomena in llms. arXiv preprint arXiv:2410.13835, 2024

  14. [14]

    How do llms perform two-hop reasoning in context?

    Tianyu Guo, Hanlin Zhu, Ruiqi Zhang, Jiantao Jiao, Song Mei, Michael I Jordan, and Stuart Russell. How do llms perform two-hop reasoning in context? arXiv preprint arXiv:2502.13913, 2025

  15. [15]

    Training Large Language Models to Reason in a Continuous Latent Space

    Shibo Hao, Sainbayar Sukhbaatar, DiJia Su, Xian Li, Zhiting Hu, Jason Weston, and Yuandong Tian. Training large language models to reason in a continuous latent space. arXiv preprint arXiv:2412.06769, 2024

  16. [16]

    Transformers learn to implement multi-step gradient descent with chain of thought

    Jianhao Huang, Zixuan Wang, and Jason D Lee. Transformers learn to implement multi-step gradient descent with chain of thought. arXiv preprint arXiv:2502.21212, 2025a

  17. [17]

    Generalization or hallucination? understanding out-of-context reasoning in transformers

    Yixiao Huang, Hanlin Zhu, Tianyu Guo, Jiantao Jiao, Somayeh Sojoudi, Michael I Jordan, Stuart Russell, and Song Mei. Generalization or hallucination? understanding out-of-context reasoning in transformers. arXiv preprint arXiv:2506.10887, 2025b

  18. [18]

    In-context convergence of transformers

    Yu Huang, Yuan Cheng, and Yingbin Liang. In-context convergence of transformers. In Proceedings of the 41st International Conference on Machine Learning, pages 19660--19722, 2024

  19. [19]

    Reinforcement Learning via Self-Distillation

    Jonas Hübotter, Frederike Lübeck, Lejs Behric, Anton Baumann, Marco Bagatella, Daniel Marta, Ido Hakimi, Idan Shenfeld, Thomas Kleine Buening, Carlos Guestrin, et al. Reinforcement learning via self-distillation. arXiv preprint arXiv:2601.20802, 2026

  20. [20]

    From self-attention to markov models: Unveiling the dynamics of generative transformers

    M Emrullah Ildiz, Yixiao Huang, Yingcong Li, Ankit Singh Rawat, and Samet Oymak. From self-attention to markov models: Unveiling the dynamics of generative transformers. arXiv preprint arXiv:2402.13512, 2024

  21. [21]

    Quantization and training of neural networks for efficient integer-arithmetic-only inference

    Benoit Jacob, Skirmantas Kligys, Bo Chen, Menglong Zhu, Matthew Tang, Andrew Howard, Hartwig Adam, and Dmitry Kalenichenko. Quantization and training of neural networks for efficient integer-arithmetic-only inference. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 2704--2713, 2018

  22. [22]

    Vision transformers provably learn spatial structure

    Samy Jelassi, Michael Sander, and Yuanzhi Li. Vision transformers provably learn spatial structure. Advances in Neural Information Processing Systems, 35: 37822--37836, 2022

  23. [23]

    Decomposed Prompting: A Modular Approach for Solving Complex Tasks

    Tushar Khot, Harsh Trivedi, Matthew Finlayson, Yao Fu, Kyle Richardson, Peter Clark, and Ashish Sabharwal. Decomposed prompting: A modular approach for solving complex tasks. arXiv preprint arXiv:2210.02406, 2022

  24. [24]

    Transformers provably solve parity efficiently with chain of thought

    Juno Kim and Taiji Suzuki. Transformers provably solve parity efficiently with chain of thought. arXiv preprint arXiv:2410.08633, 2024

  25. [25]

    Juncai Li, Ru Li, Yuxiang Zhou, and Jeff Z. Pan. Dissecting implicit chain of thought: Can transformers learn it spontaneously?, 2026. URL https://openreview.net/forum?id=loP9q6E5kQ

  26. [26]

    Mechanics of next token prediction with self-attention

    Yingcong Li, Yixiao Huang, Muhammed E Ildiz, Ankit Singh Rawat, and Samet Oymak. Mechanics of next token prediction with self-attention. In International Conference on Artificial Intelligence and Statistics, pages 685--693. PMLR, 2024a

  27. [27]

    Chain of thought empowers transformers to solve inherently serial problems

    Zhiyuan Li, Hong Liu, Denny Zhou, and Tengyu Ma. Chain of thought empowers transformers to solve inherently serial problems. arXiv preprint arXiv:2402.12875, 1, 2024b

  28. [28]

    Transformers learn shortcuts to automata

    Bingbin Liu, Jordan T Ash, Surbhi Goel, Akshay Krishnamurthy, and Cyril Zhang. Transformers learn shortcuts to automata. arXiv preprint arXiv:2210.10749, 2022

  29. [29]

    Pause tokens strictly increase the expressivity of constant-depth transformers

    Charles London and Varun Kanade. Pause tokens strictly increase the expressivity of constant-depth transformers. arXiv preprint arXiv:2505.21024, 2025

  30. [30]

    Breaking the Reversal Curse in Autoregressive Language Models via Identity Bridge

    Xutao Ma, Yixiao Huang, Hanlin Zhu, and Somayeh Sojoudi. Breaking the reversal curse in autoregressive language models via identity bridge. arXiv preprint arXiv:2602.02470, 2026

  31. [31]

    One step of gradient descent is provably the optimal in-context learner with one layer of linear self-attention

    Arvind Mahankali, Tatsunori B Hashimoto, and Tengyu Ma. One step of gradient descent is provably the optimal in-context learner with one layer of linear self-attention. arXiv preprint arXiv:2307.03576, 2023

  32. [32]

    Exact expressive power of transformers with padding

    Will Merrill and Ashish Sabharwal. Exact expressive power of transformers with padding. Advances in Neural Information Processing Systems, 38: 112497--112524, 2026

  33. [33]

    The expressive power of transformers with chain of thought

    William Merrill and Ashish Sabharwal. The expressive power of transformers with chain of thought. arXiv preprint arXiv:2310.07923, 2023

  34. [34]

    How transformers learn causal structure with gradient descent

    Eshaan Nichani, Alex Damian, and Jason D Lee. How transformers learn causal structure with gradient descent. In International Conference on Machine Learning, pages 38018--38070. PMLR, 2024

  35. [35]

    Stabilizing transformers for reinforcement learning

    Emilio Parisotto, Francis Song, Jack Rae, Razvan Pascanu, Caglar Gulcehre, Siddhant Jayakumar, Max Jaderberg, Raphael Lopez Kaufman, Aidan Clark, Seb Noury, et al. Stabilizing transformers for reinforcement learning. In International conference on machine learning, pages 7487--7498. PMLR, 2020

  36. [36]

    Let's think dot by dot: Hidden computation in transformer language models

    Jacob Pfau, William Merrill, and Samuel R Bowman. Let's think dot by dot: Hidden computation in transformer language models. arXiv preprint arXiv:2404.15758, 2024

  37. [37]

    Learning and transferring sparse contextual bigrams with linear transformers

    Yunwei Ren, Zixuan Wang, and Jason D Lee. Learning and transferring sparse contextual bigrams with linear transformers. In The Thirty-eighth Annual Conference on Neural Information Processing Systems, 2024

  38. [38]

    Failures of gradient-based deep learning

    Shai Shalev-Shwartz, Ohad Shamir, and Shaked Shammah. Failures of gradient-based deep learning. In International Conference on Machine Learning, pages 3067--3075. PMLR, 2017

  39. [39]

    DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models

    Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, YK Li, Y Wu, et al. Deepseekmath: Pushing the limits of mathematical reasoning in open language models. arXiv preprint arXiv:2402.03300, 2024

  40. [40]

    Highway Networks

    Rupesh Kumar Srivastava, Klaus Greff, and Jürgen Schmidhuber. Highway networks. arXiv preprint arXiv:1505.00387, 2015

  41. [41]

    Token assorted: Mixing latent and text tokens for improved language model reasoning

    DiJia Su, Hanlin Zhu, Yingchen Xu, Jiantao Jiao, Yuandong Tian, and Qinqing Zheng. Token assorted: Mixing latent and text tokens for improved language model reasoning. arXiv preprint arXiv:2502.03275, 2025

  42. [42]

    Scan and snap: Understanding training dynamics and token composition in 1-layer transformer

    Yuandong Tian, Yiping Wang, Beidi Chen, and Simon S Du. Scan and snap: Understanding training dynamics and token composition in 1-layer transformer. Advances in neural information processing systems, 36: 71911--71947, 2023a

  43. [43]

    Joma: Demystifying multilayer transformers via joint dynamics of mlp and attention

    Yuandong Tian, Yiping Wang, Zhenyu Zhang, Beidi Chen, and Simon Du. Joma: Demystifying multilayer transformers via joint dynamics of mlp and attention. arXiv preprint arXiv:2310.00535, 2023b

  44. [44]

    Math-Shepherd: Verify and Reinforce LLMs Step-by-step without Human Annotations

    Peiyi Wang, Lei Li, Zhihong Shao, RX Xu, Damai Dai, Yifei Li, Deli Chen, Yu Wu, and Zhifang Sui. Math-shepherd: Verify and reinforce llms step-by-step without human annotations. arXiv preprint arXiv:2312.08935, 2023a

  45. [45]

    Guiding language model reasoning with planning tokens

    Xinyi Wang, Lucas Caccia, Oleksiy Ostapenko, Xingdi Yuan, William Yang Wang, and Alessandro Sordoni. Guiding language model reasoning with planning tokens. arXiv preprint arXiv:2310.05707, 2023b

  46. [46]

    Zixuan Wang, Stanley Wei, Daniel Hsu, and Jason D. Lee. Transformers provably learn sparse token selection while fully-connected nets cannot. In ICML, 2024

  47. [47]

    Learning compositional functions with transformers from easy-to-hard data

    Zixuan Wang, Eshaan Nichani, Alberto Bietti, Alex Damian, Daniel Hsu, Jason D Lee, and Denny Wu. Learning compositional functions with transformers from easy-to-hard data. arXiv preprint arXiv:2505.23683, 2025

  48. [48]

    Chain-of-thought prompting elicits reasoning in large language models

    Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Fei Xia, Ed Chi, Quoc V Le, Denny Zhou, et al. Chain-of-thought prompting elicits reasoning in large language models. Advances in neural information processing systems, 35: 24824--24837, 2022

  49. [49]

    From sparse dependence to sparse attention: unveiling how chain-of-thought enhances transformer sample efficiency

    Kaiyue Wen, Huaqing Zhang, Hongzhou Lin, and Jingzhao Zhang. From sparse dependence to sparse attention: unveiling how chain-of-thought enhances transformer sample efficiency. arXiv preprint arXiv:2410.05459, 2024

  50. [50]

    Sub-task decomposition enables learning in sequence to sequence tasks

    Noam Wies, Yoav Levine, and Amnon Shashua. Sub-task decomposition enables learning in sequence to sequence tasks. arXiv preprint arXiv:2204.02892, 2022

  51. [51]

    Integer quantization for deep learning inference: Principles and empirical evaluation

    Hao Wu, Patrick Judd, Xiaojie Zhang, Mikhail Isaev, and Paulius Micikevicius. Integer quantization for deep learning inference: Principles and empirical evaluation. arXiv preprint arXiv:2004.09602, 2020

  52. [52]

    Residual: Transformer with dual residual connections

    Shufang Xie, Huishuai Zhang, Junliang Guo, Xu Tan, Jiang Bian, Hany Hassan Awadalla, Arul Menezes, Tao Qin, and Rui Yan. Residual: Transformer with dual residual connections. arXiv preprint arXiv:2304.14802, 2023

  53. [53]

    MetaMath: Bootstrap Your Own Mathematical Questions for Large Language Models

    Longhui Yu, Weisen Jiang, Han Shi, Jincheng Yu, Zhengying Liu, Yu Zhang, James T Kwok, Zhenguo Li, Adrian Weller, and Weiyang Liu. Metamath: Bootstrap your own mathematical questions for large language models. arXiv preprint arXiv:2309.12284, 2023

  54. [54]

    MAmmoTH: Building Math Generalist Models through Hybrid Instruction Tuning

    Xiang Yue, Xingwei Qu, Ge Zhang, Yao Fu, Wenhao Huang, Huan Sun, Yu Su, and Wenhu Chen. Mammoth: Building math generalist models through hybrid instruction tuning. arXiv preprint arXiv:2309.05653, 2023

  55. [55]

    Trained transformers learn linear models in-context

    Ruiqi Zhang, Spencer Frei, and Peter L Bartlett. Trained transformers learn linear models in-context. Journal of Machine Learning Research, 25(49): 1--55, 2024

  56. [56]

    Self-Distilled Reasoner: On-Policy Self-Distillation for Large Language Models

    Siyan Zhao, Zhihui Xie, Mengchen Liu, Jing Huang, Guan Pang, Feiyu Chen, and Aditya Grover. Self-distilled reasoner: On-policy self-distillation for large language models. arXiv preprint arXiv:2601.18734, 2026

  57. [57]

    Least-to-Most Prompting Enables Complex Reasoning in Large Language Models

    Denny Zhou, Nathanael Schärli, Le Hou, Jason Wei, Nathan Scales, Xuezhi Wang, Dale Schuurmans, Claire Cui, Olivier Bousquet, Quoc Le, et al. Least-to-most prompting enables complex reasoning in large language models. arXiv preprint arXiv:2205.10625, 2022

  58. [58]

    Hyper-connections

    Defa Zhu, Hongzhi Huang, Zihao Huang, Yutao Zeng, Yunyao Mao, Banggu Wu, Qiyang Min, and Xun Zhou. Hyper-connections. arXiv preprint arXiv:2409.19606, 2024a

  59. [59]

    Towards a theoretical understanding of the 'reversal curse' via training dynamics

    Hanlin Zhu, Baihe Huang, Shaolun Zhang, Michael Jordan, Jiantao Jiao, Yuandong Tian, and Stuart J Russell. Towards a theoretical understanding of the 'reversal curse' via training dynamics. Advances in Neural Information Processing Systems, 37: 90473--90513, 2024b

  60. [60]

    Emergence of superposition: Unveiling the training dynamics of chain of continuous thought

    Hanlin Zhu, Shibo Hao, Zhiting Hu, Jiantao Jiao, Stuart Russell, and Yuandong Tian. Emergence of superposition: Unveiling the training dynamics of chain of continuous thought. arXiv preprint arXiv:2509.23365, 2025a

  61. [61]

    Reasoning by superposition: A theoretical perspective on chain of continuous thought

    Hanlin Zhu, Shibo Hao, Zhiting Hu, Jiantao Jiao, Stuart Russell, and Yuandong Tian. Reasoning by superposition: A theoretical perspective on chain of continuous thought. arXiv preprint arXiv:2505.12514, 2025b
