arxiv: 2605.07123 · v1 · submitted 2026-05-08 · 💻 cs.LG

Convergence and Emergence of In-Context Reinforcement Learning with Chain of Thought

Pith reviewed 2026-05-11 01:11 UTC · model grok-4.3

classification 💻 cs.LG
keywords in-context reinforcement learning · chain of thought · temporal difference learning · linear transformer · policy evaluation · pretraining loss · convergence analysis · finite sample bounds

The pith

With the right parameters, chain-of-thought generation in a linear transformer performs repeated temporal difference learning updates.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes a direct mathematical connection between chain-of-thought generation and reinforcement learning inside linear transformers. In a policy evaluation setup, it shows that specific fixed transformer parameters make each generated thought token equivalent to one temporal difference learning update. The evaluation error then shrinks geometrically with each additional thought step until it reaches a statistical floor set by the context length. The same parameters are proven to be a global minimizer of the pretraining loss, which offers an explanation for why they emerge during training.

Core claim

In a policy evaluation setup with a linear Transformer, the CoT generation process with specific parameters is equivalent to repeatedly executing temporal difference learning updates. The policy evaluation error decreases geometrically with CoT length and eventually saturates at a statistical floor determined by the context length. The desired Transformer parameters are a global minimizer of the pretraining loss.

What carries the argument

Linear Transformer parameters that equate chain-of-thought token generation to iterative temporal difference learning updates.
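
For orientation, the update that each thought step is claimed to reproduce can be written in its textbook form. The block below is a sketch of standard TD(0) with linear function approximation, not the paper's Transformer parametrization; the symbols $w_k$, $\phi$, $\alpha$, $\gamma$ and the context size $n$ are notation assumed here rather than taken from the paper.

```latex
% Textbook TD(0) with a linear value estimate v_w(s) = \phi(s)^\top w (assumed notation).
% Update from a single transition (s, r, s'):
w_{k+1} = w_k + \alpha \bigl( r + \gamma\, \phi(s')^\top w_k - \phi(s)^\top w_k \bigr) \phi(s)

% Update averaged over a context of n transitions (s_i, r_i, s_i'):
w_{k+1} = w_k + \frac{\alpha}{n} \sum_{i=1}^{n}
          \bigl( r_i + \gamma\, \phi(s_i')^\top w_k - \phi(s_i)^\top w_k \bigr) \phi(s_i)
```

Under the claimed equivalence, the index $k$ would count generated thought tokens.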

If this is right

  • In-context adaptation to new tasks occurs by executing internal TD updates without any weight changes.
  • Longer chain-of-thought sequences improve policy evaluation accuracy at a geometric rate until context length caps the gain.
  • Pretraining loss minimization naturally produces parameters that enable this in-context RL behavior.
  • Explicit finite-sample bounds quantify how quickly evaluation error converges with additional thought steps.
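
The geometric decay and the floor can be illustrated with plain TD(0), independent of any Transformer. The following is a minimal sketch under stated assumptions: a hypothetical three-state deterministic cycle, one-hot features, and hand-picked noisy rewards. The names `td_step`, `evaluate`, `sup_dist`, and `CONTEXT` are illustrative and do not come from the paper.

```python
# Minimal sketch (assumptions: toy 3-state cycle, one-hot features, made-up rewards).
GAMMA = 0.5                      # discount factor
ALPHA = 1.0                      # step size
NEXT_STATE = [1, 2, 0]           # deterministic cycle 0 -> 1 -> 2 -> 0
TRUE_REWARD = [1.0, 0.0, 2.0]    # true mean reward of each state

# Fixed context of (state, observed reward, next state); rewards are noisy.
CONTEXT = [
    (0, 1.3, 1), (1, -0.2, 2), (2, 2.1, 0),
    (0, 0.9, 1), (1, 0.1, 2), (2, 2.3, 0),
]


def td_step(w, context, alpha=ALPHA, gamma=GAMMA):
    """Apply one TD(0) update averaged over all transitions in the context."""
    new_w = list(w)
    for s, r, s_next in context:
        td_error = r + gamma * w[s_next] - w[s]
        new_w[s] += alpha * td_error / len(context)
    return new_w


def evaluate(rewards, gamma=GAMMA, sweeps=200):
    """Solve v = r + gamma * P v on the cycle by fixed-point iteration."""
    v = [0.0] * len(rewards)
    for _ in range(sweeps):
        v = [rewards[s] + gamma * v[NEXT_STATE[s]] for s in range(len(rewards))]
    return v


def sup_dist(a, b):
    """Sup-norm distance between two value vectors."""
    return max(abs(x - y) for x, y in zip(a, b))


# Per-state average of the rewards observed in the context.
visits = [[r for s, r, _ in CONTEXT if s == state] for state in range(3)]
mean_reward = [sum(rs) / len(rs) for rs in visits]

v_true = evaluate(TRUE_REWARD)   # target of policy evaluation
w_star = evaluate(mean_reward)   # fixed point of TD on this particular context

w = [0.0, 0.0, 0.0]
err_fixed, err_true = [], []     # distance to w_star and to v_true before each step
for _ in range(60):
    err_fixed.append(sup_dist(w, w_star))
    err_true.append(sup_dist(w, v_true))
    w = td_step(w, CONTEXT)

for k in (0, 10, 30, 59):
    print(k, round(err_fixed[k], 6), round(err_true[k], 6))
```

In this toy setting the distance to `w_star` shrinks by at least a factor of 5/6 per step, while the distance to `v_true` levels off near 0.27, the gap caused by the noisy rewards in the six-transition context. More update steps cannot reduce that gap; a longer context would typically move `w_star` closer to `v_true`.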

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same internal update mechanism may explain why chain-of-thought improves performance on planning and reasoning tasks outside reinforcement learning.
  • Approximate versions of the equivalence could appear in nonlinear transformers used in practice, offering a testable prediction for model inspection.
  • Pretraining objectives could be modified to encourage parameters that support more iterations or faster internal convergence.
  • One could verify the claim by extracting intermediate activations during chain-of-thought and checking whether they match TD value estimates.

Load-bearing premise

The proof assumes a linear attention mechanism and restricts the task to policy evaluation rather than full policy optimization.

What would settle it

Train a linear transformer to the claimed global minimizer of the pretraining loss, then compare its actual chain-of-thought outputs token by token against the sequence of temporal difference updates on held-out tasks.
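
A comparison of this kind could be scripted roughly as follows. This is only a sketch: it assumes the value estimates decoded from each thought token and the reference TD iterates are already available as lists of equal-length vectors. The function names `stepwise_gap` and `first_divergence`, the tolerance, and the sample data are all hypothetical.

```python
def stepwise_gap(cot_estimates, td_estimates):
    """Return the largest absolute gap between the two estimates at each step."""
    if len(cot_estimates) != len(td_estimates):
        raise ValueError("both sequences must contain the same number of steps")
    return [
        max(abs(c - t) for c, t in zip(cot, td))
        for cot, td in zip(cot_estimates, td_estimates)
    ]


def first_divergence(cot_estimates, td_estimates, tol=1e-6):
    """Return the first step whose gap exceeds tol, or None if every step matches."""
    for step, gap in enumerate(stepwise_gap(cot_estimates, td_estimates)):
        if gap > tol:
            return step
    return None


# Placeholder data: three steps of two-state value estimates (made up).
td_iterates = [[0.0, 0.0], [0.5, 0.25], [0.75, 0.5]]
cot_outputs = [[0.0, 0.0], [0.5, 0.25], [0.75, 0.375]]

gaps = stepwise_gap(cot_outputs, td_iterates)
diverged_at = first_divergence(cot_outputs, td_iterates)
print(gaps, diverged_at)
```

An exact equivalence predicts `None` up to numerical tolerance. A gap that grows systematically with the step index would instead suggest parameters that only approximate the TD operator.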

Figures

Figures reproduced from arXiv: 2605.07123 by Rohan Chandra, Shangtong Zhang, Xinyu Liu, Zixuan Xie.

Figure 1. Learned parameters and element-wise learning progress. In (a), …
Figure 2. Boyan’s chain topology with nonzero transitions. Adapted from …

original abstract

In-context reinforcement learning (ICRL) refers to the ability of RL agents to adapt to new tasks at inference time without parameter updates by conditioning on additional context. Recent empirical studies further demonstrate that Chain-of-Thought (CoT) generation can amplify this ICRL capability. This paper is the first to provide a theoretical understanding on how CoT interacts with ICRL. We conduct our analysis in a policy evaluation setup with linear Transformer. We prove that with specific Transformer parameters, the CoT generation process is equivalent to repeatedly executing temporal difference learning updates. Additionally, we provide finite sample convergence analysis showing that the policy evaluation error decreases geometrically with CoT length and eventually saturates at a statistical floor determined by the context length. We also prove that the desired Transformer parameters are a global minimizer of the pretraining loss, providing a theoretical understanding on the empirical emergence of those parameters.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper analyzes in-context reinforcement learning (ICRL) with Chain-of-Thought (CoT) in a linear Transformer under a policy-evaluation MDP. It proves that specific Transformer parameters make the CoT generation process algebraically equivalent to repeated temporal-difference (TD) updates, establishes finite-sample bounds showing that the policy-evaluation error contracts geometrically with CoT length before saturating at a statistical floor set by context length, and shows that these parameters are a global minimizer of the pretraining loss.

Significance. If the derivations hold, the work supplies the first rigorous account of how CoT length controls ICRL performance in the linear case and links the emergence of effective parameters directly to pretraining-loss minimization. The explicit equivalence and geometric convergence results are concrete strengths that could guide the choice of CoT length in practice; the global-minimizer argument is especially useful because it explains why the required parameters arise without hand-tuning.

major comments (2)
  1. [§3] §3 (equivalence theorem): the algebraic identity between linear-attention CoT and the TD operator is derived from the closed-form attention update and holds only for linear attention in pure policy evaluation; once softmax nonlinearity or policy improvement is introduced the identity ceases to hold, yet the paper invokes the result to explain empirical emergence in practical (nonlinear, full-RL) models without supplying an approximation or robustness argument.
  2. [§4] §4 (finite-sample convergence): the geometric rate and saturation floor are stated with respect to the linear-Transformer forward pass; the proof sketch relies on the contraction property of the TD operator being preserved exactly by the attention matrix, but the finite-sample bound does not quantify the additional error introduced when the learned attention matrix deviates from the exact TD operator during pretraining.
minor comments (2)
  1. [Notation] The notation for the linear attention matrix and the TD target should be unified across the equivalence and convergence sections to avoid reader confusion.
  2. [Figure 2] Figure 2 (convergence curves) would benefit from an additional panel showing the dependence on context length, as the statistical floor is a central claim.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed and constructive comments. We respond to each major comment below and indicate the revisions we will make.

point-by-point responses
  1. Referee: [§3] §3 (equivalence theorem): the algebraic identity between linear-attention CoT and the TD operator is derived from the closed-form attention update and holds only for linear attention in pure policy evaluation; once softmax nonlinearity or policy improvement is introduced the identity ceases to hold, yet the paper invokes the result to explain empirical emergence in practical (nonlinear, full-RL) models without supplying an approximation or robustness argument.

    Authors: The manuscript restricts its claims to linear Transformers under policy evaluation, as stated in the abstract and Section 2. The equivalence is exact only in this setting. The paper does not assert that the algebraic identity carries over to softmax attention or policy improvement; it presents the linear case as a rigorous foundation that can help interpret broader empirical phenomena. We will add explicit scope statements in the introduction and a limitations paragraph in the discussion section to prevent over-interpretation, while leaving approximation arguments for nonlinear cases to future work. revision: partial

  2. Referee: [§4] §4 (finite-sample convergence): the geometric rate and saturation floor are stated with respect to the linear-Transformer forward pass; the proof sketch relies on the contraction property of the TD operator being preserved exactly by the attention matrix, but the finite-sample bound does not quantify the additional error introduced when the learned attention matrix deviates from the exact TD operator during pretraining.

    Authors: Section 4 derives the geometric bound under the assumption that the attention parameters exactly match the TD operator, which is justified by the global-minimizer result of Section 5. We agree that the current statement does not quantify the effect of finite-pretraining deviations. We will revise the theorem to state the exact-parameter assumption explicitly and add a remark that uses matrix perturbation theory to bound the change in contraction rate for small deviations, thereby addressing the additional error term. revision: yes
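
The kind of perturbation remark described in this response can be sketched with a generic operator-norm argument. The block below is not the paper's remark. It assumes the exact-parameter iteration has the affine form $x_{k+1} = M x_k + c$ with $\|M\| \le \rho < 1$, and that the learned parameters replace $M$ by $M + E$ with $\|E\| \le \varepsilon < 1 - \rho$. All symbols are introduced here for illustration.

```latex
% Contraction rate under perturbation (triangle inequality for the operator norm):
\|M + E\| \le \|M\| + \|E\| \le \rho + \varepsilon < 1 .

% Fixed points: x^\ast = M x^\ast + c, \qquad \tilde{x} = (M + E)\,\tilde{x} + c .
% Subtracting gives \tilde{x} - x^\ast = M(\tilde{x} - x^\ast) + E\,\tilde{x}, hence
\|\tilde{x} - x^\ast\| \le \rho\,\|\tilde{x} - x^\ast\| + \varepsilon\,\|\tilde{x}\| .

% Using \|\tilde{x}\| \le \|x^\ast\| + \|\tilde{x} - x^\ast\| and rearranging:
\|\tilde{x} - x^\ast\| \le \frac{\varepsilon\,\|x^\ast\|}{1 - \rho - \varepsilon} .
```

Under these assumptions the perturbed iteration still converges geometrically, at the slower rate $\rho + \varepsilon$, and its fixed point is shifted by an amount proportional to $\varepsilon$.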

Circularity Check

0 steps flagged

No circularity: algebraic equivalence and loss minimization proven directly from linear Transformer equations

full rationale

The paper's central results consist of an algebraic identity showing that specific linear-attention parameters make the CoT forward pass identical to repeated TD updates, a finite-sample geometric convergence bound on the resulting policy-evaluation error, and a direct proof that those same parameters globally minimize the pretraining loss. All three follow from the closed-form expression for linear attention and the explicit definition of the pretraining objective; neither the TD equivalence nor the minimizer property is obtained by fitting a parameter to the target quantity and relabeling it. No self-citation chain, uniqueness theorem, or ansatz is invoked to close the argument. The analysis is therefore self-contained within its stated linear policy-evaluation setting.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claims rest on the linear-Transformer architecture and the policy-evaluation setup; no additional free parameters or invented entities are introduced beyond the standard TD update rule.

axioms (1)
  • domain assumption: The Transformer is linear and operates in a policy-evaluation setting.
    Stated in the abstract as the analysis framework.



