Convergence and Emergence of In-Context Reinforcement Learning with Chain of Thought

Rohan Chandra; Shangtong Zhang; Xinyu Liu; Zixuan Xie

arxiv: 2605.07123 · v1 · submitted 2026-05-08 · 💻 cs.LG

Pith reviewed 2026-05-11 01:11 UTC · model grok-4.3

keywords: in-context reinforcement learning · chain of thought · temporal difference learning · linear transformer · policy evaluation · pretraining loss · convergence analysis · finite sample bounds

The pith

With the right parameters, chain-of-thought generation in a linear transformer performs repeated temporal difference learning updates.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes a direct mathematical connection between chain-of-thought reasoning and reinforcement learning inside transformers. In a policy evaluation setup, it shows that certain fixed transformer parameters make each generated thought token equivalent to one step of temporal difference learning. The evaluation error then shrinks geometrically with each additional thought step until it reaches a floor set by the length of available context. These same parameters are proven to be the global minimum of the pretraining loss, which accounts for their spontaneous appearance after training.

Core claim

In a policy evaluation setup with a linear Transformer, the CoT generation process with specific parameters is equivalent to repeatedly executing temporal difference learning updates. The policy evaluation error decreases geometrically with CoT length and eventually saturates at a statistical floor determined by the context length. The desired Transformer parameters are a global minimizer of the pretraining loss.
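To make the geometric rate plausible, the batch TD update quoted in the Theorem 1 passage further down this page can be unrolled. This is only a sketch: it assumes the standard linear TD error for $\delta_j$, writes $x_j'$ for the successor features and $\gamma$ for the discount, and introduces $\hat A$, $\hat b$, $\hat w$ as illustrative notation that is not taken from the paper.

```latex
\begin{align*}
w_{k+1} &= w_k + \frac{\alpha}{n}\sum_{j=1}^{n}\delta_j(w_k)\,x_j,
  \qquad \delta_j(w) = r_j + \gamma\,x_j'^{\top}w - x_j^{\top}w \\
        &= w_k + \alpha\,(\hat b - \hat A\,w_k),
  \qquad \hat A = \frac{1}{n}\sum_{j=1}^{n} x_j\,(x_j - \gamma x_j')^{\top},
  \quad \hat b = \frac{1}{n}\sum_{j=1}^{n} r_j\,x_j .
\end{align*}
% If \hat A is invertible, set \hat w = \hat A^{-1}\hat b. Then
\begin{align*}
w_{k+1} - \hat w = (I - \alpha\hat A)\,(w_k - \hat w)
\quad\Longrightarrow\quad
w_k - \hat w = (I - \alpha\hat A)^{k}\,(w_0 - \hat w).
\end{align*}
```

Under these assumptions the gap to $\hat w$ shrinks geometrically whenever the spectral radius of $I - \alpha\hat A$ is below one. The point $\hat w$ itself depends only on the $n$ context transitions, which is one way to read the statistical floor set by context length.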

What carries the argument

Linear Transformer parameters that equate chain-of-thought token generation to iterative temporal difference learning updates.
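As a concrete reference point, the update that each thought step is claimed to reproduce can be written in a few lines. The sketch below follows the batch TD formula quoted in the Theorem 1 passage on this page; the function name `batch_td_step`, the TD error definition (the standard linear one), and the toy data are assumptions for illustration, not material from the paper.

```python
import numpy as np

def batch_td_step(w, X, X_next, r, gamma, alpha):
    """One batch TD update: w + (alpha / n) * sum_j delta_j(w) * x_j."""
    n = X.shape[0]
    # Assumed standard TD error for a linear value estimate x^T w.
    delta = r + gamma * (X_next @ w) - X @ w
    return w + (alpha / n) * (X.T @ delta)

# Tiny hypothetical context: two transitions with one-hot features.
X = np.array([[1.0, 0.0], [0.0, 1.0]])       # features of visited states
X_next = np.array([[0.0, 1.0], [0.0, 0.0]])  # features of successor states
r = np.array([1.0, 2.0])                     # observed rewards

w = np.zeros(2)
for _ in range(3):  # three "thought steps" under the claimed equivalence
    w = batch_td_step(w, X, X_next, r, gamma=0.5, alpha=1.0)
```

No weights of any model change here; only the estimate `w` carried from step to step is updated, which mirrors the in-context reading of the claim.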

If this is right

In-context adaptation to new tasks occurs by executing internal TD updates without any weight changes.
Longer chain-of-thought sequences improve policy evaluation accuracy at a geometric rate until context length caps the gain.
Pretraining loss minimization naturally produces parameters that enable this in-context RL behavior.
Explicit finite-sample bounds quantify how quickly evaluation error converges with additional thought steps.
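The geometric decay and the floor described in the points above can be illustrated numerically. The following is a minimal sketch under stated assumptions: a hypothetical random Markov reward process, one-hot (tabular) features, and a context whose start states cycle through every state so that each one is covered. None of these choices, nor the name `batch_td_errors`, comes from the paper.

```python
import numpy as np

def batch_td_errors(n, num_steps, gamma=0.9, alpha=1.0, num_states=5, seed=0):
    """Run batch TD on n context transitions and track two error curves."""
    rng = np.random.default_rng(seed)
    # Hypothetical Markov reward process with known true values.
    P = rng.dirichlet(np.ones(num_states), size=num_states)
    r = rng.normal(size=num_states)
    v_true = np.linalg.solve(np.eye(num_states) - gamma * P, r)

    # Context: start states cycle through all states; successors are sampled from P.
    s = np.arange(n) % num_states
    s_next = np.array([rng.choice(num_states, p=P[i]) for i in s])
    X, X_next = np.eye(num_states)[s], np.eye(num_states)[s_next]
    rewards = r[s]

    # Empirical TD fixed point determined by this context alone.
    A_hat = X.T @ (X - gamma * X_next) / n
    b_hat = X.T @ rewards / n
    w_hat = np.linalg.solve(A_hat, b_hat)

    w = np.zeros(num_states)
    err_true, err_fixed = [], []
    for _ in range(num_steps):
        delta = rewards + gamma * (X_next @ w) - X @ w
        w = w + (alpha / n) * (X.T @ delta)
        err_true.append(np.max(np.abs(w - v_true)))   # error against the true values
        err_fixed.append(np.max(np.abs(w - w_hat)))   # gap to the context's fixed point
    floor = np.max(np.abs(w_hat - v_true))
    return np.array(err_true), np.array(err_fixed), floor

err_true, err_fixed, floor = batch_td_errors(n=200, num_steps=1500)
```

In this tabular setup the gap to the empirical fixed point (`err_fixed`) contracts at every step, while the error against the true values (`err_true`) levels off at `floor`, a quantity that depends on the sampled context rather than on the number of steps. Rerunning with a different `n` is a simple way to see how the floor moves.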

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same internal update mechanism may explain why chain-of-thought improves performance on planning and reasoning tasks outside reinforcement learning.
Approximate versions of the equivalence could appear in nonlinear transformers used in practice, offering a testable prediction for model inspection.
Pretraining objectives could be modified to encourage parameters that support more iterations or faster internal convergence.
One could verify the claim by extracting intermediate activations during chain-of-thought and checking whether they match TD value estimates.

Load-bearing premise

The proof assumes a linear attention mechanism and restricts the task to policy evaluation rather than full policy optimization.

What would settle it

Train a linear transformer to the claimed global minimum of pretraining loss, then compare its actual chain-of-thought outputs token-by-token against the sequence of temporal difference updates on held-out tasks.
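A harness for that token-by-token comparison might look like the sketch below. Since the paper's parameters and prompt layout are not reproduced on this page, the matrices `V` and `M` and the token layout `Z` are a hand-built, hypothetical construction that happens to make a softmax-free attention read-out equal to batch TD. In a real check, `linear_attention_step` would be replaced by the trained model's generated thought tokens.

```python
import numpy as np

def td_step(w, X, X_next, r, gamma, alpha):
    """Reference batch TD update (standard linear TD error assumed)."""
    delta = r + gamma * (X_next @ w) - X @ w
    return w + (alpha / len(r)) * (X.T @ delta)

def linear_attention_step(w, Z, V, M, alpha):
    """Residual update from a softmax-free attention read-out of context Z."""
    q = np.append(w, 1.0)     # thought token: current weights plus a constant 1
    scores = Z @ (M @ q)      # bilinear key-query score for each context token
    return w + (alpha / Z.shape[0]) * (V @ Z.T @ scores)

rng = np.random.default_rng(0)
n, d, gamma, alpha = 32, 4, 0.9, 0.1
X, X_next, r = rng.normal(size=(n, d)), rng.normal(size=(n, d)), rng.normal(size=n)

# Hypothetical token layout: [x_j, x_j - gamma * x_j', r_j] per context transition.
Z = np.hstack([X, X - gamma * X_next, r[:, None]])
V = np.hstack([np.eye(d), np.zeros((d, d + 1))])  # value map reads out x_j
M = np.zeros((2 * d + 1, d + 1))
M[d:2 * d, :d] = -np.eye(d)                       # contributes -(x_j - gamma * x_j')^T w
M[2 * d, d] = 1.0                                 # contributes + r_j

w_attn, w_td, max_gap = np.zeros(d), np.zeros(d), 0.0
for _ in range(10):
    w_attn = linear_attention_step(w_attn, Z, V, M, alpha)
    w_td = td_step(w_td, X, X_next, r, gamma, alpha)
    max_gap = max(max_gap, float(np.max(np.abs(w_attn - w_td))))
```

With these hand-set matrices the score of each token equals its TD error, so `max_gap` stays at floating-point noise. For a trained model, a gap that grows with the step index would count against the claimed equivalence.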

Figures

Figures reproduced from arXiv: 2605.07123 by Rohan Chandra, Shangtong Zhang, Xinyu Liu, Zixuan Xie.

**Figure 2.** Boyan’s chain topology with nonzero transitions. Adapted from [PITH_FULL_IMAGE:figures/full_fig_p029_2.png]

Original abstract

In-context reinforcement learning (ICRL) refers to the ability of RL agents to adapt to new tasks at inference time without parameter updates by conditioning on additional context. Recent empirical studies further demonstrate that Chain-of-Thought (CoT) generation can amplify this ICRL capability. This paper is the first to provide a theoretical understanding on how CoT interacts with ICRL. We conduct our analysis in a policy evaluation setup with linear Transformer. We prove that with specific Transformer parameters, the CoT generation process is equivalent to repeatedly executing temporal difference learning updates. Additionally, we provide finite sample convergence analysis showing that the policy evaluation error decreases geometrically with CoT length and eventually saturates at a statistical floor determined by the context length. We also prove that the desired Transformer parameters are a global minimizer of the pretraining loss, providing a theoretical understanding on the empirical emergence of those parameters.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper analyzes in-context reinforcement learning (ICRL) with Chain-of-Thought (CoT) in a linear Transformer under a policy-evaluation MDP. It proves that specific Transformer parameters make the CoT generation process algebraically equivalent to repeated temporal-difference (TD) updates, establishes finite-sample bounds showing that the policy-evaluation error contracts geometrically with CoT length before saturating at a statistical floor set by context length, and shows that these parameters are a global minimizer of the pretraining loss.

Significance. If the derivations hold, the work supplies the first rigorous account of how CoT length controls ICRL performance in the linear case and links the emergence of effective parameters directly to pretraining-loss minimization. The explicit equivalence and geometric convergence results are concrete strengths that could guide the choice of CoT length in practice; the global-minimizer argument is especially useful because it explains why the required parameters arise without hand-tuning.

major comments (2)

[§3] §3 (equivalence theorem): the algebraic identity between linear-attention CoT and the TD operator is derived from the closed-form attention update and holds only for linear attention in pure policy evaluation; once softmax nonlinearity or policy improvement is introduced the identity ceases to hold, yet the paper invokes the result to explain empirical emergence in practical (nonlinear, full-RL) models without supplying an approximation or robustness argument.
[§4] §4 (finite-sample convergence): the geometric rate and saturation floor are stated with respect to the linear-Transformer forward pass; the proof sketch relies on the contraction property of the TD operator being preserved exactly by the attention matrix, but the finite-sample bound does not quantify the additional error introduced when the learned attention matrix deviates from the exact TD operator during pretraining.

minor comments (2)

[Notation] The notation for the linear attention matrix and the TD target should be unified across the equivalence and convergence sections to avoid reader confusion.
[Figure 2] Figure 2 (convergence curves) would benefit from an additional panel showing the dependence on context length, as the statistical floor is a central claim.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed and constructive comments. We respond to each major comment below and indicate the revisions we will make.

Referee: [§3] §3 (equivalence theorem): the algebraic identity between linear-attention CoT and the TD operator is derived from the closed-form attention update and holds only for linear attention in pure policy evaluation; once softmax nonlinearity or policy improvement is introduced the identity ceases to hold, yet the paper invokes the result to explain empirical emergence in practical (nonlinear, full-RL) models without supplying an approximation or robustness argument.

Authors: The manuscript restricts its claims to linear Transformers under policy evaluation, as stated in the abstract and Section 2. The equivalence is exact only in this setting. The paper does not assert that the algebraic identity carries over to softmax attention or policy improvement; it presents the linear case as a rigorous foundation that can help interpret broader empirical phenomena. We will add explicit scope statements in the introduction and a limitations paragraph in the discussion section to prevent over-interpretation, while leaving approximation arguments for nonlinear cases to future work. revision: partial

Referee: [§4] §4 (finite-sample convergence): the geometric rate and saturation floor are stated with respect to the linear-Transformer forward pass; the proof sketch relies on the contraction property of the TD operator being preserved exactly by the attention matrix, but the finite-sample bound does not quantify the additional error introduced when the learned attention matrix deviates from the exact TD operator during pretraining.

Authors: Section 4 derives the geometric bound under the assumption that the attention parameters exactly match the TD operator, which is justified by the global-minimizer result of Section 5. We agree that the current statement does not quantify the effect of finite-pretraining deviations. We will revise the theorem to state the exact-parameter assumption explicitly and add a remark that uses matrix perturbation theory to bound the change in contraction rate for small deviations, thereby addressing the additional error term. revision: yes
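The kind of perturbation remark promised here can be illustrated with the simplest available bound. The sketch below is not the authors' argument: it assumes the deviation can be modelled as an additive error `E` on the matrix `A` that drives the update, and it uses only the triangle inequality for the spectral norm. The matrices and the name `contraction_factor` are hypothetical.

```python
import numpy as np

def contraction_factor(A, alpha):
    """Spectral norm of I - alpha * A; a value below 1 means the step contracts."""
    return np.linalg.norm(np.eye(A.shape[0]) - alpha * A, ord=2)

rng = np.random.default_rng(0)
d, alpha = 4, 0.5
B = rng.normal(size=(d, d))
A = np.eye(d) + 0.1 * (B @ B.T) / d   # hypothetical symmetric positive definite "exact" matrix
E = 1e-2 * rng.normal(size=(d, d))    # hypothetical small deviation left by pretraining

rho = contraction_factor(A, alpha)
rho_perturbed = contraction_factor(A + E, alpha)
# Triangle inequality: ||I - alpha (A + E)|| <= ||I - alpha A|| + alpha ||E||.
bound = rho + alpha * np.linalg.norm(E, ord=2)
```

As long as `bound` stays below one, the perturbed update is still a contraction, only at a slower rate. A sharper statement would need the structure of the actual attention parameters, which this sketch does not model.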

Circularity Check

0 steps flagged

No circularity: algebraic equivalence and loss minimization proven directly from linear Transformer equations

full rationale

The paper's central results consist of an algebraic identity showing that specific linear-attention parameters make the CoT forward pass identical to repeated TD updates, a finite-sample geometric convergence bound on the resulting policy-evaluation error, and a direct proof that those same parameters globally minimize the pretraining loss. All three follow from the closed-form expression for linear attention and the explicit definition of the pretraining objective; neither the TD equivalence nor the minimizer property is obtained by fitting a parameter to the target quantity and relabeling it. No self-citation chain, uniqueness theorem, or ansatz is invoked to close the argument. The analysis is therefore self-contained within its stated linear policy-evaluation setting.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claims rest on the linear-Transformer architecture and the policy-evaluation setup; no additional free parameters or invented entities are introduced beyond the standard TD update rule.

axioms (1)

Domain assumption: The Transformer is linear and operates in a policy-evaluation setting.
Stated in the abstract as the analysis framework.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel (J uniqueness) · relation: unclear
Paper passage: Theorem 1: with $(P, Q)$ in (9), CoT step (7) on prompt (8) yields exactly $w_{k+1} = w_k + \frac{\alpha}{n} \sum_j \delta_j(w_k)\, x_j$ (batch TD).

IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction · relation: unclear
Paper passage: MSPBE $L(w) = \lVert C^{-1/2}(Aw - b) \rVert^2$ and contraction under $\eta \in (0, \mu/L]$.

IndisputableMonolith/Foundation/AlphaCoordinateFixation.lean · alpha_pin_under_high_calibration · relation: unclear
Paper passage: Emergence: $\theta^*$ globally minimizes $J_k(\theta; D)$ as $k \to \infty$ under finite-sample conditions.

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.


[1] [1]

Proceedings of the International Conference on Machine Learning , year =

Amir Moeini and Minjae Kwon and Alper Kamil Bozkurt and Yuichi Motai and Rohan Chandra and Lu Feng and Shangtong Zhang , title =. Proceedings of the International Conference on Machine Learning , year =

work page

[2] [2]

ArXiv Preprint , year =

Zixuan Xie and Xinyu Liu and Claire Chen and Shuze Daniel Liu and Rohan Chandra and Shangtong Zhang , title =. ArXiv Preprint , year =

work page

[3] [3]

Gandharv Patil and L. A. Prashanth and Dheeraj Nagaraj and Doina Precup , title =. Proceedings of the International Conference on Artificial Intelligence and Statistics , year =

work page

[4] [4]

Proceedings of the Conference on Learning Theory , year =

Sergey Samsonov and Daniil Tiapkin and Alexey Naumov and Eric Moulines , title =. Proceedings of the Conference on Learning Theory , year =

work page

[5] [5]

arXiv preprint , year =

Wei-Cheng Lee and Francesco Orabona , title =. arXiv preprint , year =

work page

[6] [6]

Finite-Sample Analysis of LSTD , booktitle =

Alessandro Lazaric and Mohammad Ghavamzadeh and R. Finite-Sample Analysis of LSTD , booktitle =

work page

[7] [7]

and Rosenthal, Jeffrey S

Roberts, Gareth O. and Rosenthal, Jeffrey S. , journal=. General state space

work page

[8] [8]

2002 , publisher=

Lectures on the Coupling Method , author=. 2002 , publisher=

work page 2002

[9] [9]

2017 , publisher=

Asymptotic Theory of Weakly Dependent Random Processes , author=. 2017 , publisher=

work page 2017

[10] [10]

and Cao, Yuan and Narasimhan, Karthik , title =

Yao, Shunyu and Yu, Dian and Zhao, Jeffrey and Shafran, Izhak and Griffiths, Thomas L. and Cao, Yuan and Narasimhan, Karthik , title =. 2023 , booktitle=

work page 2023

[11] [11]

2023 , booktitle=

Self-Consistency Improves Chain of Thought Reasoning in Language Models , author=. 2023 , booktitle=

work page 2023

[12] [12]

2024 , booktitle=

Artificial Generational Intelligence: Cultural Accumulation in Reinforcement Learning , author=. 2024 , booktitle=

work page 2024

[13] [13]

2022 , booktitle=

Transformers are Meta-Reinforcement Learners , author=. 2022 , booktitle=

work page 2022

[14] [14]

2020 , booktitle=

VariBAD: A Very Good Method for Bayes-Adaptive Deep RL via Meta-Learning , author=. 2020 , booktitle=

work page 2020

[15] [15]

2018 , journal=

Some Considerations on Learning to Explore via Meta-Reinforcement Learning , author=. 2018 , journal=

work page 2018

[16] [16]

International Conference on Machine Learning , year=

Been There, Done That: Meta-Learning with Episodic Recall , author=. International Conference on Machine Learning , year=

work page

[17] [17]

2018 , booktitle=

A Simple Neural Attentive Meta-Learner , author=. 2018 , booktitle=

work page 2018

[18] [18]

Proceedings of the International Conference on Machine Learning , year=

Emergence of in-context reinforcement learning from noise distillation , author=. Proceedings of the International Conference on Machine Learning , year=

work page

[19] [19]

Proceedings of the International Conference on Machine Learning , year=

Vintix: Action model via in-context reinforcement learning , author=. Proceedings of the International Conference on Machine Learning , year=

work page

[20] [20]

Proceedings of the International Conference on Machine Learning , year=

Emergent agentic transformer from chain of hindsight experience , author=. Proceedings of the International Conference on Machine Learning , year=

work page

[21] [21]

NeurIPS Foundation Models for Decision Making Workshop , year=

Towards General-Purpose In-Context Learning Agents , author =. NeurIPS Foundation Models for Decision Making Workshop , year=

work page

[22] [22]

Ilya Zisman and Alexander Nikulin and Viacheslav Sinii and Denis Tarasov and Nikita Lyubaykin and Andrei Polubarov and Igor Kiselev and Vladislav Kurenkov , booktitle =

work page

[23] [23]

2022 , booktitle=

Generalized Decision Transformer for Offline Hindsight Information Matching , author=. 2022 , booktitle=

work page 2022

[24] [24]

2022 , booktitle=

Prompting Decision Transformer for Few-Shot Policy Generalization , author=. 2022 , booktitle=

work page 2022

[25] [25]

2022 , booktitle=

RvS: What is Essential for Offline RL via Supervised Learning? , author=. 2022 , booktitle=

work page 2022

[26] [26]

Transactions on Machine Learning Research , year=

Random Policy Enables In-Context Reinforcement Learning within Trust Horizons , author=. Transactions on Machine Learning Research , year=

work page

[27] [27]

Proceedings of the International Conference on Machine Learning , year=

Generalization to New Sequential Decision Making Tasks with In-Context Learning , author=. Proceedings of the International Conference on Machine Learning , year=

work page

[28] [28]

ArXiv preprint , year=

Scaling Algorithm Distillation for Continuous Control with Mamba , author=. ArXiv preprint , year=

work page

[29]

Ahmad Elawady and Gunjan Chhablani and Ram Ramrakhya and Karmesh Yadav and Dhruv Batra and Zsolt Kira and Andrew Szot.

[30]

LocoFormer: Generalist Locomotion via Long-context Adaptation. Proceedings of the Conference on Robot Learning.

[31]

Human-Timescale Adaptation in an Open-Ended Task Space. Proceedings of the International Conference on Machine Learning.

[32]

User-Friendly Tail Bounds for Sums of Random Matrices. Foundations of Computational Mathematics.

[33]

The Expected Norm of a Sum of Independent Random Matrices: An Elementary Approach. 2015.

[34]

Estimating the Mixing Coefficients of Geometrically Ergodic Markov Processes. 2024.

[35]

Basic Properties of Strong Mixing Conditions. A Survey and Some Open Questions. Probability Surveys.

[36]

How Many Pretraining Tasks Are Needed for In-Context Learning of Linear Regression? 2024.

[37]

Rethinking Attention with Performers. 2021.

[38]

Katharopoulos, Angelos and Vyas, Apoorv and Pappas, Nikolaos and Fleuret, Fran. Transformers are RNNs: fast autoregressive transformers with linear attention.

[39]

Chain-of-Thought Prompting Elicits Reasoning in Large Language Models. 2022.

[40]

Reward Is Enough: LLMs Are In-Context Reinforcement Learners. 2026.

[41]

A Survey on In-context Learning. Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing.

[42]

Can Looped Transformers Learn to Implement Multi-step Gradient Descent for In-context Learning? 2024.

[43]

A Tutorial on Meta-Reinforcement Learning. 2025.

[44]

A Survey of In-Context Reinforcement Learning. 2025.

[45]

Decision Mamba: Reinforcement Learning via Hybrid Selective Sequence Modeling. 2024.

[46]

Huang, Sili and Hu, Jifeng and Chen, Hechang and Sun, Lichao and Yang, Bo. 2024.

[47]

Transformers Learn to Implement Multi-step Gradient Descent with Chain of Thought. 2025.

[48]

In-context Exploration-Exploitation for Reinforcement Learning. 2024.

[49]

Shi, Lucy Xiaoyang and Jiang, Yunfan and Grigsby, Jake and Fan, Linxi Jim and Zhu, Yuke. 2023.

[50]

Meta-Reinforcement Learning Robust to Distributional Shift Via Performing Lifelong In-Context Learning. Proceedings of the International Conference on Machine Learning.

[51]

AMAGO-2: Breaking the Multi-Task Barrier in Meta-Reinforcement Learning with Transformers. 2024.

[52]

AMAGO: Scalable In-Context Reinforcement Learning for Adaptive Agents. 2024.

[53]

Lu, Chris and Schroecker, Yannick and Gu, Albert and Parisotto, Emilio and Foerster, Jakob and Singh, Satinder and Behbahani, Feryal. 2023.

[54]

Introducing Symmetries to Black Box Meta Reinforcement Learning. 2022.

[55]

Shan, Kaiyu and Wang, Yongtao and Tang, Zhi and Chen, Ying and Li, Yangyan. MixTConv: Mixed Temporal Convolutional Kernels for Efficient Action Recognition.

[56]

Li, Yan and Ji, Bin and Shi, Xintian and Zhang, Jianguo and Kang, Bin and Wang, Limin. TEA: Temporal Excitation and Aggregation for Action Recognition.

[57]

TSM: Temporal Shift Module for Efficient Video Understanding. Proceedings of the IEEE International Conference on Computer Vision.

[58]

A Non-asymptotic Analysis of Non-parametric Temporal-Difference Learning. International Conference on Learning Representations.

[59]

Wang, Jiuqi and Chandra, Rohan and Zhang, Shangtong. Advances in Neural Information Processing Systems.

[60]

Transformers Can Learn Temporal Difference Methods for In-Context Reinforcement Learning. International Conference on Learning Representations.

[61]

Transformers Implement Functional Gradient Descent to Learn Non-Linear Functions In Context. Proceedings of the International Conference on Machine Learning.

[62]

Kane, Daniel and Liu, Sihan and Lovett, Shachar and Mahajan, Gaurav and Szepesv. Exponential Hardness of Reinforcement Learning with Linear Function Approximation.

[63]

A small gain analysis of single timescale actor critic. SIAM Journal on Control and Optimization.

[64]

Finite-time analysis of single-timescale actor-critic. Advances in Neural Information Processing Systems.

[65]

A Generalized Reinforcement-Learning Model: Convergence and Applications. Proceedings of the International Conference on Machine Learning.

[66]

Beck, Carolyn L. and Srikant, R. Improved upper bounds on the expected error in constant step-size Q-learning.

[67]

On the convergence and sample complexity analysis of deep q-networks with $\epsilon$-greedy exploration. Advances in Neural Information Processing Systems.

[68] [69]

Volodymyr Mnih and Koray Kavukcuoglu and David Silver and Alex Graves and Ioannis Antonoglou and Daan Wierstra and Martin Riedmiller.

[69] [70]

Fanghui Liu and Luca Viano and Volkan Cevher.

[70] [71]

Zaiwei Chen and John Paul Clarke and Siva Theja Maguluri. SIAM Journal on Mathematics of Data Science.

[71] [72]

Han-Dong, Lim and Donghwan, Lee. Regularized

[72] [73]

Devraj, Adithya M. and Meyn, Sean P. 2022.

[73] [74]

Melo, Francisco S. and Ribeiro, M. Isabel. Q-Learning with Linear Function Approximation.

[74] [75]

Zhang, Yixuan and Xie, Qiaomin. Constant stepsize

[75] [76]

Chen, Zaiwei and Zhang, Sheng and Doan, Thinh T and Maguluri, Siva Theja and Clarke, John-Paul. Performance of

[76] [77]

Joan, Bas-Serrano and Sebastian, Curi and Andreas, Krause and Gergely, Neu.

[77] [78]

Gopalan, Aditya and Thoppe, Gugan. arXiv preprint.

[78] [79]

Gao, Bolin and Pavel, Lacra.

[79] [80]

Meyn, Sean. The Projected Bellman Equation in Reinforcement Learning.

[80] [81]

Shangtong Zhang and Remi Tachet and Romain Laroche. 2022.