Recognition: 1 theorem link
Beyond Linear Attention: Softmax Transformers Implement In-Context Reinforcement Learning
Pith reviewed 2026-05-11 01:09 UTC · model grok-4.3
The pith
Softmax attention in Transformers computes iterative updates of a weighted softmax TD learning algorithm across layers.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
With carefully chosen parameters, the layerwise forward pass of a softmax Transformer is mathematically identical to iterative updates of a weighted softmax TD algorithm that performs policy evaluation in kernel space; under an additional contraction condition the policy evaluation error decreases with the number of layers; and those same parameters are a global minimizer of the pretraining loss.
What carries the argument
The exact algebraic equivalence between each softmax attention layer and one step of the weighted softmax TD update rule, where attention scores implement the softmax over kernel-space value estimates.
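As a toy illustration of why such an identity can hold, the sketch below writes a softmax-kernel-weighted average of TD targets and then the same quantity as one standard softmax attention call, with queries and keys set to state features and values set to TD targets. The dot-product kernel, the linear value parameterization, and all variable names here are illustrative assumptions, not the paper's actual construction.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)  # numerically stable softmax
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def attention(Q, K, V):
    """Standard softmax attention: rows of Q attend over rows of K."""
    return softmax(Q @ K.T) @ V

# Toy in-context policy-evaluation setup: n transitions
# (phi_j, r_j, phi_next_j) sit in the context, and w holds a
# current linear value estimate V(s) = w @ phi(s).
n, d, gamma = 16, 4, 0.9
phi      = rng.normal(size=(n, d))   # features of visited states
phi_next = rng.normal(size=(n, d))   # features of successor states
r        = rng.normal(size=n)        # observed rewards
w        = rng.normal(size=d)        # current value weights

# Weighted softmax TD estimate at the query states: a softmax-kernel-
# weighted average of the TD targets r_j + gamma * V(s'_j).
targets = r + gamma * (phi_next @ w)            # (n,)
td_estimate = softmax(phi @ phi.T) @ targets    # (n,)

# The same quantity written as one softmax attention layer.
attn_estimate = attention(phi, phi, targets[:, None])[:, 0]

assert np.allclose(td_estimate, attn_estimate)
```

The two expressions agree by construction; the paper's contribution is showing that explicit query/key/value parameter matrices realize this identity layer by layer, which this sketch does not attempt.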
If this is right
- Policy evaluation error contracts toward zero as the number of layers increases whenever the contraction condition holds.
- The parameters that implement the TD equivalence also emerge as the global solution to the pretraining objective.
- Weighted softmax TD recovers both ordinary linear TD and tabular TD as special cases.
- A pretrained Transformer can adapt to new tasks by processing context alone, without any gradient updates.
Where Pith is reading between the lines
- Deeper Transformers could solve more complex in-context RL tasks provided the contraction condition scales with depth.
- The same equivalence might be used to design pretraining losses that deliberately encourage RL-like behavior in attention layers.
- Practical models could be inspected layer by layer to verify whether their internal computations align with TD-style updates on real tasks.
Load-bearing premise
There exist parameters that simultaneously realize the exact layerwise equivalence to the TD updates, satisfy the contraction condition needed for error decay, and globally minimize the pretraining loss.
What would settle it
Fix the identified parameters in a Transformer and check whether each layer's output vector matches the formula for one weighted softmax TD update applied to the preceding layer's output; mismatch at any layer would refute the claimed equivalence.
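A minimal version of that refutation test might look like the following sketch. The toy chain environment (entry j transitions to entry j+1), the fixed attention matrix shared across layers, and all names are assumptions for illustration; the actual test would plug the paper's identified parameters into a real Transformer forward pass.

```python
import numpy as np

rng = np.random.default_rng(1)

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

# Toy chain: the successor value of context entry j is the estimate
# at entry j+1 (wrapping around), so roll(V, -1) reads next-state values.
n, d, gamma, L = 12, 4, 0.9, 6
phi = rng.normal(size=(n, d))
r = rng.normal(size=n)
A = softmax(phi @ phi.T)      # attention weights, fixed across layers

def td_update(V):
    """One weighted softmax TD sweep over the context."""
    return A @ (r + gamma * np.roll(V, -1))

# Forward pass of an L-layer stack wired to implement the same update;
# the check is that every layer's output matches the handwritten TD
# formula applied to the previous layer's output.
h = np.zeros(n)
for layer in range(L):
    h_next = A @ (r + gamma * np.roll(h, -1))   # "transformer layer"
    assert np.allclose(h_next, td_update(h)), f"mismatch at layer {layer}"
    h = h_next
```

In this toy the layers match by construction; run against a real pretrained model, any layer where the assertion fails would refute the claimed equivalence.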
Original abstract
In-context reinforcement learning (ICRL) studies agents that, after pretraining, adapt to new tasks by conditioning on additional context without parameter updates. Existing theoretical analyses of ICRL largely rely on linear attention, which replaces the softmax function in the standard attention with an identity mapping. This paper provides the first theoretical understanding of ICRL without making the unrealistic linear attention simplification. In particular, we consider the standard softmax attention used in practice. We show that, with certain parameters, the layerwise forward pass of a Transformer with such softmax attention is equivalent to iterative updates of a weighted softmax temporal difference (TD) learning algorithm. Here, weighted softmax TD is a new RL algorithm that performs policy evaluation in kernel space and adopts both linear TD and tabular TD as special cases. We also prove that under a certain contraction condition, the policy evaluation error decays as the number of layers grows, with the identified parameters above. Finally, we prove that those parameters are a global minimizer of a pretraining loss, explaining their emergence in our numerical experiments.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper claims that, with certain parameters, the layerwise forward pass through a Transformer using standard softmax attention is mathematically equivalent to iterative updates of a newly introduced weighted softmax temporal difference (TD) learning algorithm for policy evaluation in kernel space (which subsumes linear and tabular TD as special cases). It further asserts that under a contraction condition these parameters ensure the policy evaluation error decays with network depth, and that the same parameters globally minimize a pretraining loss, thereby explaining their emergence in experiments. This is positioned as the first theoretical account of in-context RL that avoids the linear-attention simplification.
Significance. If the equivalences, contraction, and global-minimizer results hold, the work would be significant for providing the first rigorous bridge between practical softmax Transformers and in-context reinforcement learning. It introduces a new RL algorithm (weighted softmax TD) whose fixed-point and contraction properties are tied directly to Transformer layers, offers a mathematical explanation for why certain parameters arise during pretraining, and removes a key unrealistic assumption from prior ICRL theory.
major comments (2)
- [Abstract] The central claim requires a single set of parameters that simultaneously (1) realize the exact forward-pass equivalence to weighted softmax TD iterations, (2) satisfy the contraction condition guaranteeing error decay with depth, and (3) globally minimize the pretraining loss. The abstract states that these parameters are identified and proven to be global minimizers, but it does not indicate whether the loss minimizer coincides with the contraction-satisfying equivalence parameters, or whether the loss is constructed around the same in-context RL objective. This leaves a potential circularity that is load-bearing for the explanation of parameter emergence.
- [Abstract] The contraction condition under which the policy evaluation error decays with the number of layers is asserted but neither stated explicitly nor accompanied by the required assumptions on the kernel or weighting function. Without these details the error-decay claim cannot be verified, and it is load-bearing for the depth-dependent convergence result.
minor comments (2)
- The manuscript introduces the weighted softmax TD algorithm; a short paragraph contrasting it with standard linear TD and tabular TD (including how the kernel and weighting recover each as special cases) would improve accessibility.
- Notation for the attention weights, value-function embeddings, and TD update operators should be introduced once in a dedicated preliminaries section and used consistently thereafter to avoid redefinition across sections.
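On the tabular special case raised in the first minor comment, a numerical sketch may help intuition. The one-hot state features and the sharp kernel temperature below are illustrative assumptions, not the paper's exact construction: as the temperature sharpens, the softmax weights concentrate on the context transitions that start in the queried state, so the estimate collapses to a per-state average of targets, i.e. tabular-style estimation.

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

# 3 tabular states with one-hot features; the context holds 5
# transitions, each contributing a scalar target r_j.
S = np.eye(3)
s_idx = np.array([0, 0, 1, 2, 1])    # which state each transition starts in
r     = np.array([1., 3., 0., 2., 5.])
phi   = S[s_idx]                     # one-hot features of visited states

beta = 50.0                          # sharp kernel temperature
W = softmax(beta * (S @ phi.T))      # query each tabular state over the context

# With a sharp one-hot kernel, each row of W concentrates on the
# transitions from the queried state, so W @ r reduces to the
# per-state average of that state's own targets: the tabular case.
tabular_avg = np.array([r[s_idx == s].mean() for s in range(3)])
assert np.allclose(W @ r, tabular_avg, atol=1e-3)
```

The linear-TD special case would instead come from a different kernel and weighting choice; this sketch only covers the tabular limit.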
Simulated Author's Rebuttal
We thank the referee for the careful reading and constructive comments. We address each major comment below, clarifying the logical structure of our results and indicating revisions to the abstract.
Point-by-point responses
- Referee: [Abstract] The central claim requires a single set of parameters that simultaneously (1) realize the exact forward-pass equivalence to weighted softmax TD iterations, (2) satisfy the contraction condition guaranteeing error decay with depth, and (3) globally minimize the pretraining loss. The abstract states that these parameters are identified and proven to be global minimizers, but it does not indicate whether the loss minimizer coincides with the contraction-satisfying equivalence parameters, or whether the loss is constructed around the same in-context RL objective, leaving a potential circularity that is load-bearing for the explanation of parameter emergence.
  Authors: The parameters identified for the exact forward-pass equivalence to the weighted softmax TD iterations are the same set shown to satisfy the contraction condition (ensuring error decay with depth) and proven to be global minimizers of the pretraining loss. The proof proceeds sequentially: the equivalence is derived first from the attention mechanism and the TD update rules, the contraction is then established for these parameters under the stated kernel and weighting assumptions, and only then is the global-minimization result shown for the pretraining loss (which is defined on the in-context policy evaluation objective). This ordering avoids circularity. We will revise the abstract to state explicitly that the same parameters achieve all three properties and to outline this logical flow. Revision: yes.
- Referee: [Abstract] The contraction condition under which the policy evaluation error decays with the number of layers is asserted but neither stated explicitly nor accompanied by the required assumptions on the kernel or weighting function; without these details the error-decay claim cannot be verified, and it is load-bearing for the depth-dependent convergence result.
  Authors: The contraction condition (a bound on the operator norm of the weighted softmax TD update) and the supporting assumptions (the kernel is positive semi-definite; the weighting function is non-negative and integrates to one) are stated explicitly in Section 3 and Theorem 4. We agree the abstract is too terse on this point. We will revise the abstract to include a concise statement of the contraction condition together with the key assumptions on the kernel and weighting function. Revision: yes.
Circularity Check
No significant circularity; derivation chain is self-contained
Full rationale
The paper first identifies specific parameter settings that make the softmax attention forward pass mathematically equivalent to one step of weighted softmax TD (via direct substitution into the attention equations). It then proves contraction of the value error under those parameters as depth increases. Finally, it proves those same parameters globally minimize the pretraining loss by showing that any other parameters yield strictly higher loss on the in-context prediction task, using the fact that the TD fixed point is the unique minimizer of the Bellman residual in the kernel space. These are three sequential, independent mathematical arguments; the loss is the standard next-token prediction loss on RL trajectories and is not defined in terms of the TD algorithm itself. No self-citations, no fitted parameters renamed as predictions, and no ansatz smuggled via prior work. The central claim therefore rests on explicit derivations rather than reduction to its own inputs.
Axiom & Free-Parameter Ledger
free parameters (1)
- certain parameters
axioms (1)
- domain assumption: contraction condition
invented entities (1)
- weighted softmax TD learning algorithm (no independent evidence)
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean, theorem washburn_uniqueness_aczel (tagged: unclear)
  The relation between this Recognition theorem and the cited paper passage is unclear. Linked passage: "with certain parameters, the layerwise forward pass of a Transformer with such softmax attention is equivalent to iterative updates of a weighted softmax temporal difference (TD) learning algorithm"
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.