Reinforcement Learning Foundation Models Should Already Be A Thing

Abdelrahman Zighem; Jill-Jênn Vie

arxiv: 2606.18812 · v2 · pith:RKEPXCWNnew · submitted 2026-06-17 · 💻 cs.LG · cs.AI

Pith reviewed 2026-06-26 21:47 UTC · model grok-4.3


keywords: reinforcement learning · foundation models · synthetic data · MDPs · in-context learning · sufficient statistics · graph attention network


The pith

MDPs have a fixed-size sufficient statistic that is tabular and independent of observed episodes, enabling foundation models for RL pretrained on synthetic data.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper claims that reinforcement learning is missing from the foundation model paradigm despite the feasibility of generating synthetic MDPs, just as synthetic data powers tabular models like TabPFN. It establishes that MDPs possess a fixed-size sufficient statistic that stays tabular no matter how many episodes are observed. This statistic fits directly into attention-based architectures, simply by swapping the output head for a policy instead of a prediction target. A proof-of-concept model trained entirely on synthetic MDPs then solves new tabular RL problems in context, using fewer episodes than classical methods online and matching strong baselines offline.

Core claim

MDPs admit a fixed-size sufficient statistic, independent of the episodes observed and tabular in shape, which makes them directly amenable to the attention-based architectures used for tabular foundation models, with a policy head replacing the supervised target.

What carries the argument

The fixed-size sufficient statistic of MDPs that remains tabular regardless of episode count, allowing direct application of transformer attention with a policy head.

Load-bearing premise

That sampling a synthetic MDP is as feasible as sampling a synthetic tabular dataset, and that such synthetic MDPs provide useful priors for real-world RL problems.
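To make the feasibility half of this premise concrete, here is a minimal sketch of drawing one synthetic tabular MDP and computing a supervision target for it. The prior is an assumption chosen for illustration (symmetric Dirichlet transitions, Gaussian mean rewards); the page does not describe the paper's actual prior, and the names `sample_synthetic_mdp`, `optimal_policy`, `alpha` and `gamma` are hypothetical.

```python
import numpy as np

def sample_synthetic_mdp(n_states, n_actions, rng, alpha=0.5):
    """Draw one random tabular MDP from an illustrative prior (not the paper's)."""
    # P[s, a] is a distribution over next states, drawn from a symmetric Dirichlet.
    P = rng.dirichlet(alpha * np.ones(n_states), size=(n_states, n_actions))
    # R[s, a] is a mean reward drawn from a standard normal.
    R = rng.normal(size=(n_states, n_actions))
    return P, R

def optimal_policy(P, R, gamma=0.95, n_iters=500):
    """Plain value iteration; the greedy policy could serve as a training label."""
    V = np.zeros(P.shape[0])
    for _ in range(n_iters):
        Q = R + gamma * P @ V      # shape (n_states, n_actions)
        V = Q.max(axis=1)
    return Q.argmax(axis=1)

rng = np.random.default_rng(0)
P, R = sample_synthetic_mdp(n_states=6, n_actions=3, rng=rng)
pi_star = optimal_policy(P, R)
```

Sampling is cheap, so the open question in the premise is the second half: whether MDPs drawn this way resemble the tasks one actually cares about.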

What would settle it

A trained model that requires task-specific fine-tuning or fails to outperform baselines on held-out tabular MDPs would falsify the claim that the sufficient statistic enables effective in-context RL foundation models.

Figures

Figures reproduced from arXiv: 2606.18812 by Abdelrahman Zighem, Jill-Jênn Vie.

**Figure 1.** In-context returns as a function of the number of episodes in context, one panel per evaluation depth K and one line per held-out environment. The model was trained with K = 20. Since FrozenLake is a stochastic environment, the return is an average over 256 runs. Normalized scores are computed as $(R - R_{\mathrm{rand}}) / (R_{\mathrm{opt}} - R_{\mathrm{rand}})$, where $R$ is the average model return, $R_{\mathrm{opt}}$ is the optimal average return and $R_{\mathrm{rand}}$ is the ave…
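The normalization in the Figure 1 caption can be sketched as a small helper. This assumes that the random baseline is the average return of a random policy (the caption is cut off at that point); the function name and the example numbers are made up for illustration.

```python
def normalized_score(r_model, r_rand, r_opt):
    """Map a raw return to a scale where 0 is the random baseline and 1 is optimal."""
    if r_opt == r_rand:
        raise ValueError("optimal and random returns coincide; score is undefined")
    return (r_model - r_rand) / (r_opt - r_rand)

# Hypothetical returns: halfway between the random baseline and the optimum.
score = normalized_score(r_model=5.0, r_rand=1.0, r_opt=9.0)
print(score)  # 0.5
```

A score can fall below 0 (worse than random) or, with noisy return estimates, slightly above 1.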

**Figure 2.** Offline policy recovery from a fixed uniform-random dataset. The figure plots the exact normalized start-state value of the policy recovered by the in-context model (solid) and by pessimistic value iteration (VI-LCB, dashed) against the number of episodes in the dataset, for three held-out MDPs (color); mean over 8 seeds (shaded: one standard deviation). The VI-LCB penalty coefficient is c = 0.1, hand-pic…
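For readers unfamiliar with the offline baseline in Figure 2, the following is a rough sketch of pessimistic value iteration on count tables. The penalty shape `c / sqrt(n)`, the floor value used for unvisited state-action pairs, the discount `gamma` and the toy dataset are all assumptions for illustration; only the coefficient c = 0.1 comes from the caption, and the paper's exact VI-LCB variant may differ.

```python
import numpy as np

def vi_lcb(counts, reward_sum, c=0.1, gamma=0.95, n_iters=200):
    """Value iteration on the empirical MDP with a count-based penalty."""
    n_sa = counts.sum(axis=2)                 # visits per (state, action)
    seen = n_sa > 0
    safe_n = np.maximum(n_sa, 1)              # avoid division by zero
    p_hat = counts / safe_n[:, :, None]       # empirical transition probabilities
    r_hat = reward_sum / safe_n               # empirical mean rewards
    penalty = c / np.sqrt(safe_n)             # assumed penalty shape
    # Unvisited pairs get a fixed pessimistic value so they are never preferred.
    q_floor = (r_hat[seen].min() - c) / (1.0 - gamma) if seen.any() else 0.0
    V = np.zeros(counts.shape[0])
    for _ in range(n_iters):
        Q = r_hat - penalty + gamma * p_hat @ V
        Q = np.where(seen, Q, q_floor)
        V = Q.max(axis=1)
    return Q.argmax(axis=1), V

# Toy dataset with 2 states and 2 actions; action 1 in state 1 is never tried.
counts = np.zeros((2, 2, 2))
counts[0, 0, 0] = 10      # state 0, action 0: stays in state 0, reward 0
counts[0, 1, 1] = 10      # state 0, action 1: moves to state 1, reward 1
counts[1, 0, 1] = 10      # state 1, action 0: stays in state 1, reward 1
reward_sum = np.array([[0.0, 10.0], [10.0, 0.0]])
policy, V = vi_lcb(counts, reward_sum)
print(policy)  # [1 0]
```

The penalty shrinks as a pair is visited more often, which is why the baseline improves with the number of episodes on the horizontal axis.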

Original abstract

Foundation models for language and vision are powered by internet-scale data, while structured domains such as tabular prediction are powered by synthetic data. This substitute shifts the challenge from collection to prior design. Such priors already exist for many structured tasks: TabPFN and its successors solve tabular classification with a transformer pretrained on a synthetic Bayesian prior. We make two points. **First**, reinforcement learning is the conspicuous gap: sampling a synthetic MDP is as feasible as sampling a synthetic tabular dataset, yet no in-context RL work treats prior design as a primary objective. **Second**, MDPs admit a fixed-size sufficient statistic, independent of the episodes observed and tabular in shape, which makes them directly amenable to the attention-based architectures used for tabular foundation models, with a policy head replacing the supervised target. Together these define the agenda for an RL foundation model. As a proof of concept, we train a Graph Attention Network entirely on synthetic MDPs and show that, with no task-specific tuning, it solves held-out tabular benchmarks in context, both online and offline: online, in far fewer episodes than UCB-VI and tabular Q-learning, and offline, competitively with VI-LCB.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The paper sketches a clean way to apply synthetic priors to RL by feeding MDP count tables into a graph attention network with a policy head, but the POC stays inside the synthetic distribution.


The main point is that MDPs have a fixed-size sufficient statistic in the form of transition and reward count tables, so you can treat them like tabular data and pretrain an attention model on synthetic MDPs the way TabPFN does for classification. They replace the usual output head with a policy head and train a GAT entirely on synthetic data.
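The fixed-size claim is easy to see in code. The sketch below accumulates transition counts and per-pair reward sums from a list of episodes; the tuple layout `(s, a, r, s_next)`, the function name and the choice to summarize rewards by their sums are assumptions for illustration, not the paper's input format.

```python
import numpy as np

def count_tables(episodes, n_states, n_actions):
    """Summarize any number of episodes into arrays whose shape never changes."""
    counts = np.zeros((n_states, n_actions, n_states), dtype=np.int64)
    reward_sum = np.zeros((n_states, n_actions))
    for episode in episodes:
        for s, a, r, s_next in episode:
            counts[s, a, s_next] += 1
            reward_sum[s, a] += r
    return counts, reward_sum

one_episode = [[(0, 1, -1.0, 1), (1, 0, 10.0, 2)]]
many_episodes = one_episode * 100

c1, r1 = count_tables(one_episode, n_states=3, n_actions=2)
c100, r100 = count_tables(many_episodes, n_states=3, n_actions=2)
print(c1.shape, c100.shape)  # (3, 2, 3) (3, 2, 3)
```

One episode or a hundred, the model input has shape |S|×|A|×|S|; only the entries grow. This is what lets a fixed architecture consume an arbitrarily long interaction history.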

The POC shows the resulting model can solve held-out tabular RL benchmarks in context, both online (fewer episodes than UCB-VI or tabular Q-learning) and offline (competitive with VI-LCB). That is new as a direct application of the synthetic-prior approach to RL.

The argument about the sufficient statistic is internally consistent and explains the architecture choice without extra machinery. The agenda is stated plainly: RL has been missing this kind of prior design work.

The soft spot is that all the held-out benchmarks come from the same synthetic distribution used for pretraining. There is no test of whether the model still helps when the target MDP has transition or reward structure outside that distribution, which is the part that would matter for reducing real-world data collection. The abstract also gives no numbers, error bars, or protocol details, so the size of the reported gains is hard to judge.

This is for people already working on in-context learning or foundation models for structured decision problems. A reader thinking about priors for RL would get a clear starting point.

It deserves peer review because the idea is coherent and the POC is at least a reasonable first step, even if more distribution-shift experiments would be needed to strengthen the case.

Referee Report

2 major / 2 minor

Summary. The paper argues that RL foundation models are feasible by pretraining attention-based models (e.g., GAT) on synthetic MDPs, using their fixed-size tabular sufficient statistic (transition/reward count tables of size |S|×|A|×|S|) as input with a policy head in place of a supervised output, analogous to TabPFN. As a proof-of-concept, a GAT trained entirely on synthetic MDPs is shown to solve held-out tabular benchmarks in-context without task-specific tuning: outperforming UCB-VI and tabular Q-learning online (fewer episodes) and competing with VI-LCB offline.
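As a rough picture of the architecture the summary describes, the sketch below runs a single-head graph attention layer over state nodes and reads out per-state action probabilities. Everything specific is an assumption: the log-count node features, the edge rule (observed transitions plus self-loops), the reuse of one layer `depth_k` times, the random untrained weights and all parameter names. It shows the data flow from count table to policy, not the authors' model.

```python
import numpy as np

def leaky_relu(x, slope=0.2):
    return np.where(x > 0, x, slope * x)

def softmax(x, axis):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def gat_layer(H, adj, W, a_src, a_dst):
    """Single-head graph attention: each node averages its neighbours' features."""
    Z = H @ W
    # scores[i, j] = LeakyReLU(a_src . Z[i] + a_dst . Z[j])
    scores = leaky_relu((Z @ a_src)[:, None] + (Z @ a_dst)[None, :])
    scores = np.where(adj, scores, -np.inf)   # mask out non-neighbours
    return softmax(scores, axis=1) @ Z

def policy_from_counts(counts, params, depth_k=4):
    n_states, n_actions, _ = counts.shape
    # One node per state; features are that state's flattened log-counts.
    X = np.log1p(counts.reshape(n_states, n_actions * n_states).astype(float))
    # Edge i -> j if any transition from i to j was observed, plus self-loops.
    adj = (counts.sum(axis=1) > 0) | np.eye(n_states, dtype=bool)
    H = np.tanh(X @ params["W_in"])
    for _ in range(depth_k):                  # the same weights reused at every step
        H = np.tanh(gat_layer(H, adj, params["W"], params["a_src"], params["a_dst"]))
    return softmax(H @ params["W_pi"], axis=1)  # policy head: (n_states, n_actions)

rng = np.random.default_rng(0)
S, A, D = 4, 2, 8
counts = rng.integers(0, 3, size=(S, A, S))
params = {
    "W_in": rng.normal(size=(A * S, D)),
    "W": rng.normal(size=(D, D)) / np.sqrt(D),
    "a_src": rng.normal(size=D),
    "a_dst": rng.normal(size=D),
    "W_pi": rng.normal(size=(D, A)),
}
pi = policy_from_counts(counts, params)
```

In this sketch the input projection `W_in` depends on |S| and |A|, so it does not handle variable-size MDPs; how the paper deals with that is exactly the question raised in the first minor comment.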

Significance. If the transfer results hold beyond the synthetic distribution, the work would establish a concrete agenda for in-context RL foundation models by reducing prior design to synthetic MDP sampling and leveraging existing tabular FM architectures. The identification of the episode-independent tabular sufficient statistic is a clean observation that directly motivates the architecture choice and could enable reproducible, parameter-light pretraining pipelines.

major comments (2)

[Abstract] (second point and POC paragraph): The evaluation demonstrates gains only on held-out MDPs sampled from the same synthetic prior used in pretraining. This leaves untested whether the learned priors improve performance on MDPs whose transition or reward structure lies outside that distribution, which is required to support the claim that synthetic MDP pretraining supplies useful priors for real-world RL tasks.
[Abstract] (POC description): No quantitative metrics, episode counts, error bars, or statistical significance tests are supplied for the claims of 'far fewer episodes' versus UCB-VI/Q-learning or competitiveness with VI-LCB, preventing verification that the architecture advantage is practically meaningful rather than marginal.

minor comments (2)

The manuscript should clarify the precise form of the sufficient statistic (e.g., whether raw counts, normalized probabilities, or log-probabilities are fed to the GAT) and how variable |S| and |A| are handled during pretraining and inference.
Related work on in-context RL or meta-RL that already uses synthetic task distributions should be cited to better position the novelty of treating prior design as the primary objective.

Simulated Author's Rebuttal

2 responses · 0 unresolved

Thank you for the opportunity to respond to the referee's report. We address each of the major comments below.


Referee: [Abstract] (second point and POC paragraph): The evaluation demonstrates gains only on held-out MDPs sampled from the same synthetic prior used in pretraining. This leaves untested whether the learned priors improve performance on MDPs whose transition or reward structure lies outside that distribution, which is required to support the claim that synthetic MDP pretraining supplies useful priors for real-world RL tasks.

Authors: We agree that the current experiments evaluate performance only on held-out MDPs drawn from the same synthetic prior used for pretraining. This mirrors the in-distribution evaluation protocol used in tabular foundation model work such as TabPFN and is appropriate for a proof-of-concept establishing that in-context RL is feasible under a synthetic prior. The manuscript does not claim or demonstrate that the learned priors improve performance on MDPs with transition or reward structures outside this distribution, nor does it provide evidence for direct applicability to real-world tasks. We will revise the abstract to explicitly qualify the scope of the held-out evaluation and to frame the contribution as a proof-of-concept for the overall agenda rather than as evidence of utility on real-world MDPs.

Revision: yes

Referee: [Abstract] (POC description): No quantitative metrics, episode counts, error bars, or statistical significance tests are supplied for the claims of 'far fewer episodes' versus UCB-VI/Q-learning or competitiveness with VI-LCB, preventing verification that the architecture advantage is practically meaningful rather than marginal.

Authors: The abstract provides a concise summary of the results. The full manuscript contains the detailed quantitative comparisons, including episode counts and performance metrics with the baselines. To make the abstract self-contained, we will revise it to include specific quantitative metrics (e.g., average episodes to solve and performance values) along with references to the corresponding figures and tables.

Revision: yes

Circularity Check

0 steps flagged

No significant circularity; derivation is self-contained


The paper states two points: (1) synthetic MDPs can be sampled like tabular data (a feasibility claim, not a derivation), and (2) finite MDPs have a fixed-size tabular sufficient statistic consisting of |S|×|A|×|S| transition/reward counts. This second point is a standard property of MDPs, not derived from the paper's own equations, fits, or citations. The GAT architecture choice follows directly from this known tabular structure by replacing the supervised head with a policy head. The POC trains on synthetic MDPs and evaluates on held-out synthetic benchmarks drawn from the same distribution family; this is consistent with the setup but does not reduce any result to a self-fit or self-citation. TabPFN is cited as external prior work with no author overlap. No self-definitional, fitted-input, or uniqueness-imported steps appear. The derivation chain is therefore independent of its own outputs.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 1 invented entity

The central claim rests on the domain assumption that MDPs possess a fixed-size tabular sufficient statistic and that synthetic MDP sampling can be made representative of real tasks; no free parameters or invented entities are explicitly fitted in the abstract.

free parameters (1)

GAT architecture and training hyperparameters
Specific choices for the graph attention network layers, attention heads, and training procedure on synthetic MDPs are required but not detailed.

axioms (1)

Domain assumption: MDPs admit a fixed-size sufficient statistic that is independent of observed episodes and tabular in shape
Invoked as the key property enabling direct use of attention architectures with a policy head.

invented entities (1)

RL foundation model pretrained on synthetic MDPs (no independent evidence)
purpose: In-context solver for online and offline RL tasks without task-specific tuning
New concept introduced as the target agenda; no independent evidence provided.



Reference graph

Works this paper leans on

24 extracted references · 9 linked inside Pith

[1] Bommasani, R., Hudson, D. A., Adeli, E., Altman, R., Arora, S., von Arx, S., Bernstein, M. S., Bohg, J., Bosselut, A., Brunskill, E., et al. On the opportunities and risks of foundation models. arXiv preprint arXiv:2108.07258.
[2] Choi, K., Cundy, C., Srivastava, S., and Ermon, S. LMPriors: Pre-trained language models as task-specific priors. In NeurIPS Workshop on Foundation Models for Decision Making. URL https://arxiv.org/abs/2210.12530.
[3] Cobbe, K., Hesse, C., Hilton, J., and Schulman, J. Leveraging procedural generation to benchmark reinforcement learning. In International Conference on Machine Learning (ICML). URL https://arxiv.org/abs/1912.01588.
[4] Dearden, R., Friedman, N., and Russell, S. Bayesian Q-learning. In AAAI Conference on Artificial Intelligence.
[6] Grigsby, J., Fan, L., and Zhu, Y. AMAGO: Scalable in-context reinforcement learning for adaptive agents. In International Conference on Learning Representations (ICLR). URL https://arxiv.org/abs/2310.09971.
[7] Haarnoja, T., Zhou, A., Abbeel, P., and Levine, S. Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor. In International Conference on Machine Learning (ICML), pp. 1861–1870.
[8] Hollmann, N., Müller, S., Purucker, L., Krishnakumar, A., Körner, M., Hoo, R. S., Shen, H., and Hutter, F. Accurate predictions on small data with a tabular foundation model. Nature.
[9] Laskin, M., Wang, L., Oh, J., Parisotto, E., Spencer, S., Steigerwald, R., Strouse, D., Hansen, S., Filos, A., Brooks, E., Gazeau, M., Sahni, H., Singh, S., and Mnih, V. In-context reinforcement learning with algorithm distillation. In International Conference on Learning Representations (ICLR). URL https://arxiv.org/abs/2210.14215.
[10] Lee, J. N., Xie, A., Pacchiano, A., Chandak, Y., Finn, C., Nachum, O., and Brunskill, E. Supervised pretraining can learn in-context reinforcement learning. In Advances in Neural Information Processing Systems (NeurIPS). URL https://arxiv.org/abs/2306.14892.
[11] Lin, L., Bai, Y., and Mei, S. Transformers as decision makers: Provable in-context reinforcement learning via supervised pretraining. In International Conference on Learning Representations (ICLR). URL https://arxiv.org/abs/2310.08566.
[12] Müller, S., Hollmann, N., Pineda Arango, S., Grabocka, J., and Hutter, F. Transformers can do Bayesian inference. In International Conference on Learning Representations (ICLR). URL https://arxiv.org/abs/2112.10510.
[13] Müller, S., Reuter, A., Hollmann, N., Rügamer, D., and Hutter, F. Position: The future of Bayesian prediction is prior-fitted. In International Conference on Machine Learning (ICML). URL https://arxiv.org/abs/2505.23947.
[14] Qu, J., Holzmüller, D., Varoquaux, G., and Le Morvan, M. TabICL: A tabular foundation model for in-context learning on large data. In International Conference on Machine Learning (ICML). URL https://arxiv.org/abs/2502.05564.
[15] Qu, J., Holzmüller, D., Varoquaux, G., and Le Morvan, M. TabICLv2: A better, faster, scalable, and open tabular foundation model. arXiv preprint arXiv:2602.11139.
[16] Schiff, D., Lindenbaum, O., and Efroni, Y. Gradient free deep reinforcement learning with TabPFN. arXiv preprint arXiv:2509.11259. URL https://arxiv.org/abs/2509.11259.
[17] Son, J., Lee, S., and Kim, G. Distilling reinforcement learning algorithms for in-context model-based planning. In International Conference on Learning Representations (ICLR). URL https://arxiv.org/abs/2502.19009.
[18] Strens, M. A Bayesian framework for reinforcement learning. In International Conference on Machine Learning (ICML).
[19] Veličković, P., Cucurull, G., Casanova, A., Romero, A., Liò, P., and Bengio, Y. Graph attention networks. URL https://arxiv.org/abs/1710.10903.
[20] Wang, J. X., Kurth-Nelson, Z., Tirumala, D., Soyer, H., Leibo, J. Z., Munos, R., Blundell, C., Kumaran, D., and Botvinick, M. Learning to reinforcement learn. arXiv preprint arXiv:1611.05763.
[21] Zabërgja, G., Kamel, R., Kadra, A., Frey, C. M. M., and Grabocka, J. End-to-end compression for tabular foundation models. arXiv preprint arXiv:2602.05649. URL https://arxiv.org/abs/2602.05649.
[22] A. Implementation and training details. This appendix documents the proof-of-concept implementation: the prior over MDPs, the supervision targets, the model input, the architecture, and the optimization. The four stages map one-to-one onto th...
[23] and FrozenLake (Brockman et al., 2016). For GridWorld, the initial and final states are always respectively the top-left and bottom-right corners of the square. We set step cost = −1 and goal r = 10.0. For FrozenLake, we use a 4×4 grid, with holes at index 5, 7, 11, 12 (the grid is indexed row-major), slip = 0.2, step cost = 0, hole r = −1 and goal r = 1.0. B.2. Learnin...
[24] The first protocol measures the greedy return as a function of the number of episodes in context and of the evaluation depth K. We roll out 512 episodes per environment, built with τ = 0.3 and read off the greedy return at every episode index that is a power of two, sweeping K ∈ {4, 8, 16, 20, 24, 38} over the weight-tied propagation layer. A greedy episode is s...
[25] The third protocol isolates the model as an offline estimator: given a fixed dataset of transitions, how good a policy can it recover, and how does that compare against a standard offline-RL baseline on the identical data. We fix the behavior policy to uniform random, reset to s_start = 0 whenever an absorbing state is reached or after 50 steps, and collec...
