Reinforcement Learning Foundation Models Should Already Be A Thing
Pith reviewed 2026-06-26 21:47 UTC · model grok-4.3
The pith
MDPs have a fixed-size sufficient statistic that is tabular and independent of observed episodes, enabling foundation models for RL pretrained on synthetic data.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
MDPs admit a fixed-size sufficient statistic, independent of the episodes observed and tabular in shape, which makes them directly amenable to the attention-based architectures used for tabular foundation models, with a policy head replacing the supervised target.
What carries the argument
The fixed-size sufficient statistic of MDPs that remains tabular regardless of episode count, allowing direct application of transformer attention with a policy head.
Load-bearing premise
Sampling a synthetic MDP is as feasible as sampling a synthetic tabular dataset and that such synthetic MDPs provide useful priors for real-world RL problems.
What would settle it
A trained model that requires task-specific fine-tuning or fails to outperform baselines on held-out tabular MDPs would falsify the claim that the sufficient statistic enables effective in-context RL foundation models.
Figures
read the original abstract
Foundation models for language and vision are powered by internet-scale data, while structured domains such as tabular prediction are powered by synthetic data. This substitute shifts the challenge from collection to prior design. Such priors already exist for many structured tasks: TabPFN and its successors solve tabular classification with a transformer pretrained on a synthetic Bayesian prior. We make two points. \textbf{First}, reinforcement learning is the conspicuous gap: sampling a synthetic MDP is as feasible as sampling a synthetic tabular dataset, yet no in-context RL work treats prior design as a primary objective. \textbf{Second}, MDPs admit a fixed-size sufficient statistic, independent of the episodes observed and tabular in shape, which makes them directly amenable to the attention-based architectures used for tabular foundation models, with a policy head replacing the supervised target. Together these define the agenda for an RL foundation model. As a proof of concept, we train a Graph Attention Network entirely on synthetic MDPs and show that, with no task-specific tuning, it solves held-out tabular benchmarks in context, both online and offline: online, in far fewer episodes than UCB-VI and tabular Q-learning, and offline, competitively with VI-LCB.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper argues that RL foundation models are feasible by pretraining attention-based models (e.g., GAT) on synthetic MDPs, using their fixed-size tabular sufficient statistic (transition/reward count tables of size |S|×|A|×|S|) as input with a policy head in place of a supervised output, analogous to TabPFN. As a proof-of-concept, a GAT trained entirely on synthetic MDPs is shown to solve held-out tabular benchmarks in-context without task-specific tuning: outperforming UCB-VI and tabular Q-learning online (fewer episodes) and competing with VI-LCB offline.
Significance. If the transfer results hold beyond the synthetic distribution, the work would establish a concrete agenda for in-context RL foundation models by reducing prior design to synthetic MDP sampling and leveraging existing tabular FM architectures. The identification of the episode-independent tabular sufficient statistic is a clean observation that directly motivates the architecture choice and could enable reproducible, parameter-light pretraining pipelines.
major comments (2)
- [Abstract] Abstract (second point and POC paragraph): The evaluation demonstrates gains only on held-out MDPs sampled from the same synthetic prior used in pretraining. This leaves untested whether the learned priors improve performance on MDPs whose transition or reward structure lies outside that distribution, which is required to support the claim that synthetic MDP pretraining supplies useful priors for real-world RL tasks.
- [Abstract] Abstract (POC description): No quantitative metrics, episode counts, error bars, or statistical significance tests are supplied for the claims of 'far fewer episodes' versus UCB-VI/Q-learning or competitiveness with VI-LCB, preventing verification that the architecture advantage is practically meaningful rather than marginal.
minor comments (2)
- The manuscript should clarify the precise form of the sufficient statistic (e.g., whether raw counts, normalized probabilities, or log-probabilities are fed to the GAT) and how variable |S| and |A| are handled during pretraining and inference.
- Related work on in-context RL or meta-RL that already uses synthetic task distributions should be cited to better position the novelty of treating prior design as the primary objective.
Simulated Author's Rebuttal
Thank you for the opportunity to respond to the referee's report. We address each of the major comments below.
read point-by-point responses
-
Referee: [Abstract] Abstract (second point and POC paragraph): The evaluation demonstrates gains only on held-out MDPs sampled from the same synthetic prior used in pretraining. This leaves untested whether the learned priors improve performance on MDPs whose transition or reward structure lies outside that distribution, which is required to support the claim that synthetic MDP pretraining supplies useful priors for real-world RL tasks.
Authors: We agree that the current experiments evaluate performance only on held-out MDPs drawn from the same synthetic prior used for pretraining. This mirrors the in-distribution evaluation protocol used in tabular foundation model work such as TabPFN and is appropriate for a proof-of-concept establishing that in-context RL is feasible under a synthetic prior. The manuscript does not claim or demonstrate that the learned priors improve performance on MDPs with transition or reward structures outside this distribution, nor does it provide evidence for direct applicability to real-world tasks. We will revise the abstract to explicitly qualify the scope of the held-out evaluation and to frame the contribution as a proof-of-concept for the overall agenda rather than as evidence of utility on real-world MDPs. revision: yes
-
Referee: [Abstract] Abstract (POC description): No quantitative metrics, episode counts, error bars, or statistical significance tests are supplied for the claims of 'far fewer episodes' versus UCB-VI/Q-learning or competitiveness with VI-LCB, preventing verification that the architecture advantage is practically meaningful rather than marginal.
Authors: The abstract provides a concise summary of the results. The full manuscript contains the detailed quantitative comparisons, including episode counts and performance metrics with the baselines. To make the abstract self-contained, we will revise it to include specific quantitative metrics (e.g., average episodes to solve and performance values) along with references to the corresponding figures and tables. revision: yes
Circularity Check
No significant circularity; derivation is self-contained
full rationale
The paper states two points: (1) synthetic MDPs can be sampled like tabular data (a feasibility claim, not a derivation), and (2) finite MDPs have a fixed-size tabular sufficient statistic consisting of |S|×|A|×|S| transition/reward counts. This second point is a standard property of MDPs, not derived from the paper's own equations, fits, or citations. The GAT architecture choice follows directly from this known tabular structure by replacing the supervised head with a policy head. The POC trains on synthetic MDPs and evaluates on held-out synthetic benchmarks drawn from the same distribution family; this is consistent with the setup but does not reduce any result to a self-fit or self-citation. TabPFN is cited as external prior work with no author overlap. No self-definitional, fitted-input, or uniqueness-imported steps appear. The derivation chain is therefore independent of its own outputs.
Axiom & Free-Parameter Ledger
free parameters (1)
- GAT architecture and training hyperparameters
axioms (1)
- domain assumption MDPs admit a fixed-size sufficient statistic that is independent of observed episodes and tabular in shape
invented entities (1)
-
RL foundation model pretrained on synthetic MDPs
no independent evidence
Reference graph
Works this paper leans on
-
[1]
A., Adeli, E., Altman, R., Arora, S., von Arx, S., Bernstein, M
Bommasani, R., Hudson, D. A., Adeli, E., Altman, R., Arora, S., von Arx, S., Bernstein, M. S., Bohg, J., Bosse- lut, A., Brunskill, E., et al. On the opportunities and risks of foundation models.arXiv preprint arXiv:2108.07258,
-
[2]
Choi, K., Cundy, C., Srivastava, S., and Ermon, S
URL https: //arxiv.org/abs/2106.01345. Choi, K., Cundy, C., Srivastava, S., and Ermon, S. LM- Priors: Pre-trained language models as task-specific pri- ors. InNeurIPS Workshop on Foundation Models for Decision Making,
-
[3]
Cobbe, K., Hesse, C., Hilton, J., and Schulman, J
URL https://arxiv.org/ abs/2210.12530. Cobbe, K., Hesse, C., Hilton, J., and Schulman, J. Lever- aging procedural generation to benchmark reinforce- ment learning. InInternational Conference on Machine Learning (ICML),
-
[4]
Dearden, R., Friedman, N., and Russell, S
URL https://arxiv.org/ abs/1912.01588. Dearden, R., Friedman, N., and Russell, S. Bayesian Q- learning. InAAAI Conference on Artificial Intelligence,
arXiv 1912
-
[6]
URL https://arxiv. org/abs/1611.02779. Grigsby, J., Fan, L., and Zhu, Y . AMAGO: Scalable in- context reinforcement learning for adaptive agents. In International Conference on Learning Representations (ICLR),
-
[7]
Haarnoja, T., Zhou, A., Abbeel, P., and Levine, S
URL https://arxiv.org/abs/ 2310.09971. Haarnoja, T., Zhou, A., Abbeel, P., and Levine, S. Soft actor- critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor. InInternational Con- ference on Machine Learning (ICML), pp. 1861–1870,
-
[8]
URL https://arxiv.org/abs/2207.01848. 5 Position: Reinforcement Learning Foundation Models Should Already Be A Thing Hollmann, N., M¨uller, S., Purucker, L., Krishnakumar, A., K¨orner, M., Hoo, R. S., Shen, H., and Hutter, F. Accu- rate predictions on small data with a tabular foundation model.Nature,
-
[9]
URL https://arxiv.org/ abs/2501.02945. Laskin, M., Wang, L., Oh, J., Parisotto, E., Spencer, S., Steigerwald, R., Strouse, D., Hansen, S., Filos, A., Brooks, E., Gazeau, M., Sahni, H., Singh, S., and Mnih, V . In-context reinforcement learning with algorithm distilla- tion. InInternational Conference on Learning Represen- tations (ICLR),
-
[10]
URL https://arxiv.org/ abs/2210.14215. Lee, J. N., Xie, A., Pacchiano, A., Chandak, Y ., Finn, C., Nachum, O., and Brunskill, E. Supervised pretraining can learn in-context reinforcement learning. InAdvances in Neural Information Processing Systems (NeurIPS),
-
[11]
URLhttps://arxiv.org/abs/2306.14892. Lin, L., Bai, Y ., and Mei, S. Transformers as decision makers: Provable in-context reinforcement learning via supervised pretraining. InInternational Conference on Learning Representations (ICLR),
-
[12]
M¨uller, S., Hollmann, N., Pineda Arango, S., Grabocka, J., and Hutter, F
URL https: //arxiv.org/abs/2310.08566. M¨uller, S., Hollmann, N., Pineda Arango, S., Grabocka, J., and Hutter, F. Transformers can do Bayesian infer- ence. InInternational Conference on Learning Represen- tations (ICLR),
-
[13]
M¨uller, S., Reuter, A., Hollmann, N., R ¨ugamer, D., and Hutter, F
URL https://arxiv.org/ abs/2112.10510. M¨uller, S., Reuter, A., Hollmann, N., R ¨ugamer, D., and Hutter, F. Position: The future of Bayesian prediction is prior-fitted. InInternational Conference on Machine Learning (ICML),
-
[14]
Qu, J., Holzm ¨uller, D., Varoquaux, G., and Le Morvan, M
URL https://arxiv.org/ abs/2505.23947. Qu, J., Holzm ¨uller, D., Varoquaux, G., and Le Morvan, M. TabICL: A tabular foundation model for in-context learning on large data. InInternational Conference on Machine Learning (ICML),
-
[15]
Qu, J., Holzm ¨uller, D., Varoquaux, G., and Le Morvan, M
URL https: //arxiv.org/abs/2502.05564. Qu, J., Holzm ¨uller, D., Varoquaux, G., and Le Morvan, M. TabICLv2: A better, faster, scalable, and open tabu- lar foundation model.arXiv preprint arXiv:2602.11139,
-
[16]
Schiff, D., Lindenbaum, O., and Efroni, Y
URL https://arxiv.org/abs/ 2205.06175. Schiff, D., Lindenbaum, O., and Efroni, Y . Gradient free deep reinforcement learning with TabPFN.arXiv preprint arXiv:2509.11259,
-
[17]
URL https:// arxiv.org/abs/2509.11259. Son, J., Lee, S., and Kim, G. Distilling reinforcement learning algorithms for in-context model-based planning. InInternational Conference on Learning Representa- tions (ICLR),
- [18]
-
[19]
Veliˇckovi´c, P., Cucurull, G., Casanova, A., Romero, A., Li`o, P., and Bengio, Y
URL https://arxiv.org/abs/1706.03762. Veliˇckovi´c, P., Cucurull, G., Casanova, A., Romero, A., Li`o, P., and Bengio, Y . Graph attention networks,
-
[20]
URL https://arxiv.org/abs/1710.10903. Wang, J. X., Kurth-Nelson, Z., Tirumala, D., Soyer, H., Leibo, J. Z., Munos, R., Blundell, C., Kumaran, D., and Botvinick, M. Learning to reinforcement learn.arXiv preprint arXiv:1611.05763,
-
[21]
Zab¨ergja, G., Kamel, R., Kadra, A., Frey, C
URL https://arxiv.org/abs/2410.07927. Zab¨ergja, G., Kamel, R., Kadra, A., Frey, C. M. M., and Grabocka, J. End-to-end compression for tabular foun- dation models.arXiv preprint arXiv:2602.05649,
-
[22]
6 Position: Reinforcement Learning Foundation Models Should Already Be A Thing A
URLhttps://arxiv.org/abs/2602.05649. 6 Position: Reinforcement Learning Foundation Models Should Already Be A Thing A. Implementation and training details This appendix documents the proof-of-concept implementation: the prior over MDPs, the supervision targets, the model input, the architecture, and the optimization. The four stages map one-to-one onto th...
-
[23]
For GridWorld, the initial and final states are always respectively the top-left and bottom-right corners of the square
and FrozenLake (Brockman et al., 2016). For GridWorld, the initial and final states are always respectively the top-left and bottom-right corners of the square. We set step cost=−1andgoal r= 10.0. For FrozenLake, we use a 4×4 grid, with holes at index 5,7,11,12 (the grid is indexed row-major), slip= 0.2 , step cost= 0,hole r=−1andgoal r= 1.0. B.2. Learnin...
2016
-
[24]
The first protocol measures the greedy return as a function of the number of episodes in context and of the evaluation depth K. We roll out 512 episodes per environment, built with τ= 0.3 and read off the greedy return at every episode index that is a power of two, sweeping K∈ {4,8,16,20,24,38} over the weight-tied propagation layer. A greedy episode is s...
2000
-
[25]
We fix the behavior policy to uniform random, reset to sstart = 0 whenever an absorbing state is reached or after 50 steps, and collect a single stream of transitions per seed
The third protocol isolates the model as an offline estimator: given a fixed dataset of transitions, how good a policy can it recover, and how does that compare against a standard offline-RL baseline on the identical data. We fix the behavior policy to uniform random, reset to sstart = 0 whenever an absorbing state is reached or after 50 steps, and collec...
2048
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.