pith. machine review for the scientific record.

arxiv: 2605.09009 · v1 · submitted 2026-05-09 · 💻 cs.LG · cs.AI

Recognition: 2 theorem links


Large Language Models for Sequential Decision-Making: Improving In-Context Learning via Supervised Fine-Tuning

Authors on Pith: no claims yet

Pith reviewed 2026-05-12 02:35 UTC · model grok-4.3

classification 💻 cs.LG cs.AI
keywords large language models · in-context learning · supervised fine-tuning · sequential decision-making · Markov decision processes · partially observable MDPs · Q-functions

The pith

Supervised fine-tuning on offline trajectories lets LLMs learn sequential decision policies that beat pure in-context learning.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that fine-tuning pretrained large language models on offline, oracle-labeled trajectories equips them for few-shot sequential decision-making in MDPs, POMDPs, and APOMDPs. This supervised approach yields smaller optimality gaps than in-context-only or random baselines, especially in long-horizon, partially observed, or model-ambiguous settings. In the linear MDP case, the work interprets the fine-tuned attention layer as implicitly estimating optimal Q-functions from in-context data and uses that view to derive an end-to-end suboptimality bound that cleanly separates in-context estimation error from training-length bias. The empirical gains suggest that offline data can be turned into usable decision-making skill without further online interaction, pointing to a practical route for applying LLMs in domains where offline trajectories are plentiful.
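The training recipe summarized here (supervised next-token fine-tuning on offline trajectories, with the loss restricted to the oracle action labels) can be sketched in a few lines. This is a toy illustration with a character-level stand-in tokenizer, not the paper's actual implementation; the encoding format is hypothetical.

```python
# Build one SFT example from an offline oracle-labeled trajectory.
# Labels are masked (IGNORE) at every non-action position, so the
# cross-entropy loss supervises only the oracle action tokens.

IGNORE = -100  # conventional ignore index for cross-entropy losses

def build_sft_example(trajectory, oracle_actions, tokenize):
    """trajectory: list of (state, action) steps; oracle_actions: oracle labels.
    Returns (input_ids, labels) with every non-action position masked out."""
    input_ids, labels = [], []
    for (state, _), a_star in zip(trajectory, oracle_actions):
        s_toks = tokenize(f"s={state}")
        a_toks = tokenize(f"a={a_star}")
        input_ids += s_toks + a_toks
        labels += [IGNORE] * len(s_toks) + a_toks  # supervise actions only
    return input_ids, labels

# Toy character-level tokenizer.
tok = lambda s: [ord(c) for c in s]
ids, labs = build_sft_example([(0, 1), (2, 0)], [1, 0], tok)
assert len(ids) == len(labs) == 12
assert labs.count(IGNORE) == 6  # the state tokens carry no loss
```

In a real fine-tuning setup the same masking is achieved by setting the ignore index on the loss; the point of the sketch is only that state context conditions the prediction while gradients flow solely through action positions.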

Core claim

By applying supervised fine-tuning to pretrained LLMs on offline, oracle-labeled trajectories, the models acquire few-shot sequential decision-making capability in MDPs, POMDPs, and APOMDPs. For linear MDPs the fine-tuned attention layer is interpreted as implicitly estimating optimal Q-functions from the in-context data; this interpretation yields an end-to-end suboptimality bound for the resulting policy that separates in-context estimation error from training-length bias. Across synthetic environments the fine-tuned models produce substantially smaller optimality gaps than in-context-only and random baselines, with the largest gains appearing in longer-horizon, partially observed, and model-ambiguous settings.

What carries the argument

A fine-tuned attention layer interpreted as implicitly estimating optimal Q-functions from in-context data, which is used to derive the separated suboptimality bound.

Load-bearing premise

A fine-tuned attention layer can be meaningfully interpreted as implicitly estimating optimal Q-functions from in-context data.
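One way to make this premise concrete: a single softmax attention layer attending over in-context (feature, return) pairs computes a kernel-weighted average of those returns, which is a Nadaraya-Watson style regression estimate of Q. A toy sketch of that mechanism follows; it is illustrative only, the paper's linear-MDP construction is more specific, and the feature vectors and temperature here are hypothetical.

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def attention_q_estimate(query_phi, context_phis, context_returns, temp=1.0):
    """Q estimate at the query features as an attention-weighted average of
    in-context returns: weights = softmax(<phi_query, phi_i> / temp)."""
    scores = [sum(q * k for q, k in zip(query_phi, phi)) / temp
              for phi in context_phis]
    weights = softmax(scores)
    return sum(w * r for w, r in zip(weights, context_returns))

# In-context data whose returns are linear in the features: Q(phi) = <w*, phi>.
w_star = [2.0, -1.0]
phis = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
rets = [sum(a * b for a, b in zip(w_star, p)) for p in phis]

# With a sharp temperature, attention concentrates on the closest context
# entry, so the estimate approaches Q([1, 0]) = 2.
q_hat = attention_q_estimate([1.0, 0.0], phis, rets, temp=0.1)
assert abs(q_hat - 2.0) < 0.05
```

The referee's objection is precisely that SFT is not guaranteed to land the attention parameters in this regression-like regime; the sketch shows only that the functional form is available to the architecture, not that training reaches it.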

What would settle it

Measuring the policy's actual suboptimality on a linear MDP and finding that it fails to decompose into the predicted in-context estimation term plus training-length bias term.
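Such a test presupposes that the policy's suboptimality can be measured exactly. In small tabular environments it can, via value iteration; a minimal sketch on a toy two-state MDP (all numbers illustrative, not from the paper):

```python
# Exact optimality gap of a policy in a tiny tabular MDP via value iteration.
GAMMA = 0.9
P = {0: {0: [(1.0, 0)], 1: [(1.0, 1)]},          # P[s][a] = [(prob, next_state)]
     1: {0: [(1.0, 0)], 1: [(1.0, 1)]}}
R = {0: {0: 0.0, 1: 1.0}, 1: {0: 0.0, 1: 1.0}}   # R[s][a]

def q_value(V, s, a):
    return R[s][a] + GAMMA * sum(p * V[s2] for p, s2 in P[s][a])

def policy_value(policy, iters=500):
    V = {s: 0.0 for s in P}
    for _ in range(iters):
        V = {s: q_value(V, s, policy[s]) for s in P}
    return V

def optimal_value(iters=500):
    V = {s: 0.0 for s in P}
    for _ in range(iters):
        V = {s: max(q_value(V, s, a) for a in R[s]) for s in P}
    return V

V_star = optimal_value()                   # always taking action 1 gives V* = 10
V_pi = policy_value({0: 0, 1: 1})          # this policy is suboptimal at state 0
gap = max(V_star[s] - V_pi[s] for s in P)  # worst-state optimality gap
```

Measuring this gap across environments, then regressing it against the measured in-context estimation error and the training length, is the kind of check that would confirm or break the claimed two-term decomposition.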

Figures

Figures reproduced from arXiv: 2605.09009 by Minmin Zhang, Sina Aghaei, Soroush Saghafian.

Figure 1: Overview of data construction, policy generation, and performance evaluation.

Figure 2: Left: the optimality gaps of the random policy, the ICL policy, and the SFT policy (3,200 …

Figure 3: Optimality gaps versus the number of training tasks for APOMDPs. Planning horizon …

Figure 4: Empirical gap versus the measured in-context estimation error ε̂_Q across Λ, training length N, and support trajectory length M. All curves remain strictly below the dashed reference line C_{T,γ} √C_onp ε̂_Q. The shaded areas represent the 95% confidence intervals.

Figure 5: Optimality gaps of our approach versus DPT with 3,200 training tasks for MDP (left) and …

Figure 6: Optimality gaps of our approach versus DPT with 3,200 training tasks for APOMDP. In all …

Figure 7: Robustness of the fine-tuned LLM to the OOD test conditions with 3,200 training tasks for …

Figure 8: Optimality gaps for few-shot support trajectories generated by the optimal policy (blue) and …

Figure 9: Cumulative reward of the random policy, oracle, and our fine-tuned LLM for Darkroom …
Original abstract

Large language models (LLMs) have shown remarkable in-context learning (ICL) capabilities, yet their potential for sequential decision-making remains underexplored. In this paper, we study the ICL capabilities of LLMs in sequential decision-making settings, including Markov Decision Processes (MDPs), Partially Observable MDPs (POMDPs), and Ambiguous POMDPs (APOMDPs). We fine-tune pretrained LLMs to perform few-shot decision-making directly from offline, oracle-labeled trajectories. Our framework enables flexible imitation of policies through supervised fine-tuning (SFT). Theoretically, we focus on linear MDPs and interpret a fine-tuned attention layer as implicitly estimating optimal Q-functions from in-context data. Building on this interpretation, we derive an end-to-end suboptimality bound for the induced policy that separates the in-context estimation error from the training-length bias. Empirically, across synthetic MDP, POMDP, and APOMDP settings, we find that fine-tuned LLMs achieve substantially smaller optimality gaps than in-context-only and random baselines, with especially large gains in longer-horizon, partially observed, and model-ambiguous environments. Together, these results show that supervised fine-tuning provides an effective route to endowing pretrained LLMs with sequential decision-making capabilities from offline data, which is an important advantage in domains such as healthcare where offline data are abundant.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it: the pith above is the substance; this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper investigates enhancing the in-context learning capabilities of large language models for sequential decision-making tasks in MDPs, POMDPs, and APOMDPs by applying supervised fine-tuning on offline, oracle-labeled trajectories. It provides a theoretical analysis for linear MDPs by interpreting the fine-tuned attention layer as implicitly estimating optimal Q-functions, from which an end-to-end suboptimality bound is derived that separates in-context estimation error from training-length bias. Empirically, the fine-tuned models demonstrate smaller optimality gaps compared to in-context-only and random baselines across synthetic environments, with notable improvements in longer-horizon, partially observed, and ambiguous settings.

Significance. If the core interpretation holds and the bound is rigorously derived, this work could significantly advance the integration of LLMs into decision-making by providing both a practical method using SFT and a theoretical bound that explains the benefits. It highlights advantages in offline data regimes, which is relevant for real-world applications like healthcare. The empirical gains, if substantiated, suggest SFT as an effective route beyond pure ICL.

major comments (2)
  1. [Theoretical Analysis] Theoretical section: the suboptimality bound is derived by interpreting the fine-tuned attention layer as implicitly estimating optimal Q-functions from in-context data under the linear MDP assumption. No explicit construction is given showing that standard next-token SFT on trajectories induces attention outputs whose functional form matches the required inner-product estimation of Q* using linear features, rather than a generic policy approximator. This interpretation is load-bearing for the claimed separation of in-context estimation error from training-length bias.
  2. [Empirical Evaluation] Empirical evaluation: the abstract claims substantially smaller optimality gaps than in-context-only and random baselines across synthetic MDP/POMDP/APOMDP settings, but provides no details on experimental controls, data generation, baseline implementations, or statistical reporting. This prevents verification of whether the gains are robust or attributable to the SFT procedure.
minor comments (1)
  1. Clarify the notation for the components of the suboptimality bound (e.g., how in-context estimation error and training-length bias are formally defined and separated) to improve readability and allow independent verification of the derivation.
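As a template for what such a clarified statement might look like, a generic separated bound has the following shape. The notation here is schematic (borrowing the C_{T,γ} and ε̂_Q symbols that appear in Figure 4), not the paper's exact theorem:

```latex
V^{\pi^\ast}(s_1) - V^{\widehat{\pi}}(s_1)
\;\le\;
\underbrace{C_{T,\gamma}\,\widehat{\varepsilon}_{\widehat{Q}}}_{\text{in-context estimation error}}
\;+\;
\underbrace{b(N)}_{\text{training-length bias}},
\qquad b(N) \to 0 \text{ as } N \to \infty.
```

The referee's request amounts to asking for explicit definitions of both terms and a proof that no cross-term couples them.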

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their careful reading and constructive feedback. We address each major comment below and have revised the manuscript accordingly to improve clarity and completeness.

Point-by-point responses
  1. Referee: [Theoretical Analysis] Theoretical section: the suboptimality bound is derived by interpreting the fine-tuned attention layer as implicitly estimating optimal Q-functions from in-context data under the linear MDP assumption. No explicit construction is given showing that standard next-token SFT on trajectories induces attention outputs whose functional form matches the required inner-product estimation of Q* using linear features, rather than a generic policy approximator. This interpretation is load-bearing for the claimed separation of in-context estimation error from training-length bias.

    Authors: We thank the referee for this important observation. The derivation in Section 4 begins from the next-token SFT objective on action labels within offline trajectories and shows that, under the linear MDP feature assumption, the stationary point of the attention parameters satisfies the inner-product form for Q* estimation (see the expansion of the softmax attention output in Equation (8) and the subsequent bias-variance decomposition). This is not a generic policy approximator because the loss is taken only over action tokens conditioned on the in-context history, which forces the attention scores to align with the linear feature inner products that recover the optimal Q-function. To make the mapping fully explicit, we have added Lemma 4.2 and a short proof appendix that constructs the exact functional equivalence between the SFT minimizer and the required Q* estimator. This preserves the separation between in-context estimation error and training-length bias in the final bound. revision: partial

  2. Referee: [Empirical Evaluation] Empirical evaluation: the abstract claims substantially smaller optimality gaps than in-context-only and random baselines across synthetic MDP/POMDP/APOMDP settings, but provides no details on experimental controls, data generation, baseline implementations, or statistical reporting. This prevents verification of whether the gains are robust or attributable to the SFT procedure.

    Authors: We agree that additional experimental details are necessary for reproducibility and verification. In the revised manuscript we have expanded Section 5 with: (i) the precise procedure for generating offline oracle trajectories (including policy sampling, horizon lengths, and noise parameters for POMDPs/APOMDPs); (ii) full prompt templates and implementation of the in-context-only and random baselines; (iii) hyperparameter choices, number of training epochs, and environment dimensions; and (iv) statistical reporting consisting of mean optimality gap and standard deviation over 10 independent random seeds, together with paired t-test p-values against baselines. These additions confirm that the reported gains are robust and attributable to the SFT procedure rather than implementation artifacts. revision: yes
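The statistical reporting promised in this response (mean and standard deviation of the optimality gap over 10 seeds, plus a paired test against baselines) is straightforward to reproduce. A dependency-free sketch with hypothetical per-seed gap values:

```python
import math
from statistics import mean, stdev

def paired_t(gaps_a, gaps_b):
    """Paired t-statistic over matched per-seed optimality gaps."""
    diffs = [a - b for a, b in zip(gaps_a, gaps_b)]
    return mean(diffs) / (stdev(diffs) / math.sqrt(len(diffs)))

# Hypothetical per-seed optimality gaps (lower is better), 10 seeds each.
sft = [0.11, 0.09, 0.12, 0.10, 0.08, 0.13, 0.10, 0.09, 0.11, 0.12]
icl = [0.25, 0.22, 0.27, 0.24, 0.21, 0.26, 0.23, 0.25, 0.24, 0.26]

print(f"SFT gap: {mean(sft):.3f} +/- {stdev(sft):.3f}")
print(f"ICL gap: {mean(icl):.3f} +/- {stdev(icl):.3f}")
t = paired_t(sft, icl)  # strongly negative when SFT gaps are consistently smaller
```

Only the t-statistic is computed here; a p-value would come from the Student t distribution with n-1 degrees of freedom (e.g. scipy.stats.ttest_rel in a real pipeline). Pairing by seed matters because both methods see the same sampled tasks.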

Circularity Check

0 steps flagged

No significant circularity; the derivation relies on an interpretive modeling assumption rather than a self-referential reduction.

full rationale

The paper's theoretical section introduces an interpretation of a fine-tuned attention layer as implicitly estimating optimal Q-functions from in-context data in linear MDPs, then derives a suboptimality bound that separates in-context estimation error from training-length bias. This is a standard modeling step followed by mathematical derivation under the stated assumptions, not a case where the bound or result reduces by construction to the inputs (e.g., no equations shown equating the bound directly to fitted quantities or prior self-citations). No self-citation chains, fitted-input predictions, or ansatz smuggling are present in the abstract or described claims. The empirical results across MDP/POMDP settings provide independent validation outside the theoretical interpretation. The derivation chain is therefore self-contained.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the linear MDP assumption for the theoretical interpretation and bound, plus the availability of oracle-labeled offline trajectories; no free parameters or invented entities are explicitly introduced in the abstract.

axioms (1)
  • domain assumption Linear MDPs for the attention-layer interpretation and suboptimality bound
    Invoked to derive the end-to-end bound separating in-context estimation error from training-length bias.

pith-pipeline@v0.9.0 · 5558 in / 1347 out tokens · 57234 ms · 2026-05-12T02:35:42.660810+00:00 · methodology


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

32 extracted references · 32 canonical work pages · 5 internal anchors

  1. James Robins. A new approach to causal inference in mortality studies with a sustained exposure period—application to control of the healthy worker survivor effect. Mathematical Modelling, 7(9-12):1393–1512, 1986.
  2. Susan A Murphy. Optimal dynamic treatment regimes. Journal of the Royal Statistical Society Series B: Statistical Methodology, 65(2):331–355, 2003.
  3. Soroush Saghafian. Ambiguous partially observable Markov decision processes: Structural results and applications. Journal of Economic Theory, 178:1–35, 2018.
  4. Soroush Saghafian. Ambiguous dynamic treatment regimes: A reinforcement learning approach. Management Science, 70(9):5667–5690, 2024.
  5. Soroush Saghafian. Insight-Driven Problem Solving: Analytics Science to Improve the World. Cambridge University Press, 2025.
  6. Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Language models are few-shot learners. Advances in Neural Information Processing Systems, 33:1877–1901, 2020.
  7. Shivam Garg, Dimitris Tsipras, Percy S Liang, and Gregory Valiant. What can transformers learn in-context? A case study of simple function classes. Advances in Neural Information Processing Systems, 35:30583–30598, 2022.
  8. Ruiqi Zhang, Spencer Frei, and Peter L Bartlett. Trained transformers learn linear models in-context. Journal of Machine Learning Research, 25(49):1–55, 2024.
  9. Lili Chen, Kevin Lu, Aravind Rajeswaran, Kimin Lee, Aditya Grover, Misha Laskin, Pieter Abbeel, Aravind Srinivas, and Igor Mordatch. Decision transformer: Reinforcement learning via sequence modeling. Advances in Neural Information Processing Systems, 34:15084–15097, 2021.
  10. Michael Laskin, Luyu Wang, Junhyuk Oh, Emilio Parisotto, Stephen Spencer, Richie Steigerwald, DJ Strouse, Steven Hansen, Angelos Filos, Ethan Brooks, et al. In-context reinforcement learning with algorithm distillation. arXiv preprint arXiv:2210.14215, 2022.
  11. Jonathan Lee, Annie Xie, Aldo Pacchiano, Yash Chandak, Chelsea Finn, Ofir Nachum, and Emma Brunskill. Supervised pretraining can learn in-context reinforcement learning. Advances in Neural Information Processing Systems, 36:43057–43083, 2023.
  12. Soroush Saghafian and Lihi Idan. Effective generative AI: The human-algorithm centaur. Harvard Data Science Review, (Special Issue 5), 2024.
  13. Suvir Mirchandani, Fei Xia, Pete Florence, Brian Ichter, Danny Driess, Montserrat Gonzalez Arenas, Kanishka Rao, Dorsa Sadigh, and Andy Zeng. Large language models as general pattern machines. arXiv preprint arXiv:2307.04721, 2023.
  14. Siyan Zhao, Tung Nguyen, and Aditya Grover. Probing the decision boundaries of in-context learning in large language models. Advances in Neural Information Processing Systems, 37:130408–130432, 2024.
  15. Xiyuan Zhang, Ranak Roy Chowdhury, Rajesh K Gupta, and Jingbo Shang. Large language models for time series: A survey. arXiv preprint arXiv:2402.01801, 2024.
  16. Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran, Karthik Narasimhan, and Yuan Cao. ReAct: Synergizing reasoning and acting in language models. arXiv preprint arXiv:2210.03629, 2023.
  17. Wenlong Huang, Fei Xia, Ted Xiao, Harris Chan, Jacky Liang, Pete Florence, Andy Zeng, Jonathan Tompson, Igor Mordatch, Yevgen Chebotar, et al. Inner Monologue: Embodied reasoning through planning with language models. arXiv preprint arXiv:2207.05608, 2022.
  18. Scott Reed, Konrad Zolna, Emilio Parisotto, Sergio Gomez Colmenarejo, Alexander Novikov, Gabriel Barth-Maron, Mai Gimenez, Yury Sulsky, Jackie Kay, Jost Tobias Springenberg, et al. A generalist agent. arXiv preprint arXiv:2205.06175, 2022.
  19. Shuang Li, Xavier Puig, Chris Paxton, Yilun Du, Clinton Wang, Linxi Fan, Tao Chen, De-An Huang, Ekin Akyürek, Anima Anandkumar, et al. Pre-trained language models for interactive decision-making. Advances in Neural Information Processing Systems, 35:31199–31212, 2022.
  20. Yuji Cao, Huan Zhao, Yuheng Cheng, Ting Shu, Yue Chen, Guolong Liu, Gaoqi Liang, Junhua Zhao, Jinyue Yan, and Yun Li. Survey on large language model-enhanced reinforcement learning: Concept, taxonomy, and methods. IEEE Transactions on Neural Networks and Learning Systems, 36(6):9737–9757, 2024.
  21. Tim Dettmers, Artidoro Pagnoni, Ari Holtzman, and Luke Zettlemoyer. QLoRA: Efficient finetuning of quantized LLMs. Advances in Neural Information Processing Systems, 36:10088–10115, 2023.
  22. Arlen Dean, Agni Orfanoudaki, Soroush Saghafian, Karen Song, Harini A Chakkera, and Curtiss Cook. Algorithm, human, or the centaur: How to enhance clinical care? HKS Working Paper No. RWP22-027, 2022.
  23. Jiaqi Zhang, Joel Jennings, Agrin Hilmkil, Nick Pawlowski, Cheng Zhang, and Chao Ma. Towards causal foundation model: On duality between causal inference and attention. arXiv preprint arXiv:2310.00809, 2023.
  24. Xiaoyu Liu, Paiheng Xu, Junda Wu, Jiaxin Yuan, Yifan Yang, Yuhang Zhou, Fuxiao Liu, Tianrui Guan, Haoliang Wang, Tong Yu, et al. Large language models and causal inference in collaboration: A comprehensive survey. Findings of the Association for Computational Linguistics: NAACL 2025, pages 7668–7684, 2025.
  25. Sergey Levine, Aviral Kumar, George Tucker, and Justin Fu. Offline reinforcement learning: Tutorial, review, and perspectives on open problems. arXiv preprint arXiv:2005.01643, 2020.
  26. Aviral Kumar, Aurick Zhou, George Tucker, and Sergey Levine. Conservative Q-learning for offline reinforcement learning. Advances in Neural Information Processing Systems, 33:1179–1191, 2020.
  27. Ying Jin, Zhuoran Yang, and Zhaoran Wang. Is pessimism provably efficient for offline RL? International Conference on Machine Learning, pages 5084–5096, 2021.
  28. Scott Fujimoto and Shixiang Shane Gu. A minimalist approach to offline reinforcement learning. Advances in Neural Information Processing Systems, 34:20132–20145, 2021.
  29. Yan Duan, John Schulman, Xi Chen, Peter L Bartlett, Ilya Sutskever, and Pieter Abbeel. RL²: Fast reinforcement learning via slow reinforcement learning. arXiv preprint arXiv:1611.02779, 2016.
  30. Jane X Wang, Zeb Kurth-Nelson, Dhruva Tirumala, Hubert Soyer, Joel Z Leibo, Rémi Munos, Charles Blundell, Dharshan Kumaran, and Matt Botvinick. Learning to reinforcement learn. arXiv preprint arXiv:1611.05763, 2016.
  31. Chelsea Finn, Pieter Abbeel, and Sergey Levine. Model-agnostic meta-learning for fast adaptation of deep networks. International Conference on Machine Learning, pages 1126–1135, 2017.
  32. Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, et al. Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288, 2023.