pith. machine review for the scientific record.

arxiv: 2605.12379 · v1 · submitted 2026-05-12 · 💻 cs.LG · cs.AI

Recognition: 2 theorem links · Lean Theorem

Discrete Flow Matching for Offline-to-Online Reinforcement Learning

Fairoz Nower Khan, Nabuat Zaman Nahim, Peizhong Ju

Pith reviewed 2026-05-13 05:41 UTC · model grok-4.3

classification 💻 cs.LG cs.AI
keywords discrete flow matching · offline-to-online RL · continuous-time Markov chain · path-space penalty · candidate-set approximation · Jericho · reinforcement learning

The pith

A path-space penalty on full trajectories lets discrete RL policies improve online while retaining offline knowledge.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper develops DRIFT to enable stable improvement when fine-tuning reinforcement learning policies from offline data to online interaction in discrete action spaces. It does so by updating a pretrained continuous-time Markov chain policy with an advantage-weighted discrete flow matching objective and introducing a path-space penalty that regularizes the complete trajectory distribution. For large action spaces, the method samples a small candidate set of actions from reference rollouts and uniform exploration. The authors show through theory and experiments that this yields stable gains across tasks, including the top average score on Jericho with a basic GRU encoder, surpassing approaches built on pretrained language models. Readers should care because it suggests simpler models can succeed where more complex ones are currently used, especially when retaining prior knowledge is critical.

Core claim

The central claim is that updating an offline-pretrained CTMC policy via advantage-weighted discrete flow matching, combined with a path-space penalty on the full trajectory distribution and a candidate-set approximation for large action spaces, enables stable offline-to-online RL. The path-space penalty preserves useful knowledge by acting on the entire trajectory distribution rather than on final actions alone. Theoretical results establish that the candidate-set error is bounded by the missing target probability mass and that the induced generator error decreases as the candidate set covers more high-probability actions. Experiments report consistent improvement on discrete-action RL benchmarks and superior Jericho scores with a simple encoder.
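A minimal sketch of how one such update step could be assembled, assuming a CTMC policy parameterized by per-flow-step action logits; the helper names, the exp(advantage/β) weighting, and the discrete-time stand-in for the path-space term are illustrative assumptions, not the paper's exact implementation.

```python
# Illustrative sketch of an advantage-weighted discrete flow matching update
# with a trajectory-level regularizer. Shapes, names, and the exp(A/beta)
# weighting are assumptions for exposition, not the paper's exact objective.
import torch
import torch.nn.functional as F

def advantage_weighted_dfm_loss(policy_logits, target_actions, advantages, beta=0.4):
    """Cross-entropy to the target action at each flow step, weighted by exp(A/beta).

    policy_logits:  (batch, n_steps, n_actions) per-step logits of the CTMC policy.
    target_actions: (batch, n_steps) actions along the reference/target flow path.
    advantages:     (batch,) critic advantages of the sampled actions.
    """
    weights = torch.exp(advantages / beta).clamp(max=100.0)           # (batch,)
    logp = F.log_softmax(policy_logits, dim=-1)                        # (batch, T, A)
    nll = -logp.gather(-1, target_actions.unsqueeze(-1)).squeeze(-1)   # (batch, T)
    return (weights.unsqueeze(1) * nll).mean()

def path_space_penalty(policy_logits, ref_logits):
    """KL between fine-tuned and pretrained per-step transition distributions,
    summed along the flow trajectory -- a discrete-time stand-in for a penalty
    on the full CTMC trajectory distribution."""
    logp = F.log_softmax(policy_logits, dim=-1)
    logq = F.log_softmax(ref_logits, dim=-1)
    return (logp.exp() * (logp - logq)).sum(dim=-1).sum(dim=-1).mean()

# One fine-tuning step on toy tensors standing in for rollout data.
batch, n_steps, n_actions = 32, 10, 64
policy_logits = torch.randn(batch, n_steps, n_actions, requires_grad=True)
ref_logits = torch.randn(batch, n_steps, n_actions)            # frozen pretrained policy
target_actions = torch.randint(n_actions, (batch, n_steps))
advantages = torch.randn(batch)

alpha = 0.1  # penalty weight (a tunable hyperparameter)
loss = advantage_weighted_dfm_loss(policy_logits, target_actions, advantages) \
       + alpha * path_space_penalty(policy_logits, ref_logits)
loss.backward()
```

The sketch collapses the CTMC into a fixed number of discrete flow steps; the paper's objective acts on the continuous-time generator, so treat this only as a picture of how the three ingredients (advantage weighting, flow-matching target, trajectory-level regularizer) fit together.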

What carries the argument

the path-space penalty that regularizes the full CTMC trajectory distribution to preserve pretrained knowledge during online updates
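As a concrete picture of what a path-space penalty on a CTMC can mean, the KL divergence between the path measures of two continuous-time Markov chains with rate matrices $q^{\theta}_t$ (fine-tuned) and $q^{\mathrm{ref}}_t$ (pretrained) over the flow interval $[0,1]$ has the standard Girsanov-type form below; whether DRIFT's penalty is exactly this expression is an assumption of this sketch, not something the summary above states.

$$
\mathrm{KL}\!\left(\mathbb{P}^{\theta}\,\middle\|\,\mathbb{P}^{\mathrm{ref}}\right)
= \mathbb{E}_{\mathbb{P}^{\theta}}\!\left[\int_{0}^{1}\sum_{y\neq X_t}
\left(q^{\theta}_t(X_t,y)\,\log\frac{q^{\theta}_t(X_t,y)}{q^{\mathrm{ref}}_t(X_t,y)}
- q^{\theta}_t(X_t,y) + q^{\mathrm{ref}}_t(X_t,y)\right)dt\right]
+ \mathrm{KL}\!\left(p^{\theta}_0\,\middle\|\,p^{\mathrm{ref}}_0\right),
$$

where the final term vanishes when both chains start from the same source distribution. Because the expectation runs over entire trajectories, keeping this quantity small is strictly stronger than matching the time-1 (final-action) marginals, which is the sense in which a path-space penalty acts on the whole distribution rather than on final actions alone.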

If this is right

  • Stable improvement from offline to online across all tasks tested
  • Highest average performance on the Jericho benchmark using a simple GRU encoder
  • Outperforms language model based methods on the same tasks
  • The path-space penalty stays bounded throughout fine-tuning
  • The CTMC generator adapts to reward changes more quickly than deterministic alternatives

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • The approach may extend to other discrete control problems where retaining offline behaviors is important during adaptation.
  • If candidate coverage is high, the approximation could enable scaling to even larger action spaces with minimal loss in accuracy.

Load-bearing premise

Regularizing the full trajectory distribution via the path-space penalty is enough to prevent loss of useful pretrained behaviors when the policy starts interacting online.

What would settle it

Running the fine-tuning without the path-space penalty and observing degraded performance on tasks that rely on offline knowledge would falsify the preservation claim; the coverage claim would likewise fail if generator error did not decrease as larger candidate sets captured more probability mass.

Figures

Figures reproduced from arXiv: 2605.12379 by Fairoz Nower Khan, Nabuat Zaman Nahim, Peizhong Ju.

Figure 1. Three empirical motivations for our method.

Figure 2. Full hyperparameter ablation on Toy 5 (|A|=64, 3 seeds). Top: mean evaluation reward (higher is better). Bottom: cold-goal visits out of 100 evaluation episodes (higher = better transfer to unseen goals). Dark bars mark the best value in each sweep. (a) KL weight: α=0 best (no distribution shift in this env). (b) CTMC steps: sharp quality threshold at M=10. (c) Temperature: β=0.4 best balances sharpness an…

Figure 3. Cold-start ablation on Toy 5. (a) Mean reward: remo…

Figure 4. Path-space KL divergence during online fine-tuning…

Figure 5. Toy 3 results. (a) Mean reward: DRIFT and DQN both re…

Figure 6. Toy 4 results (12×12, |A|=128). (a) Candidate-set phase transition: performance saturates above 16% coverage (dashed line); V*=9.5 shown dotted. (b) Method comparison: DRIFT pre-trained achieves 9.52 without any online interaction; DQN offline scores 0.00. (c) Goal diversity: DRIFT visits all three goals; DQN mode-collapses…

Figure 7. Goal-Switch results. (a) Steps to reach the switch…
read the original abstract

Many reinforcement learning (RL) tasks have discrete action spaces, but most generative policy methods based on diffusion and flow matching are designed for continuous control. Meanwhile, generative policies usually rely heavily on offline datasets and offline-to-online RL is itself challenging, as the policy must improve from new interaction without losing useful behavior learned from static data. To address those challenges, we introduce DRIFT, an online fine-tuning method that updates an offline pretrained continuous-time Markov chain (CTMC) policy with an advantage-weighted discrete flow matching loss. To preserve useful pretrained knowledge, we add a path-space penalty that regularizes the full CTMC trajectory distribution, rather than only the final action distribution. For large discrete action spaces, we introduce a candidate-set approximation that updates the actor over a small subset of actions sampled from reference-policy rollouts and uniform exploration. Our theoretical analysis shows that the candidate-set error is controlled by missing target probability mass, and the induced CTMC generator error decreases as the candidate set covers more high-probability actions. Experiments on prevailing discrete action RL task show that our method provides stable offline-to-online improvement across all tasks, achieving the highest average score on Jericho with a simple GRU encoder while outperforming methods that use pretrained language models. Controlled experiments further confirm that the path-space penalty remains bounded during fine-tuning and that the CTMC generator adapts to shifted rewards faster than deterministic baselines. The candidate-set mechanism is supported by a stability analysis showing that the generator error decreases exponentially with candidate coverage.
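The candidate-set mechanism can be pictured in a few lines: draw candidates from reference-policy rollouts plus uniform exploration, restrict the target distribution to that set, and track the target mass left outside it, which is the quantity the stated error bound is said to control. The 50/50 sampling split and the renormalization below are assumptions made for this sketch, not the paper's exact procedure.

```python
# Illustrative candidate-set approximation for a large discrete action space.
# The split between reference-rollout samples and uniform exploration, and the
# renormalization of the restricted target, are assumptions of this sketch.
import numpy as np

def build_candidate_set(ref_action_counts, n_actions, k_ref=16, k_uniform=16, rng=None):
    """Union of frequent reference-policy actions and uniformly drawn actions."""
    if rng is None:
        rng = np.random.default_rng(0)
    ref_top = np.argsort(ref_action_counts)[::-1][:k_ref]           # frequent under the reference policy
    uniform = rng.choice(n_actions, size=k_uniform, replace=False)  # blind exploration
    return np.unique(np.concatenate([ref_top, uniform]))

def restrict_target(target_probs, candidates):
    """Restrict and renormalize the target distribution to the candidate set;
    also return the missing mass that the stated bound is said to control."""
    mass_in = target_probs[candidates].sum()
    missing_mass = 1.0 - mass_in
    restricted = np.zeros_like(target_probs)
    restricted[candidates] = target_probs[candidates] / max(mass_in, 1e-12)
    return restricted, missing_mass

# Toy example with |A| = 1024.
n_actions = 1024
rng = np.random.default_rng(7)
target_probs = rng.dirichlet(np.full(n_actions, 0.05))   # peaked target distribution
ref_counts = rng.poisson(target_probs * 10_000)           # reference rollouts roughly track the target
cands = build_candidate_set(ref_counts, n_actions, rng=rng)
restricted, missing = restrict_target(target_probs, cands)
print(f"|candidates| = {len(cands)}, missing target mass = {missing:.4f}")
```

On this toy example the missing mass is small because the reference rollouts concentrate on the same actions as the peaked target; the referee's worry, raised later in the report, is precisely that this alignment can erode once online fine-tuning shifts the target distribution.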

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces DRIFT, a method for offline-to-online RL in discrete action spaces. It pretrains a continuous-time Markov chain (CTMC) policy via discrete flow matching on offline data, then fine-tunes it online using an advantage-weighted discrete flow matching loss. A path-space penalty regularizes the full trajectory distribution to preserve pretrained knowledge, while a candidate-set approximation (sampling from offline rollouts plus uniform exploration) enables scaling to large action spaces. Theoretical analysis claims the candidate-set error is bounded by missing target probability mass and decreases exponentially with coverage of high-probability actions; experiments report stable improvement across tasks, highest average score on Jericho using a simple GRU encoder, and outperformance of pretrained-LM baselines, with controlled experiments confirming bounded path-space penalty and faster adaptation than deterministic baselines.

Significance. If the theoretical error bounds and experimental stability hold under policy shift, the work would provide a principled mechanism for stable fine-tuning of generative discrete policies without catastrophic forgetting of offline behavior. The path-space regularization and candidate-set analysis could influence future discrete-action generative RL methods, particularly in text-based or combinatorial domains where pretrained knowledge is valuable.

major comments (2)
  1. [Theoretical analysis] Theoretical analysis (as summarized in the abstract and §3): the candidate-set approximation error is claimed to be controlled by missing target probability mass with exponential decrease in generator error as coverage of high-probability actions increases. However, candidates are sampled once from the fixed offline reference policy; the analysis does not address how the bound evolves when online fine-tuning shifts the target distribution and new high-advantage actions appear outside the initial candidate set. This directly affects the stability guarantee for offline-to-online improvement.
  2. [Experiments] §4 (Experiments): the claim of stable improvement and highest Jericho score is presented without reported tables of per-task returns, variance across seeds, or direct comparison metrics against the pretrained-LM baselines. The controlled experiments on path-space penalty boundedness and faster CTMC adaptation are referenced but lack quantitative details (e.g., penalty values over fine-tuning steps or adaptation speed metrics) needed to verify the supporting claims.
minor comments (2)
  1. [Method] Notation for the CTMC generator and flow-matching loss should be introduced with explicit equations early in the method section to aid readability.
  2. [Abstract] The abstract states 'prevailing discrete action RL task' (singular); this should be corrected to 'tasks' for grammatical consistency.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the thoughtful and detailed review of our manuscript. The comments highlight important aspects of the theoretical analysis and experimental presentation that we will address in the revision. We respond to each major comment below.

read point-by-point responses
  1. Referee: [Theoretical analysis] Theoretical analysis (as summarized in the abstract and §3): the candidate-set approximation error is claimed to be controlled by missing target probability mass with exponential decrease in generator error as coverage of high-probability actions increases. However, candidates are sampled once from the fixed offline reference policy; the analysis does not address how the bound evolves when online fine-tuning shifts the target distribution and new high-advantage actions appear outside the initial candidate set. This directly affects the stability guarantee for offline-to-online improvement.

    Authors: We appreciate this observation regarding the scope of the analysis. Section 3 derives a bound on the candidate-set approximation error for a given target distribution, showing that the error depends on the missing probability mass and that the induced CTMC generator error decreases exponentially with increased coverage of high-probability actions. The analysis is stated for a fixed target, and we agree that it does not explicitly characterize how the bound changes as the target distribution shifts during online fine-tuning or when new high-advantage actions emerge outside the initial candidate set. In the revised manuscript we will add a clarifying paragraph in §3 that states this assumption explicitly and discusses its implications for offline-to-online transfer. We will also include additional empirical results demonstrating that the approximation remains stable in practice as the policy adapts. A complete dynamic analysis of the bound under shifting targets is beyond the current scope and is noted as future work. revision: partial

  2. Referee: [Experiments] §4 (Experiments): the claim of stable improvement and highest Jericho score is presented without reported tables of per-task returns, variance across seeds, or direct comparison metrics against the pretrained-LM baselines. The controlled experiments on path-space penalty boundedness and faster CTMC adaptation are referenced but lack quantitative details (e.g., penalty values over fine-tuning steps or adaptation speed metrics) needed to verify the supporting claims.

    Authors: We agree that the experimental section would be strengthened by more detailed quantitative reporting. In the revised manuscript we will expand §4 to include tables of per-task returns (means and standard deviations across seeds) together with explicit numerical comparison metrics against the pretrained-LM baselines. For the controlled experiments we will add figures and tables reporting the path-space penalty values at successive fine-tuning steps as well as quantitative adaptation-speed metrics (e.g., steps required to reach performance thresholds). These additions will directly support the claims of stability and faster adaptation. revision: yes

Circularity Check

0 steps flagged

No circularity: derivation chain remains self-contained

full rationale

The paper derives the candidate-set error bound directly from CTMC generator properties and missing target mass, presented as an independent theoretical result rather than a redefinition of fitted quantities. The path-space penalty is introduced as an explicit regularization on the full trajectory distribution, and the advantage-weighted discrete flow matching loss is a forward update rule. No equation reduces by construction to its own inputs, no prediction is statistically forced by a prior fit, and no load-bearing uniqueness theorem or ansatz is smuggled via self-citation. The offline-to-online stability claims rest on the stated coverage assumption and experimental verification, keeping the chain independent of its outputs.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract provides insufficient technical detail to enumerate concrete free parameters, axioms, or invented entities; the method name DRIFT itself is the primary new construct introduced.

pith-pipeline@v0.9.0 · 5570 in / 1112 out tokens · 121372 ms · 2026-05-13T05:41:20.238588+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

Reference graph

Works this paper leans on

64 extracted references · 64 canonical work pages · 10 internal anchors
