pith · machine review for the scientific record

arxiv: 2005.01643 · v3 · submitted 2020-05-04 · 💻 cs.LG · cs.AI · stat.ML

Recognition: 3 theorem links · Lean Theorem

Offline Reinforcement Learning: Tutorial, Review, and Perspectives on Open Problems

Authors on Pith: no claims yet

Pith reviewed 2026-05-11 11:27 UTC · model grok-4.3

classification: 💻 cs.LG · cs.AI · stat.ML
keywords: offline reinforcement learning · deep reinforcement learning · policy optimization · static datasets · decision making · reinforcement learning challenges · open problems in RL

The pith

Offline reinforcement learning can extract maximum-utility policies from fixed datasets without new data collection.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Offline reinforcement learning trains decision policies solely from previously gathered data, avoiding any further interaction with the environment during learning. This setup promises to convert large existing datasets into effective automated decision systems across domains such as healthcare, education, and robotics. Current algorithms face limitations that prevent full extraction of high-utility policies, especially when using deep neural networks. The paper supplies conceptual tools to understand these issues, reviews solutions explored in recent studies, covers example applications, and outlines remaining open problems.
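
To make the setup concrete, here is a minimal sketch of learning from a static dataset: a toy fitted Q-iteration loop over logged transitions in which the environment is never queried during training. The MDP sizes and synthetic dataset are illustrative stand-ins, not anything from the paper.

```python
import numpy as np

# Toy offline Q-iteration: learn Q from a fixed dataset of transitions,
# never interacting with the environment. Discrete states and actions.
rng = np.random.default_rng(0)
n_states, n_actions, gamma, lr = 5, 2, 0.9, 0.1

# Static dataset D = {(s, a, r, s')} logged by some unknown behavior policy.
dataset = [(rng.integers(n_states), rng.integers(n_actions),
            rng.normal(), rng.integers(n_states)) for _ in range(1000)]

Q = np.zeros((n_states, n_actions))
for _ in range(200):                          # repeated sweeps over the log
    for s, a, r, s_next in dataset:
        target = r + gamma * Q[s_next].max()  # bootstrap from current Q
        Q[s, a] += lr * (target - Q[s, a])    # TD update on logged data only

policy = Q.argmax(axis=1)                     # greedy policy extracted offline
print("greedy actions per state:", policy)
```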

Core claim

Offline reinforcement learning algorithms hold promise for turning large datasets into powerful decision-making engines by extracting policies with the maximum possible utility from the available data. Effective methods would enable automation across decision-making domains from healthcare to robotics. Limitations of current algorithms, particularly with modern deep reinforcement learning, make this extraction difficult. The work describes these challenges, potential mitigating solutions from recent research, applications, and perspectives on open problems.

What carries the argument

Offline reinforcement learning, the paradigm that optimizes policies using only a static dataset of past experiences without any further online interaction or data gathering.
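
In standard notation (an assumed formalization, consistent with the abstract rather than quoted from it): the learner receives a dataset logged by an unknown behavior policy and must output a policy maximizing expected return, with no further sampling allowed.

```latex
% Offline RL: fixed dataset from an unknown behavior policy, no interaction
\mathcal{D} = \{(s_i, a_i, r_i, s'_i)\}_{i=1}^{N}, \qquad a_i \sim \pi_\beta(\cdot \mid s_i)

\pi^{\star} = \arg\max_{\pi} \; \mathbb{E}_{\tau \sim p_{\pi}}\!\left[ \sum_{t=0}^{\infty} \gamma^{t}\, r(s_t, a_t) \right]
\quad \text{(optimized using } \mathcal{D} \text{ alone)}
```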

If this is right

  • Large static datasets from real-world logs can train agents for healthcare or robotics decisions without risky new interactions.
  • Policy optimization can proceed purely from recorded trajectories, separating data collection from learning.
  • Automation of decision domains becomes feasible once limitations are addressed through the reviewed techniques.
  • Research can focus on open problems to improve utility extraction from fixed data sources.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • This approach could enable safer deployment of learned policies in settings where online exploration carries high cost or danger.
  • It opens connections to large-scale supervised learning on logged decision data from production systems.
  • Open problems identified may direct attention toward handling distribution shifts between dataset and deployment conditions (a representative mitigation is sketched below).
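
One representative mitigation family in this literature penalizes value estimates on actions the dataset does not cover. The snippet below sketches a CQL-style conservative loss for discrete actions; `q_net` and `batch` are hypothetical stand-ins, and this is an illustration of the idea rather than the paper's own method.

```python
import torch
import torch.nn.functional as F

def conservative_q_loss(q_net, batch, gamma=0.99, alpha=1.0):
    """TD loss plus a conservative penalty that pushes Q down on all
    actions while pushing it back up on actions seen in the dataset."""
    s, a, r, s_next = batch                        # tensors from the static log
    q_all = q_net(s)                               # shape [B, n_actions]
    q_data = q_all.gather(1, a.unsqueeze(1)).squeeze(1)

    with torch.no_grad():                          # standard bootstrapped target
        target = r + gamma * q_net(s_next).max(dim=1).values

    td_loss = F.mse_loss(q_data, target)
    # logsumexp upper-bounds the max over actions; minimizing it deflates
    # optimistic values on out-of-distribution actions.
    penalty = (torch.logsumexp(q_all, dim=1) - q_data).mean()
    return td_loss + alpha * penalty
```

The alpha coefficient trades pessimism against fit: too small and out-of-distribution overestimation persists, too large and the learned policy collapses onto the dataset's action support.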

Load-bearing premise

That the limitations of current offline algorithms can be overcome by the solutions explored in recent work, enabling effective extraction of maximum-utility policies from available data.

What would settle it

A controlled benchmark where applying all described mitigation techniques still yields offline policies that fall short of online reinforcement learning baselines on standard control tasks would undermine the premise; parity with those baselines would support it.
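
Concretely, such a test could be scored as a D4RL-style normalized comparison, where 0 corresponds to a random policy and 100 to an expert. The task name and raw returns below are illustrative placeholders; in a real study they would come from rollouts of the trained policies.

```python
def normalized_score(raw, random_score, expert_score):
    """D4RL-style normalization: 0 = random policy, 100 = expert policy."""
    return 100.0 * (raw - random_score) / (expert_score - random_score)

# Placeholder returns for one hypothetical task.
results = {"halfcheetah-medium": dict(offline=4500.0, online=6000.0,
                                      random=-280.0, expert=12135.0)}
for task, r in results.items():
    off = normalized_score(r["offline"], r["random"], r["expert"])
    on = normalized_score(r["online"], r["random"], r["expert"])
    print(f"{task}: offline={off:.1f} online={on:.1f} gap={on - off:.1f}")
```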

read the original abstract

In this tutorial article, we aim to provide the reader with the conceptual tools needed to get started on research on offline reinforcement learning algorithms: reinforcement learning algorithms that utilize previously collected data, without additional online data collection. Offline reinforcement learning algorithms hold tremendous promise for making it possible to turn large datasets into powerful decision making engines. Effective offline reinforcement learning methods would be able to extract policies with the maximum possible utility out of the available data, thereby allowing automation of a wide range of decision-making domains, from healthcare and education to robotics. However, the limitations of current algorithms make this difficult. We will aim to provide the reader with an understanding of these challenges, particularly in the context of modern deep reinforcement learning methods, and describe some potential solutions that have been explored in recent work to mitigate these challenges, along with recent applications, and a discussion of perspectives on open problems in the field.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated author's rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

0 major / 1 minor

Summary. The paper is a tutorial and review on offline reinforcement learning algorithms that use previously collected data without additional online data collection. It aims to equip readers with conceptual tools to start research in this area, emphasizing the promise of turning large datasets into powerful decision-making engines for domains like healthcare, education, and robotics. The manuscript discusses limitations of current algorithms, particularly in deep RL, potential solutions from recent work, applications, and perspectives on open problems.

Significance. This review could be significant for the field by providing a consolidated overview and highlighting open problems, potentially guiding future research in data-driven RL. As a tutorial from active researchers, it offers reliable conceptual framing of the core promise and difficulties of offline RL.

minor comments (1)
  1. [Abstract] The statement that the paper will 'describe some potential solutions that have been explored in recent work' is vague on scope and examples; a brief enumeration of the main approaches covered would improve reader orientation without altering the tutorial structure.

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for the positive summary of our tutorial on offline reinforcement learning, the assessment of its potential significance for the field, and the recommendation for minor revision. We are pleased that the manuscript is viewed as providing reliable conceptual framing and highlighting open problems to guide future research.

Circularity Check

0 steps flagged

No significant circularity: review paper with no derivations or self-referential claims

full rationale

This is a tutorial and review paper that catalogs existing offline RL methods, challenges, and open problems from the literature without presenting any new derivations, equations, fitted parameters, or predictions. No load-bearing steps reduce to self-citations or definitions by construction; all claims are descriptive summaries of prior work. The manuscript is self-contained as a survey and does not introduce novel results that could exhibit circularity.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

As a review and tutorial, the paper introduces no new free parameters, axioms, or invented entities; it summarizes prior offline RL research.

pith-pipeline@v0.9.0 · 5452 in / 946 out tokens · 42328 ms · 2026-05-11T11:27:57.485356+00:00 · methodology


Forward citations

Cited by 60 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Lightning OPD: Efficient Post-Training for Large Reasoning Models with Offline On-Policy Distillation

    cs.LG 2026-04 unverdicted novelty 8.0

    Lightning OPD enforces teacher consistency by precomputing log-probabilities over SFT rollouts, matching standard OPD performance with bounded gradient discrepancy and achieving 4x speedup on math and code reasoning tasks.

  2. D4RL: Datasets for Deep Data-Driven Reinforcement Learning

    cs.LG 2020-04 accept novelty 8.0

    D4RL supplies new offline RL benchmarks and datasets from expert and mixed sources to expose weaknesses in existing algorithms and standardize evaluation.

  3. Distributionally Robust Multi-Task Reinforcement Learning via Adaptive Task Sampling

    cs.LG 2026-05 unverdicted novelty 7.0

    DRATS derives a minimax objective from a feasibility formulation of MTRL to adaptively sample tasks with the largest return gaps, leading to better worst-task performance on MetaWorld benchmarks.

  4. Bridging Domain Gaps with Target-Aligned Generation for Offline Reinforcement Learning

    cs.LG 2026-05 unverdicted novelty 7.0

    TCE bridges domain gaps in offline RL by selectively using source data or generating target-aligned transitions via a dual score-based model, outperforming baselines in experiments.

  5. Aligning Flow Map Policies with Optimal Q-Guidance

    cs.LG 2026-05 unverdicted novelty 7.0

    Flow map policies enable fast one-step inference for flow-based RL policies, and FMQ provides an optimal closed-form Q-guided target for offline-to-online adaptation under trust-region constraints, achieving SOTA performance.

  6. TokenRatio: Principled Token-Level Preference Optimization via Ratio Matching

    cs.CL 2026-05 unverdicted novelty 7.0

    TBPO posits a token-level Bradley-Terry model and derives a Bregman-divergence density-ratio matching loss that generalizes DPO while preserving token-level optimality.

  7. Sample-Mean Anchored Thompson Sampling for Offline-to-Online Learning with Distribution Shift

    cs.LG 2026-05 unverdicted novelty 7.0

    Anchor-TS defines arm indices as the median of an online posterior sample, a hybrid posterior sample, and the online sample mean to correct distribution-shift bias and safely accelerate online learning with offline data.

  8. Sample-Mean Anchored Thompson Sampling for Offline-to-Online Learning with Distribution Shift

    cs.LG 2026-05 unverdicted novelty 7.0

    Anchor-TS corrects bias from distribution shift in offline-to-online bandits by taking the median of an online posterior sample, a hybrid posterior sample, and the online sample mean.

  9. Fast Rates for Offline Contextual Bandits with Forward-KL Regularization under Single-Policy Concentrability

    cs.LG 2026-05 unverdicted novelty 7.0

    The paper establishes the first tilde O(epsilon^{-1}) upper bounds and matching lower bounds for forward-KL-regularized offline contextual bandits under single-policy concentrability in both tabular and general functi...

  10. Learning When to Stop: Selective Imitation Learning Under Arbitrary Dynamics Shift

    cs.LG 2026-05 unverdicted novelty 7.0

    SeqRejectron builds a stopping rule from a small set of validator policies to achieve horizon-free sample-complexity guarantees for selective imitation learning under arbitrary train-test dynamics shifts.

  11. Quantile-Coupled Flow Matching for Distributional Reinforcement Learning

    cs.LG 2026-05 conditional novelty 7.0

    FlowIQN is a quantile-coupled CFM critic that yields the first explicit Wasserstein-aligned approximate projection for distributional RL, with improved return-distribution accuracy and competitive offline RL performance.

  12. Zero-shot Imitation Learning by Latent Topology Mapping

    cs.LG 2026-05 unverdicted novelty 7.0

    ZALT learns latent hub states and hub-to-hub dynamics from demonstrations to plan zero-shot solutions for unseen start-goal tasks, achieving 55% success in a 3D maze versus 6% for baselines.

  13. Learning to Communicate Locally for Large-Scale Multi-Agent Pathfinding

    cs.AI 2026-05 unverdicted novelty 7.0

    LC-MAPF uses multi-round local communication between neighboring agents in a pre-trained model to outperform prior learning-based MAPF solvers on diverse unseen scenarios while preserving scalability.

  14. Learning Visual Feature-Based World Models via Residual Latent Action

    cs.CV 2026-05 unverdicted novelty 7.0

    RLA-WM predicts residual latent actions via flow matching to create visual feature world models that outperform prior feature-based and diffusion approaches while enabling offline video-based robot RL.

  15. Path-Coupled Bellman Flows for Distributional Reinforcement Learning

    cs.LG 2026-05 unverdicted novelty 7.0

    Path-Coupled Bellman Flows use source-consistent Bellman-coupled paths and a lambda-parameterized control-variate to learn return distributions via flow matching, improving fidelity and stability over prior DRL approaches.

  16. Dynamic Treatment on Networks

    stat.ML 2026-05 unverdicted novelty 7.0

    Q-Ising integrates Bayesian dynamic Ising modeling with offline RL to enable adaptive network treatment policies that outperform static centrality benchmarks under spillovers.

  17. Operator-Guided Invariance Learning for Continuous Reinforcement Learning

    cs.LG 2026-05 unverdicted novelty 7.0

    VPSD-RL discovers exact and approximate value-preserving Lie-group operators in continuous RL to stabilize learning via transition augmentation and consistency regularization.

  18. SOPE: Stabilizing Off-Policy Evaluation for Online RL with Prior Data

    cs.LG 2026-05 unverdicted novelty 7.0

    SOPE uses an actor-aligned OPE signal on a held-out validation split to dynamically stop offline stabilization phases in online RL, improving performance up to 45.6% and cutting TFLOPs up to 22x on 25 Minari tasks.

  19. Beyond Penalization: Diffusion-based Out-of-Distribution Detection and Selective Regularization in Offline Reinforcement Learning

    cs.LG 2026-05 unverdicted novelty 7.0

    DOSER detects OOD actions via diffusion-model denoising error and applies selective regularization based on predicted transitions, proving gamma-contraction with performance bounds and outperforming priors on offline ...

  20. Adaptive Estimation and Optimal Control in Offline Contextual MDPs without Stationarity

    stat.ML 2026-05 unverdicted novelty 7.0

    A T-estimation-based procedure for adaptive density estimation and optimal control in offline contextual MDPs without stationarity, providing oracle risk bounds under two loss functions and finite-sample cost guarantees.

  21. Towards Efficient and Expressive Offline RL via Flow-Anchored Noise-conditioned Q-Learning

    cs.LG 2026-05 unverdicted novelty 7.0

    FAN achieves state-of-the-art offline RL performance on robotic tasks by anchoring flow policies and using single-sample noise-conditioned Q-learning, with proven convergence and reduced runtimes.

  22. CoFlow: Coordinated Few-Step Flow for Offline Multi-Agent Decision Making

    cs.AI 2026-05 unverdicted novelty 7.0

    CoFlow achieves state-of-the-art coordination in offline MARL using single-pass joint velocity fields with Coordinated Velocity Attention and Adaptive Coordination Gating.

  23. CoFlow: Coordinated Few-Step Flow for Offline Multi-Agent Decision Making

    cs.AI 2026-05 unverdicted novelty 7.0

    CoFlow achieves state-of-the-art coordination quality in offline MARL using only 1-3 denoising steps by natively coupling velocity fields across agents via coordinated attention and gating.

  24. CODA: Coordination via On-Policy Diffusion for Multi-Agent Offline Reinforcement Learning

    cs.LG 2026-04 unverdicted novelty 7.0

    CODA augments offline multi-agent RL with on-policy diffusion trajectories that evolve with the joint policy to enable coordination.

  25. CASP: Support-Aware Offline Policy Selection for Two-Stage Recommender Systems

    cs.IR 2026-04 unverdicted novelty 7.0

    CASP selects lower-burden two-stage recommender policies by combining doubly robust estimation with a penalty for weak data support and provides theoretical guarantees for conservative selection.

  26. SceneOrchestra: Efficient Agentic 3D Scene Synthesis via Full Tool-Call Trajectory Generation

    cs.CV 2026-04 unverdicted novelty 7.0

    SceneOrchestra trains an orchestrator to generate full tool-call trajectories for 3D scene synthesis and uses a discriminator during training to select high-quality plans, yielding state-of-the-art results with lower runtime.

  27. Beyond Bellman: High-Order Generator Regression for Continuous-Time Policy Evaluation

    stat.ML 2026-04 unverdicted novelty 7.0

    High-order generator regression from multi-step trajectories yields a second-order accurate estimator for finite-horizon continuous-time policy evaluation that outperforms the Bellman baseline in calibration studies a...

  28. Locality, Not Spectral Mixing, Governs Direct Propagation in Distributed Offline Dynamic Programming

    cs.DC 2026-04 unverdicted novelty 7.0

    Locality sets the fundamental round lower bound L_ε = floor(log(1/2ε)/log(1/γ)) for ε-accuracy on large-diameter graphs; direct propagation achieves it while gossip averaging pays extra 1/gap(W) factors.

  29. ViVa: A Video-Generative Value Model for Robot Reinforcement Learning

    cs.RO 2026-04 unverdicted novelty 7.0

    ViVa turns a video generator into a value model for robot RL that jointly forecasts future states and task value, yielding better performance on real-world box assembly when integrated with RECAP.

  30. VIP: Towards Universal Visual Reward and Representation via Value-Implicit Pre-Training

    cs.RO 2022-09 unverdicted novelty 7.0

    VIP learns a visual embedding from human videos whose distance defines dense, smooth rewards for arbitrary goal-image robot tasks without task-specific fine-tuning.

  31. Q-Flow: Stable and Expressive Reinforcement Learning with Flow-Based Policy

    cs.LG 2026-05 unverdicted novelty 6.0

    Q-Flow enables stable optimization of expressive flow-based policies in RL by propagating terminal values along deterministic flow dynamics to intermediate states for gradient updates without solver unrolling.

  32. Trust the Batch, On- or Off-Policy: Adaptive Policy Optimization for RL Post-Training

    cs.LG 2026-05 unverdicted novelty 6.0

    A new RL objective adapts trust-region and off-policy handling automatically via normalized effective sample size of batch policy ratios, matching tuned baselines without new hyperparameters.

  33. Discrete Flow Matching for Offline-to-Online Reinforcement Learning

    cs.LG 2026-05 unverdicted novelty 6.0

    DRIFT enables stable offline-to-online fine-tuning of CTMC policies in discrete RL via advantage-weighted discrete flow matching, path-space regularization, and candidate-set approximation.

  34. TokenRatio: Principled Token-Level Preference Optimization via Ratio Matching

    cs.CL 2026-05 unverdicted novelty 6.0

    TBPO derives a token-level preference optimization objective from sequence-level pairwise data via Bregman divergence ratio matching that generalizes DPO and improves alignment quality.

  35. RankQ: Offline-to-Online Reinforcement Learning via Self-Supervised Action Ranking

    cs.AI 2026-05 unverdicted novelty 6.0

    RankQ adds a self-supervised ranking loss to Q-learning to learn structured action orderings, yielding competitive or better performance than prior methods on D4RL benchmarks and large gains in vision-based robot fine-tuning.

  36. Adaptive Action Chunking via Multi-Chunk Q Value Estimation

    cs.LG 2026-05 unverdicted novelty 6.0

    ACH lets RL policies dynamically pick action chunk lengths by jointly estimating Q-values for all candidate lengths via a single Transformer pass.

  37. ACSAC: Adaptive Chunk Size Actor-Critic with Causal Transformer Q-Network

    cs.LG 2026-05 unverdicted novelty 6.0

    ACSAC adaptively selects action chunk sizes via a causal Transformer Q-network in actor-critic RL, proves the Bellman operator is a contraction, and reports state-of-the-art results on long-horizon manipulation tasks.

  38. Learning to Communicate Locally for Large-Scale Multi-Agent Pathfinding

    cs.AI 2026-05 unverdicted novelty 6.0

    LC-MAPF is a decentralized MAPF solver that uses a learnable multi-round communication module among nearby agents to outperform prior IL and RL methods while preserving scalability.

  39. Offline Reinforcement Learning for Rotation Profile Control in Tokamaks

    cs.LG 2026-05 unverdicted novelty 6.0

    Offline RL policies trained solely on DIII-D historical data were deployed on the tokamak and produced promising real-world control of the plasma rotation profile.

  40. On the Role of Language Representations in Auto-Bidding: Findings and Implications

    cs.AI 2026-05 unverdicted novelty 6.0

    SemBid injects LLM-encoded Task, History, and Strategy semantics as tokens into offline bidding trajectories and uses self-attention to outperform numerical-only baselines in performance, constraint satisfaction, and ...

  41. Adaptive Q-Chunking for Offline-to-Online Reinforcement Learning

    cs.LG 2026-05 unverdicted novelty 6.0

    Adaptive Q-Chunking selects optimal action chunk sizes at each state via normalized advantage comparisons to outperform fixed chunk sizes in offline-to-online RL on robot benchmarks.

  42. When Life Gives You BC, Make Q-functions: Extracting Q-values from Behavior Cloning for On-Robot Reinforcement Learning

    cs.RO 2026-05 unverdicted novelty 6.0

    Q2RL extracts Q-functions from BC policies via minimal interactions and applies Q-gating to enable stable offline-to-online RL, outperforming baselines on manipulation benchmarks and achieving up to 100% success on-robot.

  43. Adaptive Policy Selection and Fine-Tuning under Interaction Budgets for Offline-to-Online Reinforcement Learning

    cs.LG 2026-05 unverdicted novelty 6.0

    An adaptive UCB-based policy selection and fine-tuning strategy improves performance over standard O2O-RL baselines under interaction budgets.

  44. On the Optimal Sample Complexity of Offline Multi-Armed Bandits with KL Regularization

    cs.LG 2026-05 unverdicted novelty 6.0

    Offline KL-regularized MABs require sample complexity scaling as O(η S A C^π*/ε) for large regularization and Ω(S A C^π*/ε²) for small regularization, with matching lower bounds across the full range.

  45. QHyer: Q-conditioned Hybrid Attention-mamba Transformer for Offline Goal-conditioned RL

    cs.LG 2026-05 unverdicted novelty 6.0

    QHyer replaces return-to-go with a state-conditioned Q-estimator and adds a gated hybrid attention-mamba backbone to achieve state-of-the-art performance in offline goal-conditioned RL on both Markovian and non-Markov...

  46. QHyer: Q-conditioned Hybrid Attention-mamba Transformer for Offline Goal-conditioned RL

    cs.LG 2026-05 unverdicted novelty 6.0

    QHyer achieves state-of-the-art results in offline goal-conditioned RL by replacing return-to-go with a state-conditioned Q-estimator and introducing a gated hybrid attention-mamba backbone for content-adaptive histor...

  47. CoFlow: Coordinated Few-Step Flow for Offline Multi-Agent Decision Making

    cs.AI 2026-05 unverdicted novelty 6.0

    CoFlow preserves inter-agent coordination in few-step offline MARL by using a natively joint velocity field with Coordinated Velocity Attention and Adaptive Coordination Gating, matching or exceeding baselines in 1-3 ...

  48. Model-Based Proactive Cost Generation for Learning Safe Policies Offline with Limited Violation Data

    cs.LG 2026-05 unverdicted novelty 6.0

    PROCO generates synthetic unsafe samples via model-based rollouts and LLM-grounded costs to enable safer policy learning from offline datasets containing few or no violations.

  49. TSN-Affinity: Similarity-Driven Parameter Reuse for Continual Offline Reinforcement Learning

    cs.LG 2026-04 unverdicted novelty 6.0

    TSN-Affinity enables continual offline RL via similarity-guided parameter reuse in sparse subnetworks, showing better retention than replay baselines on Atari and robotic arm tasks.

  50. SOLAR-RL: Semi-Online Long-horizon Assignment Reinforcement Learning

    cs.LG 2026-04 unverdicted novelty 6.0

    SOLAR-RL assigns dense step-level rewards from static trajectory data by detecting first failure points and applying target-aligned shaping to improve long-horizon GUI task completion without full online interactions.

  51. Beyond Bellman: High-Order Generator Regression for Continuous-Time Policy Evaluation

    stat.ML 2026-04 unverdicted novelty 6.0

    High-order moment-matching estimation of the time-dependent generator improves continuous-time policy evaluation accuracy over first-order Bellman recursion by canceling lower-order truncation terms, with supporting e...

  52. Distributional Off-Policy Evaluation with Deep Quantile Process Regression

    stat.ML 2026-04 unverdicted novelty 6.0

    DQPOPE estimates the entire return distribution in off-policy evaluation via deep quantile process regression, providing statistical advantages over standard single-value methods with equivalent sample sizes.

  53. Fisher Decorator: Refining Flow Policy via a Local Transport Map

    cs.LG 2026-04 unverdicted novelty 6.0

    Fisher Decorator refines flow policies in offline RL via a local transport map and Fisher-matrix quadratic approximation of the KL constraint, yielding controllable error near the optimum and SOTA benchmark results.

  54. Lightning OPD: Efficient Post-Training for Large Reasoning Models with Offline On-Policy Distillation

    cs.LG 2026-04 unverdicted novelty 6.0

    Lightning OPD is an offline on-policy distillation method that matches standard OPD performance at 4x efficiency by enforcing teacher consistency between SFT and distillation phases.

  55. Whole-Body Mobile Manipulation using Offline Reinforcement Learning on Sub-optimal Controllers

    cs.RO 2026-04 unverdicted novelty 6.0

    WHOLE-MoMa improves whole-body mobile manipulation by applying offline RL with Q-chunking to demonstrations from randomized sub-optimal controllers, outperforming baselines and transferring to real robots without tele...

  56. BankerToolBench: Evaluating AI Agents in End-to-End Investment Banking Workflows

    cs.AI 2026-04 unverdicted novelty 6.0

    BankerToolBench is a new open benchmark of end-to-end investment banking workflows developed with 502 bankers; even the best tested model (GPT-5.4) fails nearly half the expert rubric criteria and produces zero client...

  57. JD-BP: A Joint-Decision Generative Framework for Auto-Bidding and Pricing

    cs.GT 2026-04 unverdicted novelty 6.0

    JD-BP jointly generates bids and pricing corrections via generative models, memory-less return-to-go, trajectory augmentation, and energy-based DPO to improve auto-bidding performance despite prediction errors and latency.

  58. Cross-fitted Proximal Learning for Model-Based Reinforcement Learning

    cs.LG 2026-04 unverdicted novelty 6.0

    A K-fold cross-fitted proximal bridge estimator for reward-emission and observation-transition functions in confounded POMDPs, with an oracle-comparator error bound decomposed into nuisance and averaging terms.

  59. Offline RL for Adaptive Policy Retrieval in Prior Authorization

    cs.IR 2026-04 unverdicted novelty 6.0

    Offline RL policies trained on synthetic prior authorization data achieve 92% accuracy with up to 47% fewer retrieval steps than fixed top-K baselines.

  60. Adaptive Memory Crystallization for Autonomous AI Agent Learning in Dynamic Environments

    cs.LG 2026-04 unverdicted novelty 6.0

    AMC models memory consolidation via a Liquid-Glass-Crystal process governed by an SDE with proven convergence to a Beta distribution, yielding 34-43% better forward transfer and 67-80% less forgetting on standard cont...

Reference graph

Works this paper leans on

284 extracted references · 284 canonical work pages · cited by 72 Pith papers · 10 internal anchors
