Reinforcement Learning with Action Chunking

Qiyang Li; Sergey Levine; Zhiyuan Zhou

arxiv: 2507.07969 · v4 · submitted 2025-07-10 · cs.LG · cs.AI · cs.RO · stat.ML

Pith reviewed 2026-05-19 05:14 UTC · model grok-4.3

classification cs.LG · cs.AI · cs.RO · stat.ML

keywords reinforcement learning, action chunking, offline-to-online RL, Q-chunking, temporal difference learning, long-horizon tasks, sparse rewards, exploration

The pith

Running reinforcement learning in a chunked action space lets agents use consistent sequences from offline data for better exploration and more stable learning.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces Q-chunking, a method that applies action chunking to temporal difference reinforcement learning in the offline-to-online setting. By running the algorithm directly in a space of action sequences rather than single steps, the agent can draw on temporally consistent behaviors present in an offline dataset to guide exploration. This setup also permits unbiased n-step backups that stabilize value estimates and speed up learning. The approach targets long-horizon tasks with sparse rewards, where standard methods often fail due to poor exploration. Experiments indicate stronger offline performance and higher online sample efficiency than prior offline-to-online techniques on manipulation tasks.

Core claim

Q-chunking adopts action chunking by directly running RL in a 'chunked' action space, enabling the agent to leverage temporally consistent behaviors from offline data for more effective online exploration and to use unbiased n-step backups for more stable and efficient TD learning.

What carries the argument

The chunked action space, in which the policy selects sequences of future actions rather than one action at each timestep.
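To make the chunked action space concrete, here is a minimal sketch of how one trajectory could be sliced into chunk-level transitions. The function name `make_chunked_transitions`, the stride-1 sliding window, and the flattening of each chunk into a single vector are illustrative assumptions, not details taken from the paper; episode termination inside a chunk is not handled.

```python
import numpy as np

def make_chunked_transitions(states, actions, rewards, h, gamma):
    """Slice one trajectory into chunk-level transitions (hypothetical helper).

    states:  array of shape (T + 1, state_dim)
    actions: array of shape (T, action_dim)
    rewards: array of shape (T,)
    Returns a list of tuples (s_t, a_{t:t+h}, discounted chunk reward, s_{t+h}).
    """
    T = len(actions)
    discounts = gamma ** np.arange(h)  # 1, gamma, ..., gamma^(h-1)
    transitions = []
    # Assumption: a sliding window with stride 1 over the trajectory.
    for t in range(T - h + 1):
        chunk_actions = actions[t:t + h].reshape(-1)  # h actions flattened into one vector
        chunk_reward = float(np.dot(discounts, rewards[t:t + h]))
        transitions.append((states[t], chunk_actions, chunk_reward, states[t + h]))
    return transitions
```

Each tuple can then be treated as a single transition by a standard TD learner whose action input is the flattened chunk.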

If this is right

Q-chunking achieves strong performance on the offline dataset and high sample efficiency during the online phase.
The method outperforms prior best offline-to-online RL algorithms on long-horizon sparse-reward manipulation tasks.
Temporal difference learning becomes more stable and efficient through the use of unbiased n-step backups.
Online exploration improves because the agent can commit to temporally consistent action sequences drawn from the prior data.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Chunking might still help when offline data is somewhat noisy if paired with simple filtering of inconsistent sequences.
The same idea could link imitation-learning techniques more directly to value-based online optimization without extra machinery.
Applying the approach to navigation or locomotion tasks would test whether the benefits extend past the manipulation domains evaluated here.

Load-bearing premise

The offline dataset contains temporally consistent action sequences that remain useful when the policy is optimized online inside the chunked action space.

What would settle it

Experiments on long-horizon sparse-reward tasks that use an offline dataset lacking consistent action chunks at the sequence level, showing no gains in online exploration or sample efficiency compared to standard non-chunked RL.

Figures

Figures reproduced from arXiv: 2507.07969 by Qiyang Li, Sergey Levine, Zhiyuan Zhou.

**Figure 1.** Q-chunking uses action chunking to enable fast value backups and effective exploration with temporally coherent actions. Left: an overview of our approach: Q-chunking operates in a temporally extended action space that allows for (1) efficient value backups and (2) effective exploration via temporally coherent actions; right: our method (QC) first pre-trains on an offline dataset for 1M steps (grey) and th…

**Figure 2.** Naïvely using action chunking for online RL with Gaussian policies leads to poor performance. (1) RLPD runs online RL on both offline data and online replay buffer [7]. (2) RLPD-AC is the same algorithm as RLPD but operates in a temporally extended action space (action chunk size of 5). (3) QC-RLPD additionally uses a behavior cloning loss on the actor (4 seeds).

**Figure 3.** Aggregated performance per OGBench domain. Our method, QC, achieves strong performance across all five challenging OGBench domains. We also include an aggregated performance plot for all the domains at the bottom right. The first 1M steps are offline training and the next 1M steps are online training with one environment step per training step (4 seeds per task; 5 tasks per domain).

**Figure 4.** Robomimic results. QC achieves strong performance across all three robomimic tasks. The first 1M steps are offline and the next 1M steps are online with one environment step per training step (5 seeds).

**Figure 5.** QC-FQL and n-step return on OGBench and robomimic. QC-FQL obtains a similar performance compared to QC. QC is slightly better than QC-FQL on OGBench offline and robomimic online, and slightly worse than QC-FQL on robomimic offline (4 seeds for OGBench, 5 seeds for robomimic). See Appendix D, …

**Figure 6.** Sensitivity analysis: action chunk size (h), critic ensemble size (K), and update-to-data ratio (UTD). Left: QC-FQL with different h on all 5 cube-triple tasks (5 seeds). QC-FQL with h = 1 is equivalent to FQL. Center: Increasing the ensemble size to K = 10 improves performance of both QC and BFN on cube-triple-task3 (5 seeds). Right: QC with UTD of 5 on cube-triple-task3 (5 seeds). We report only the onli…

**Figure 7.** End-effector movements early in the training and temporal coherency analysis on cube-triple-task3. Left: QC covers a more diverse set of states compared to BFN in the first 1000 environment steps. Right: QC exhibits a higher temporal coherency in end-effector compared to BFN (4 seeds).

**Figure 8.** We experiment on several challenging long-horizon, sparse-reward domains. See detailed task description for each domain in Appendix A. The rendered images of the robomimic tasks above are taken from Mandlekar et al. [42].

**Figure 9.** End-effector trajectory early in the training. Each subplot above shows the trajectory for 1000 consecutive time steps. We include up to Step 9000.

**Figure 10.** End-effector trajectory visualization late in the training. Each subplot above shows the trajectory for 1000 consecutive time steps. We include the trajectories from Step 900000 to Step 99000.

**Figure 11.** Full OGBench results by task. For each method on each task, we use 4 seeds.

**Figure 12.** Full OGBench results by task. For each method on each task, we use 4 seeds.

**Figure 13.** Full RLPD results by task. For each method on each task, we use 4 seeds. QC-RLPD is RLPD-AC (RLPD on the temporally extended action space) where we additionally add a fixed behavior cloning coefficient of 0.01.

**Figure 14.** Full robomimic ablation by task. For each method on each task, we use 5 seeds.

**Figure 15.** How long each method takes for one step, in milliseconds. Left: offline. Right: online (one agent training step and an environment step). The runtime is measured using the default hyperparameters in our paper on cube-triple-task1 on a single RTX-A5000.

Original abstract

We present Q-chunking, a simple yet effective recipe for improving reinforcement learning (RL) algorithms for long-horizon, sparse-reward tasks. Our recipe is designed for the offline-to-online RL setting, where the goal is to leverage an offline prior dataset to maximize the sample-efficiency of online learning. Effective exploration and sample-efficient learning remain central challenges in this setting, as it is not obvious how the offline data should be utilized to acquire a good exploratory policy. Our key insight is that action chunking, a technique popularized in imitation learning where sequences of future actions are predicted rather than a single action at each timestep, can be applied to temporal difference (TD)-based RL methods to mitigate the exploration challenge. Q-chunking adopts action chunking by directly running RL in a 'chunked' action space, enabling the agent to (1) leverage temporally consistent behaviors from offline data for more effective online exploration and (2) use unbiased $n$-step backups for more stable and efficient TD learning. Our experimental results demonstrate that Q-chunking exhibits strong offline performance and online sample efficiency, outperforming prior best offline-to-online methods on a range of long-horizon, sparse-reward manipulation tasks.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

Q-chunking runs TD learning inside a chunked action space to get better exploration and unbiased n-step returns from offline data, but the gains depend on the data already having consistent chunks.

The paper's main move is to take action chunking, already common in imitation learning, and apply it directly to TD-based RL in the offline-to-online setting. Instead of predicting one action at a time, the agent works in a space where each decision is a sequence of future actions. This is meant to let the policy borrow temporally consistent behaviors from the offline dataset for exploration, and to run n-step backups without the off-policy bias that standard n-step returns incur when the intermediate actions in the replayed data were not chosen by the current policy.
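The backup the letter describes can be written out explicitly. The following is a sketch in notation assumed here rather than copied from the paper: $h$ is the chunk length, $\gamma$ the discount factor, $a_{t:t+h}$ denotes the actions $a_t, \dots, a_{t+h-1}$, and $\pi$ is the chunk-level policy.

```latex
% Chunked TD target: the critic takes the whole chunk a_{t:t+h} as input.
Q(s_t, a_{t:t+h}) \;\leftarrow\; \sum_{i=0}^{h-1} \gamma^{i}\, r_{t+i}
  \;+\; \gamma^{h}\, Q\bigl(s_{t+h},\, a_{t+h:t+2h}\bigr),
\qquad a_{t+h:t+2h} \sim \pi(\cdot \mid s_{t+h}).

% Standard n-step target for a single-step critic, shown for contrast.
Q(s_t, a_t) \;\leftarrow\; \sum_{i=0}^{n-1} \gamma^{i}\, r_{t+i}
  \;+\; \gamma^{n}\, Q\bigl(s_{t+n},\, a_{t+n}\bigr),
\qquad a_{t+n} \sim \pi(\cdot \mid s_{t+n}).
```

In the second form the rewards $r_{t+1}, \dots, r_{t+n-1}$ depend on actions $a_{t+1}, \dots, a_{t+n-1}$ that were chosen by whichever policy collected the data, so the target is biased whenever that policy differs from $\pi$. In the first form those same actions are arguments of $Q$, so the reward sum is exactly what the critic is asked to predict and no mismatch arises inside the chunk.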

Referee Report

2 major / 2 minor

Summary. The paper introduces Q-chunking, a recipe for offline-to-online RL on long-horizon sparse-reward tasks. It runs TD-based RL directly inside a chunked action space so that the agent can (1) exploit temporally consistent action sequences present in the offline dataset for more effective online exploration and (2) perform unbiased n-step backups. Experiments are reported to show improved offline performance and online sample efficiency relative to prior offline-to-online baselines on manipulation tasks.

Significance. If the central claims hold, the work supplies a lightweight, algorithm-agnostic way to convert existing TD methods into more sample-efficient offline-to-online learners by borrowing the action-chunking idea from imitation learning. The small amount of added machinery (chiefly the action chunk size h, whose sensitivity is examined in Figure 6) and the direct applicability to standard TD updates are practical strengths that could affect how practitioners initialize exploration from offline data.

major comments (2)

The load-bearing assumption that offline trajectories contain reusable, temporally consistent chunks whose internal structure survives online policy optimization inside the coarser chunked action space is stated but not empirically tested. No chunk-level consistency metric, filtering step, or regularization term is introduced to enforce or recover this property when the original data policy varies within what becomes a single chunk.
The claim of unbiased n-step backups (Abstract and method description) requires a formal argument showing that the chunked transition and reward definitions preserve the unbiasedness of the multi-step estimator; without this derivation or an explicit equation relating the chunked Bellman operator to the original one, it is unclear whether the reported stability gain is a consequence of chunking or of other implementation choices.

minor comments (2)

Add error bars, number of seeds, and a clear statement of the full experimental protocol (including how chunks are formed from the offline dataset) so that the outperformance numbers can be reproduced and statistically evaluated.
Clarify the precise definition of the chunked action space and the corresponding state-transition and reward functions; a short pseudocode block or equation would remove ambiguity about how standard TD updates are applied inside the new space.
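As an illustration of the kind of pseudocode the last comment asks for, the sketch below computes a TD target for one chunk-level transition. It is a hedged reading of the description on this page, not the authors' implementation: `chunked_td_target`, `policy`, and `critic` are hypothetical names, and the chunk reward is assumed to be the discounted sum of per-step rewards collected while the chunk was executed.

```python
def chunked_td_target(chunk_reward, next_state, done, policy, critic, gamma, h):
    """TD target for one chunk-level transition (hypothetical helper).

    chunk_reward: discounted sum of the h per-step rewards inside the chunk
    next_state:   state reached after the h-th action of the chunk
    done:         True if the episode terminated at the end of the chunk
    policy:       callable, state -> next action chunk (assumed interface)
    critic:       callable, (state, action_chunk) -> scalar Q estimate (assumed interface)
    """
    next_chunk = policy(next_state)            # the policy proposes the following chunk
    bootstrap = critic(next_state, next_chunk)  # value of continuing from the chunk's end state
    # Bootstrapping is discounted by gamma^h because h environment steps have elapsed.
    return chunk_reward + (gamma ** h) * (1.0 - float(done)) * bootstrap
```

With `h = 1` this reduces to the ordinary one-step TD target, which matches the Figure 6 remark that QC-FQL with h = 1 is equivalent to FQL.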

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We address each major comment below and describe the revisions we will incorporate to improve the manuscript.

Referee: The load-bearing assumption that offline trajectories contain reusable, temporally consistent chunks whose internal structure survives online policy optimization inside the coarser chunked action space is stated but not empirically tested. No chunk-level consistency metric, filtering step, or regularization term is introduced to enforce or recover this property when the original data policy varies within what becomes a single chunk.

Authors: We agree that a direct empirical test of the chunk-consistency assumption would strengthen the presentation. Although the reported performance gains on long-horizon manipulation tasks are consistent with the assumption that temporally coherent sequences in the offline data remain useful under chunked online optimization, the manuscript does not contain an explicit consistency metric or analysis. In the revised version we will add a short empirical subsection that quantifies intra-chunk action variance both in the offline dataset and throughout online training, thereby providing concrete evidence that the low-variance structure is present and largely preserved. revision: yes
Referee: The claim of unbiased n-step backups (Abstract and method description) requires a formal argument showing that the chunked transition and reward definitions preserve the unbiasedness of the multi-step estimator; without this derivation or an explicit equation relating the chunked Bellman operator to the original one, it is unclear whether the reported stability gain is a consequence of chunking or of other implementation choices.

Authors: We appreciate the request for a formal justification. In the chunked formulation the agent executes a fixed sequence of actions over the chunk horizon; the accumulated reward is the discounted sum of the per-step rewards and the successor state is the state reached after the entire chunk. Because the critic takes the whole chunk as input, this construction yields an n-step return that is an unbiased target for the value of executing the chosen chunk and then following the current policy, regardless of which policy generated the chunk. The corresponding Bellman operator is a contraction in the chunked MDP; its fixed point is the value of the best chunk-committing policy, which need not coincide with the optimum of the original single-step MDP when closed-loop corrections inside a chunk matter. We will insert a concise derivation (including the explicit relation between the chunked and standard n-step targets) into the methods section of the revision to clarify this point. revision: yes
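The first response proposes quantifying intra-chunk action variance but does not define the metric. One possible definition, offered only as an assumption for illustration, is the per-dimension variance of actions inside non-overlapping chunks of length h, averaged over chunks and dimensions. The name `intra_chunk_action_variance` is hypothetical.

```python
import numpy as np

def intra_chunk_action_variance(actions, h):
    """Mean per-dimension action variance inside non-overlapping chunks of length h.

    actions: array of shape (T, action_dim); trailing steps that do not
             fill a whole chunk are dropped (an assumption of this sketch).
    """
    actions = np.asarray(actions, dtype=float)
    n_chunks = len(actions) // h
    if n_chunks == 0:
        raise ValueError("trajectory is shorter than one chunk")
    # Shape (n_chunks, h, action_dim): one row of h actions per chunk.
    chunks = actions[: n_chunks * h].reshape(n_chunks, h, -1)
    # Variance over the time axis inside each chunk, then averaged.
    return float(chunks.var(axis=1).mean())
```

A low value indicates that actions barely change within a chunk. Note that this measures constancy rather than smoothness, so a coherent but fast-moving reach would still score high; a metric on action differences might be closer to the temporal coherency shown in Figure 7.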

Circularity Check

0 steps flagged

No circularity: method is a direct recipe on existing TD algorithms

full rationale

The paper presents Q-chunking as applying the existing concept of action chunking directly to the action space of standard TD-based RL algorithms in the offline-to-online setting. The two claimed benefits (leveraging consistent behaviors from offline data and unbiased n-step backups) follow immediately from the definition of operating in a chunked action space; they are not derived via any fitted parameter, self-referential equation, or load-bearing self-citation that reduces the result to its own inputs. No equations or steps in the provided abstract or description collapse the claimed improvements back onto data used for evaluation. The derivation remains self-contained against external RL benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The approach rests on standard RL assumptions plus the domain assumption that offline data contains temporally consistent chunks; no new free parameters or invented entities are introduced in the abstract description.

axioms (2)

domain assumption Offline dataset contains temporally consistent action sequences usable for exploration
Invoked in the key insight paragraph to justify effective online exploration.
domain assumption n-step backups remain unbiased when performed over action chunks
Stated as part of the benefit of the chunked space.

pith-pipeline@v0.9.0 · 5747 in / 1283 out tokens · 33123 ms · 2026-05-19T05:14:27.284241+00:00 · methodology

discussion (0)

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean washburn_uniqueness_aczel unclear

unclear
Relation between the paper passage and the cited Recognition theorem.

Q-chunking adopts action chunking by directly running RL in a 'chunked' action space, enabling the agent to (1) leverage temporally consistent behaviors from offline data for more effective online exploration and (2) use unbiased n-step backups for more stable and efficient TD learning.

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 12 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Aligning Flow Map Policies with Optimal Q-Guidance
cs.LG 2026-05 unverdicted novelty 7.0

Flow map policies enable fast one-step inference for flow-based RL policies, and FMQ provides an optimal closed-form Q-guided target for offline-to-online adaptation under trust-region constraints, achieving SOTA performance.
Revisiting Policy Gradients for Restricted Policy Classes: Escaping Myopic Local Optima with $k$-step Policy Gradients
cs.LG 2026-05 unverdicted novelty 6.0

The k-step policy gradient converges exponentially close to the optimal deterministic policy in restricted classes, achieving O(1/T) rates under smoothness assumptions without distribution mismatch factors.
Long-Horizon Q-Learning: Accurate Value Learning via n-Step Inequalities
cs.AI 2026-05 unverdicted novelty 6.0

LQL stabilizes Q-learning by penalizing violations of n-step action-sequence lower bounds with a hinge loss computed from standard network outputs.
Long-Horizon Q-Learning: Accurate Value Learning via n-Step Inequalities
cs.AI 2026-05 unverdicted novelty 6.0

LQL turns n-step action-sequence lower bounds into a practical hinge-loss stabilizer for off-policy Q-learning without extra networks or forward passes.
OGPO: Sample Efficient Full-Finetuning of Generative Control Policies
cs.LG 2026-05 unverdicted novelty 6.0

OGPO is a sample-efficient off-policy method for full finetuning of generative control policies that reaches SOTA on robotic manipulation tasks and can recover from poor behavior-cloning initializations without expert data.
GSDrive: Reinforcing Driving Policies by Multi-mode Future Trajectory Probing with 3D Gaussian Splatting Environment
cs.RO 2026-04 unverdicted novelty 6.0

GSDrive combines IL priors with RL feedback by probing multi-mode futures inside a 3D Gaussian Splatting simulator to supply dense rewards for closed-loop driving policy improvement on nuScenes.
GSDrive: Reinforcing Driving Policies by Multi-mode Future Trajectory Probing with 3D Gaussian Splatting Environment
cs.RO 2026-04 unverdicted novelty 6.0

GSDrive improves end-to-end driving policies through 3D Gaussian Splatting simulation and multi-mode trajectory probing that supplies dense, differentiable rewards for reinforcement learning.
Empowering Multi-Robot Cooperation via Sequential World Models
cs.RO 2025-09 unverdicted novelty 6.0

SeqWM introduces sequential autoregressive agent-wise world models for multi-robot MBRL, outperforming baselines in performance and sample efficiency on Bi-DexHands and Multi-Quadruped tasks with physical robot deployment.
COOPO: Cyclic Offline-Online Policy Optimization Algorithm
cs.LG 2026-05 unverdicted novelty 5.0

COOPO is a cyclic offline-online RL algorithm that repeatedly anchors the policy to a dataset via KL-regularized updates then fine-tunes online, claiming better sample efficiency and monotonic improvement under covera...
DyGRO-VLA: Cross-Task Scaling of Vision-Language-Action Models via Dynamic Grouped Residual Optimization
cs.RO 2026-05 unverdicted novelty 5.0

DyGRO-VLA is a two-stage optimization framework for cross-task scaling of Vision-Language-Action models via dynamic grouped residual optimization in RL.
ProcVLM: Learning Procedure-Grounded Progress Rewards for Robotic Manipulation
cs.RO 2026-05 unverdicted novelty 5.0

ProcVLM learns procedure-grounded dense progress rewards for robotic manipulation via a reasoning-before-estimation VLM trained on a 60M-frame synthesized corpus from 30 embodied datasets.
RAD-2: Scaling Reinforcement Learning in a Generator-Discriminator Framework
cs.CV 2026-04 unverdicted novelty 5.0

RAD-2 uses a diffusion generator and RL discriminator to cut collision rates by 56% in closed-loop autonomous driving planning.

Reference graph

Works this paper leans on

93 extracted references · 93 canonical work pages · cited by 10 Pith papers · 10 internal anchors

[1]

Reincarnating reinforcement learning: Reusing prior computation to accelerate progress

Rishabh Agarwal, Max Schwarzer, Pablo Samuel Castro, Aaron C Courville, and Marc Bellemare. Reincarnating reinforcement learning: Reusing prior computation to accelerate progress. In S. Koyejo, S. Mohamed, A. Agarwal, D. Belgrave, K. Cho, and A. Oh, editors, Advances in Neural Information Processing Systems, volume 35, pages 28955–28971. Curran Associat...

work page 2022
[2]

Reincarnating reinforcement learning: Reusing prior computation to accelerate progress

Rishabh Agarwal, Max Schwarzer, Pablo Samuel Castro, Aaron C Courville, and Marc Bellemare. Reincarnating reinforcement learning: Reusing prior computation to accelerate progress. Advances in Neural Information Processing Systems, 35:28955–28971, 2022

work page 2022
[3]

OPAL: Offline primitive discovery for accelerating offline reinforcement learning

Anurag Ajay, Aviral Kumar, Pulkit Agrawal, Sergey Levine, and Ofir Nachum. OPAL: Offline primitive discovery for accelerating offline reinforcement learning. In International Conference on Learning Representations, 2021. URL https://openreview.net/forum?id=V69LGwJ0lIN

work page 2021
[4]

The option-critic architecture

Pierre-Luc Bacon, Jean Harb, and Doina Precup. The option-critic architecture. In Proceedings of the AAAI conference on artificial intelligence, volume 31, 2017

work page 2017
[5]

Option discovery using deep skill chaining

Akhil Bagaria and George Konidaris. Option discovery using deep skill chaining. In International Conference on Learning Representations, 2019

work page 2019
[6]

Effectively learning initiation sets in hierarchical reinforcement learning

Akhil Bagaria, Ben Abbatematteo, Omer Gottesman, Matt Corsaro, Sreehari Rammohan, and George Konidaris. Effectively learning initiation sets in hierarchical reinforcement learning. Advances in Neural Information Processing Systems, 36, 2024

work page 2024
[7]

Efficient online reinforcement learning with offline data

Philip J Ball, Laura Smith, Ilya Kostrikov, and Sergey Levine. Efficient online reinforcement learning with offline data. In International Conference on Machine Learning, pages 1577–1594. PMLR, 2023

work page 2023
[8]

Roboagent: Generalization and efficiency in robot manipulation via semantic augmentations and action chunking

Homanga Bharadhwaj, Jay Vakil, Mohit Sharma, Abhinav Gupta, Shubham Tulsiani, and Vikash Kumar. Roboagent: Generalization and efficiency in robot manipulation via semantic augmentations and action chunking. In 2024 IEEE International Conference on Robotics and Automation (ICRA), pages 4788–4795. IEEE, 2024

work page 2024
[9]

Self-supervised reinforcement learning that transfers using random features

Boyuan Chen, Chuning Zhu, Pulkit Agrawal, Kaiqing Zhang, and Abhishek Gupta. Self-supervised reinforcement learning that transfers using random features. Advances in Neural Information Processing Systems, 36, 2024

work page 2024
[10]

Intrinsically motivated reinforcement learning

Nuttapong Chentanez, Andrew Barto, and Satinder Singh. Intrinsically motivated reinforcement learning. Advances in neural information processing systems, 17, 2004

work page 2004
[11]

Diffusion policy: Visuomotor policy learning via action diffusion

Cheng Chi, Zhenjia Xu, Siyuan Feng, Eric Cousineau, Yilun Du, Benjamin Burchfiel, Russ Tedrake, and Shuran Song. Diffusion policy: Visuomotor policy learning via action diffusion. The International Journal of Robotics Research, page 02783649241273668, 2023

work page 2023
[12]

Accelerating robotic reinforcement learning via parameterized action primitives

Murtaza Dalal, Deepak Pathak, and Russ R Salakhutdinov. Accelerating robotic reinforcement learning via parameterized action primitives. Advances in Neural Information Processing Systems, 34:21847–21859, 2021

work page 2021
[13]

Hierarchical relative entropy policy search

Christian Daniel, Gerhard Neumann, Oliver Kroemer, and Jan Peters. Hierarchical relative entropy policy search. Journal of Machine Learning Research, 17(93):1–50, 2016

work page 2016
[14]

Feudal reinforcement learning

Peter Dayan and Geoffrey E Hinton. Feudal reinforcement learning. Advances in neural information processing systems, 5, 1992

work page 1992
[15]

Learning transferable sub-goals by hypothesizing generalizing features

Anita de Mello Koch, Akhil Bagaria, Bingnan Huo, Zhiyuan Zhou, Cameron Allen, and George Konidaris. Learning transferable sub-goals by hypothesizing generalizing features. 2025

work page 2025
[16]

Hierarchical reinforcement learning with the maxq value function decomposition

Thomas G Dietterich. Hierarchical reinforcement learning with the maxq value function decomposition. Journal of artificial intelligence research, 13:227–303, 2000

work page 2000
[17]

Revisiting fundamentals of experience replay

William Fedus, Prajit Ramachandran, Rishabh Agarwal, Yoshua Bengio, Hugo Larochelle, Mark Rowland, and Will Dabney. Revisiting fundamentals of experience replay. In International conference on machine learning, pages 3061–3071. PMLR, 2020

work page 2020
[18]

Multi-Level Discovery of Deep Options

Roy Fox, Sanjay Krishnan, Ion Stoica, and Ken Goldberg. Multi-level discovery of deep options. arXiv preprint arXiv:1703.08294, 2017

work page internal anchor Pith review Pith/arXiv arXiv 2017
[19]

Unsupervised zero-shot reinforcement learning via functional reward encodings

Kevin Frans, Seohong Park, Pieter Abbeel, and Sergey Levine. Unsupervised zero-shot reinforcement learning via functional reward encodings. In Ruslan Salakhutdinov, Zico Kolter, Katherine Heller, Adrian Weller, Nuria Oliver, Jonathan Scarlett, and Felix Berkenkamp, editors, Proceedings of the 41st International Conference on Machine Learning, volume 2...

work page 2024
[20]

D4RL: Datasets for Deep Data-Driven Reinforcement Learning

Justin Fu, Aviral Kumar, Ofir Nachum, George Tucker, and Sergey Levine. D4RL: Datasets for deep data-driven reinforcement learning. arXiv preprint arXiv:2004.07219, 2020

work page internal anchor Pith review Pith/arXiv arXiv 2004
[21]

Hierarchical skills for efficient exploration

Jonas Gehring, Gabriel Synnaeve, Andreas Krause, and Nicolas Usunier. Hierarchical skills for efficient exploration. Advances in Neural Information Processing Systems, 34:11553–11564, 2021

work page 2021
[22]

One act play: Single demonstration behavior cloning with action chunking transformers

Abraham George and Amir Barati Farimani. One act play: Single demonstration behavior cloning with action chunking transformers. arXiv preprint arXiv:2309.10175, 2023

work page arXiv 2023
[23]

Emaq: Expected-max q-learning operator for simple yet effective offline and online rl

Seyed Kamyar Seyed Ghasemipour, Dale Schuurmans, and Shixiang Shane Gu. Emaq: Expected-max q-learning operator for simple yet effective offline and online rl. In International Conference on Machine Learning, pages 3682–3691. PMLR, 2021

work page 2021
[24]

Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor

Tuomas Haarnoja, Aurick Zhou, Pieter Abbeel, and Sergey Levine. Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor. In International conference on machine learning, pages 1861–1870. PMLR, 2018

work page 2018
[25]

Rainbow: Combining improvements in deep reinforcement learning

Matteo Hessel, Joseph Modayil, Hado Van Hasselt, Tom Schaul, Georg Ostrovski, Will Dabney, Dan Horgan, Bilal Piot, Mohammad Azar, and David Silver. Rainbow: Combining improvements in deep reinforcement learning. In Proceedings of the AAAI conference on artificial intelligence, volume 32, 2018

work page 2018
[26]

Kl divergence of max-of-n, 2023

Jacob Hilton. Kl divergence of max-of-n, 2023. URL https://www.jacobh.co.uk/bon_kl.pdf

work page 2023
[27]

Unsupervised behavior extraction via random intent priors

Hao Hu, Yiqin Yang, Jianing Ye, Ziqing Mai, and Chongjie Zhang. Unsupervised behavior extraction via random intent priors. In Thirty-seventh Conference on Neural Information Processing Systems, 2023. URL https://openreview.net/forum?id=4vGVQVz5KG

work page 2023
[28]

Recurrent experience replay in distributed reinforcement learning

Steven Kapturowski, Georg Ostrovski, John Quan, Remi Munos, and Will Dabney. Recurrent experience replay in distributed reinforcement learning. In International conference on learning representations, 2018

work page 2018
[29]

Variational temporal abstraction

Taesup Kim, Sungjin Ahn, and Yoshua Bengio. Variational temporal abstraction. Advances in Neural Information Processing Systems, 32, 2019

work page 2019
[30]

Policy search for motor primitives in robotics

Jens Kober and Jan Peters. Policy search for motor primitives in robotics. Advances in neural information processing systems, 21, 2008

work page 2008
[31]

Autonomous robot skill acquisition

George Dimitri Konidaris. Autonomous robot skill acquisition. University of Massachusetts Amherst, 2011

work page 2011
[32]

Offline Reinforcement Learning with Implicit Q-Learning

Ilya Kostrikov, Ashvin Nair, and Sergey Levine. Offline reinforcement learning with implicit Q-learning. arXiv preprint arXiv:2110.06169, 2021

work page internal anchor Pith review Pith/arXiv arXiv 2021
[33]

Revisiting Peng’s Q(λ) for modern reinforcement learning

Tadashi Kozuno, Yunhao Tang, Mark Rowland, Rémi Munos, Steven Kapturowski, Will Dabney, Michal Valko, and David Abel. Revisiting Peng’s Q(λ) for modern reinforcement learning. In International Conference on Machine Learning, pages 5794–5804. PMLR, 2021

work page 2021
[34]

Hierarchical deep reinforcement learning: Integrating temporal abstraction and intrinsic motivation

Tejas D Kulkarni, Karthik Narasimhan, Ardavan Saeedi, and Josh Tenenbaum. Hierarchical deep reinforcement learning: Integrating temporal abstraction and intrinsic motivation. Advances in neural information processing systems, 29, 2016

work page 2016
[35]

Conservative Q-learning for offline reinforcement learning

Aviral Kumar, Aurick Zhou, George Tucker, and Sergey Levine. Conservative Q-learning for offline reinforcement learning. Advances in Neural Information Processing Systems, 33: 1179–1191, 2020

work page 2020
[36]

Offline-to-online reinforcement learning via balanced replay and pessimistic Q-ensemble

Seunghyun Lee, Younggyo Seo, Kimin Lee, Pieter Abbeel, and Jinwoo Shin. Offline-to-online reinforcement learning via balanced replay and pessimistic Q-ensemble. In Conference on Robot Learning, pages 1702–1712. PMLR, 2022

work page 2022
[37]

TOP-ERL: Transformer-based off-policy episodic reinforcement learning

Ge Li, Dong Tian, Hongyi Zhou, Xinkai Jiang, Rudolf Lioutikov, and Gerhard Neumann. TOP-ERL: Transformer-based off-policy episodic reinforcement learning. In The Thirteenth International Conference on Learning Representations, 2025. URL https://openreview.net/forum?id=N4NhVN30ph

work page 2025
[38]

Accelerating exploration with unlabeled prior data

Qiyang Li, Jason Zhang, Dibya Ghosh, Amy Zhang, and Sergey Levine. Accelerating exploration with unlabeled prior data. Advances in Neural Information Processing Systems, 36, 2024

work page 2024
[39]

Continuous control with deep reinforcement learning

Timothy P Lillicrap, Jonathan J Hunt, Alexander Pritzel, Nicolas Heess, Tom Erez, Yuval Tassa, David Silver, and Daan Wierstra. Continuous control with deep reinforcement learning. arXiv preprint arXiv:1509.02971, 2015

work page internal anchor Pith review Pith/arXiv arXiv 2015
[40]

Flow Straight and Fast: Learning to Generate and Transfer Data with Rectified Flow

Xingchao Liu, Chengyue Gong, and Qiang Liu. Flow straight and fast: Learning to generate and transfer data with rectified flow. arXiv preprint arXiv:2209.03003, 2022

work page internal anchor Pith review Pith/arXiv arXiv 2022
[41]

Finetuning from offline reinforcement learning: Challenges, trade-offs and practical solutions

Yicheng Luo, Jackie Kay, Edward Grefenstette, and Marc Peter Deisenroth. Finetuning from offline reinforcement learning: Challenges, trade-offs and practical solutions. arXiv preprint arXiv:2303.17396, 2023

work page arXiv 2023
[42]

What Matters in Learning from Offline Human Demonstrations for Robot Manipulation

Ajay Mandlekar, Danfei Xu, Josiah Wong, Soroush Nasiriany, Chen Wang, Rohun Kulkarni, Li Fei-Fei, Silvio Savarese, Yuke Zhu, and Roberto Martín-Martín. What matters in learning from offline human demonstrations for robot manipulation. arXiv preprint arXiv:2108.03298, 2021

work page internal anchor Pith review Pith/arXiv arXiv 2021
[43]

Dynamic abstraction in reinforcement learning via clustering

Shie Mannor, Ishai Menache, Amit Hoze, and Uri Klein. Dynamic abstraction in reinforcement learning via clustering. In Proceedings of the twenty-first international conference on Machine learning, page 71, 2004

work page 2004
[44]

Q-cut—dynamic discovery of sub-goals in reinforcement learning

Ishai Menache, Shie Mannor, and Nahum Shimkin. Q-cut—dynamic discovery of sub-goals in reinforcement learning. In Machine Learning: ECML 2002: 13th European Conference on Machine Learning Helsinki, Finland, August 19–23, 2002 Proceedings 13, pages 295–306. Springer, 2002

work page 2002
[45]

Neural probabilistic motor primitives for humanoid control

Josh Merel, Leonard Hasenclever, Alexandre Galashov, Arun Ahuja, Vu Pham, Greg Wayne, Yee Whye Teh, and Nicolas Heess. Neural probabilistic motor primitives for humanoid control. arXiv preprint arXiv:1811.11711, 2018

work page internal anchor Pith review Pith/arXiv arXiv 2018
[46]

Asynchronous methods for deep reinforcement learning

Volodymyr Mnih, Adria Puigdomenech Badia, Mehdi Mirza, Alex Graves, Timothy Lillicrap, Tim Harley, David Silver, and Koray Kavukcuoglu. Asynchronous methods for deep reinforcement learning. In International conference on machine learning, pages 1928–1937. PMLR, 2016

work page 2016
[47]

Data-efficient hierarchical reinforcement learning

Ofir Nachum, Shixiang Shane Gu, Honglak Lee, and Sergey Levine. Data-efficient hierarchical reinforcement learning. Advances in neural information processing systems, 31, 2018

work page 2018
[48]

AWAC: Accelerating Online Reinforcement Learning with Offline Datasets

Ashvin Nair, Abhishek Gupta, Murtaza Dalal, and Sergey Levine. AWAC: Accelerating online reinforcement learning with offline datasets. arXiv preprint arXiv:2006.09359, 2020

work page internal anchor Pith review Pith/arXiv arXiv 2020
[49]

Cal-QL: Calibrated offline RL pre-training for efficient online fine-tuning

Mitsuhiko Nakamoto, Simon Zhai, Anikait Singh, Max Sobol Mark, Yi Ma, Chelsea Finn, Aviral Kumar, and Sergey Levine. Cal-QL: Calibrated offline RL pre-training for efficient online fine-tuning. Advances in Neural Information Processing Systems, 36, 2024

work page 2024
[50]

Learning and retrieval from prior data for skill-based imitation learning

Soroush Nasiriany, Tian Gao, Ajay Mandlekar, and Yuke Zhu. Learning and retrieval from prior data for skill-based imitation learning. In Conference on Robot Learning, 2022

work page 2022
[51]

Value prediction network

Junhyuk Oh, Satinder Singh, and Honglak Lee. Value prediction network. Advances in neural information processing systems, 30, 2017

work page 2017
[52]

Probabilistic movement primitives

Alexandros Paraschos, Christian Daniel, Jan R Peters, and Gerhard Neumann. Probabilistic movement primitives. Advances in neural information processing systems, 26, 2013

work page 2013
[53]

Ogbench: Benchmarking offline goal-conditioned rl

Seohong Park, Kevin Frans, Benjamin Eysenbach, and Sergey Levine. Ogbench: Benchmarking offline goal-conditioned rl. ArXiv, 2024

work page 2024
[54]

OGBench: Benchmarking Offline Goal-Conditioned RL, February 2025

Seohong Park, Kevin Frans, Benjamin Eysenbach, and Sergey Levine. Ogbench: Benchmarking offline goal-conditioned rl. arXiv preprint arXiv:2410.20092, 2024

work page arXiv 2024
[55]

Foundation policies with hilbert representations

Seohong Park, Tobias Kreiman, and Sergey Levine. Foundation policies with hilbert representations. In Forty-first International Conference on Machine Learning, 2024. URL https://openreview.net/forum?id=LhNsSaAKub

work page 2024
[56]

Flow Q-learning

Seohong Park, Qiyang Li, and Sergey Levine. Flow Q-learning. arXiv preprint arXiv:2502.02538, 2025

work page arXiv 2025
[57]

Deeploco: Dynamic locomotion skills using hierarchical deep reinforcement learning

Xue Bin Peng, Glen Berseth, KangKang Yin, and Michiel Van De Panne. Deeploco: Dynamic locomotion skills using hierarchical deep reinforcement learning. Acm transactions on graphics (tog), 36(4):1–13, 2017

work page 2017
[58]

Accelerating reinforcement learning with learned skill priors

Karl Pertsch, Youngwoon Lee, and Joseph Lim. Accelerating reinforcement learning with learned skill priors. In Conference on robot learning, pages 188–204. PMLR, 2021

work page 2021
[59]

Diffusion Policy Policy Optimization

Allen Z Ren, Justin Lidard, Lars L Ankile, Anthony Simeonov, Pulkit Agrawal, Anirudha Majumdar, Benjamin Burchfiel, Hongkai Dai, and Max Simchowitz. Diffusion policy policy optimization. arXiv preprint arXiv:2409.00588, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[60]

Learning by playing solving sparse reward tasks from scratch

Martin Riedmiller, Roland Hafner, Thomas Lampe, Michael Neunert, Jonas Degrave, Tom Wiele, Vlad Mnih, Nicolas Heess, and Jost Tobias Springenberg. Learning by playing solving sparse reward tasks from scratch. In International conference on machine learning, pages 4344–4353. PMLR, 2018

work page 2018
[61]

Dynamic movement primitives-a framework for motor control in humans and humanoid robotics

Stefan Schaal. Dynamic movement primitives-a framework for motor control in humans and humanoid robotics. In Adaptive motion of animals and machines, pages 261–280. Springer, 2006

work page 2006
[62]

Mastering atari, go, chess and shogi by planning with a learned model

Julian Schrittwieser, Ioannis Antonoglou, Thomas Hubert, Karen Simonyan, Laurent Sifre, Simon Schmitt, Arthur Guez, Edward Lockhart, Demis Hassabis, Thore Graepel, et al. Mastering atari, go, chess and shogi by planning with a learned model. Nature, 588(7839):604–609, 2020

work page 2020
[63]

Reinforcement learning with action sequence for data-efficient robot learning

Younggyo Seo and Pieter Abbeel. Reinforcement learning with action sequence for data-efficient robot learning. arXiv preprint arXiv:2411.12155, 2024

work page arXiv 2024
[64]

Continuous control with coarse-to-fine reinforcement learning

Younggyo Seo, Jafar Uruç, and Stephen James. Continuous control with coarse-to-fine reinforcement learning. In 8th Annual Conference on Robot Learning, 2024. URL https://openreview.net/forum?id=WjDR48cL3O

work page 2024
[65]

Learning robot skills with temporal variational inference

Tanmay Shankar and Abhinav Gupta. Learning robot skills with temporal variational inference. In International Conference on Machine Learning, pages 8624–8633. PMLR, 2020

work page 2020
[66]

Using relative novelty to identify useful temporal abstractions in reinforcement learning

Özgür Şimşek and Andrew G Barto. Using relative novelty to identify useful temporal abstractions in reinforcement learning. In Proceedings of the twenty-first international conference on Machine learning, page 95, 2004

work page 2004
[67]

Özgür Şimşek and Andrew G. Barto. Betweenness centrality as a basis for forming skills. Working paper, University of Massachusetts Amherst, April 2007

work page 2007
[68]

Parrot: Data-driven behavioral priors for reinforcement learning

Avi Singh, Huihan Liu, Gaoyue Zhou, Albert Yu, Nicholas Rhinehart, and Sergey Levine. Parrot: Data-driven behavioral priors for reinforcement learning. In International Conference on Learning Representations, 2021. URL https://openreview.net/forum?id=Ysuv-WOFeKR

work page 2021
[69]

Hybrid RL: Using both offline and online data can make RL efficient

Yuda Song, Yifei Zhou, Ayush Sekhari, Drew Bagnell, Akshay Krishnamurthy, and Wen Sun. Hybrid RL: Using both offline and online data can make RL efficient. In The Eleventh International Conference on Learning Representations, 2023. URL https://openreview.net/forum?id=yyBis80iUuU

work page 2023
[70]

Option discovery in hierarchical reinforcement learning using spatio-temporal clustering

Aravind Srinivas, Ramnandan Krishnamurthy, Peeyush Kumar, and Balaraman Ravindran. Option discovery in hierarchical reinforcement learning using spatio-temporal clustering. arXiv preprint arXiv:1605.05359, 2016

work page arXiv 2016
[71]

Learning to summarize with human feedback

Nisan Stiennon, Long Ouyang, Jeffrey Wu, Daniel Ziegler, Ryan Lowe, Chelsea Voss, Alec Radford, Dario Amodei, and Paul F Christiano. Learning to summarize with human feedback. Advances in neural information processing systems, 33:3008–3021, 2020

work page 2020
[72]

Reinforcement learning: An introduction, volume 1

Richard S Sutton, Andrew G Barto, et al. Reinforcement learning: An introduction, volume 1. MIT press Cambridge, 1998

work page 1998
[73]

Between MDPs and semi-MDPs: A framework for temporal abstraction in reinforcement learning

Richard S Sutton, Doina Precup, and Satinder Singh. Between MDPs and semi-MDPs: A framework for temporal abstraction in reinforcement learning. Artificial intelligence, 112(1-2): 181–211, 1999

work page 1999
[74]

Revisiting the minimalist approach to offline reinforcement learning

Denis Tarasov, Vladislav Kurenkov, Alexander Nikulin, and Sergey Kolesnikov. Revisiting the minimalist approach to offline reinforcement learning. Advances in Neural Information Processing Systems, 36, 2024

work page 2024
[75]

Chunking the critic: A transformer-based soft actor-critic with n-step returns

Dong Tian, Ge Li, Hongyi Zhou, Onur Celik, and Gerhard Neumann. Chunking the critic: A transformer-based soft actor-critic with n-step returns. arXiv preprint arXiv:2503.03660, 2025

work page arXiv 2025
[76]

Does zero-shot reinforcement learning exist?

Ahmed Touati, Jérémy Rapin, and Yann Ollivier. Does zero-shot reinforcement learning exist? In The Eleventh International Conference on Learning Representations, 2022

work page 2022
[77]

Strategic attentive writer for learning macro-actions

Alexander Vezhnevets, Volodymyr Mnih, Simon Osindero, Alex Graves, Oriol Vinyals, John Agapiou, et al. Strategic attentive writer for learning macro-actions. Advances in neural information processing systems, 29, 2016

work page 2016
[78]

Feudal networks for hierarchical reinforcement learning

Alexander Sasha Vezhnevets, Simon Osindero, Tom Schaul, Nicolas Heess, Max Jaderberg, David Silver, and Koray Kavukcuoglu. Feudal networks for hierarchical reinforcement learning. In International conference on machine learning, pages 3540–3549. PMLR, 2017

work page 2017
[79]

Train once, get a family: State-adaptive balances for offline-to-online reinforcement learning

Shenzhi Wang, Qisen Yang, Jiawei Gao, Matthieu Lin, Hao Chen, Liwei Wu, Ning Jia, Shiji Song, and Gao Huang. Train once, get a family: State-adaptive balances for offline-to-online reinforcement learning. Advances in Neural Information Processing Systems, 36:47081–47104, 2023

work page 2023
[80]

Learning from delayed rewards

Christopher John Cornish Hellaby Watkins et al. Learning from delayed rewards. 1989

work page 1989


[1] [1]

Reincarnating reinforcement learning: Reusing prior computation to accelerate progress

Rishabh Agarwal, Max Schwarzer, Pablo Samuel Castro, Aaron C Courville, and Marc Bellemare. Reincarnating reinforcement learning: Reusing prior computation to accelerate progress. In S. Koyejo, S. Mohamed, A. Agarwal, D. Belgrave, K. Cho, and A. Oh, editors, Advances in Neural Information Processing Systems, volume 35, pages 28955–28971. Curran Associat...

work page 2022

[2] [2]

Reincarnating reinforcement learning: Reusing prior computation to accelerate progress

Rishabh Agarwal, Max Schwarzer, Pablo Samuel Castro, Aaron C Courville, and Marc Bellemare. Reincarnating reinforcement learning: Reusing prior computation to accelerate progress. Advances in neural information processing systems, 35:28955–28971, 2022

work page 2022

[3] [3]

OPAL: Offline primitive discovery for accelerating offline reinforcement learning

Anurag Ajay, Aviral Kumar, Pulkit Agrawal, Sergey Levine, and Ofir Nachum. OPAL: Offline primitive discovery for accelerating offline reinforcement learning. In International Conference on Learning Representations, 2021. URL https://openreview.net/forum?id=V69LGwJ0lIN

work page 2021

[4] [4]

The option-critic architecture

Pierre-Luc Bacon, Jean Harb, and Doina Precup. The option-critic architecture. In Proceedings of the AAAI conference on artificial intelligence, volume 31, 2017

work page 2017

[5] [5]

Option discovery using deep skill chaining

Akhil Bagaria and George Konidaris. Option discovery using deep skill chaining. In International Conference on Learning Representations, 2019

work page 2019

[6] [6]

Effectively learning initiation sets in hierarchical reinforcement learning

Akhil Bagaria, Ben Abbatematteo, Omer Gottesman, Matt Corsaro, Sreehari Rammohan, and George Konidaris. Effectively learning initiation sets in hierarchical reinforcement learning. Advances in Neural Information Processing Systems, 36, 2024

work page 2024

[7] [7]

Efficient online reinforcement learning with offline data

Philip J Ball, Laura Smith, Ilya Kostrikov, and Sergey Levine. Efficient online reinforcement learning with offline data. In International Conference on Machine Learning, pages 1577–1594. PMLR, 2023

work page 2023

[8] [8]

Roboagent: Generalization and efficiency in robot manipulation via semantic augmentations and action chunking

Homanga Bharadhwaj, Jay Vakil, Mohit Sharma, Abhinav Gupta, Shubham Tulsiani, and Vikash Kumar. Roboagent: Generalization and efficiency in robot manipulation via semantic augmentations and action chunking. In 2024 IEEE International Conference on Robotics and Automation (ICRA), pages 4788–4795. IEEE, 2024

work page 2024

[9] [9]

Self-supervised reinforcement learning that transfers using random features

Boyuan Chen, Chuning Zhu, Pulkit Agrawal, Kaiqing Zhang, and Abhishek Gupta. Self-supervised reinforcement learning that transfers using random features. Advances in Neural Information Processing Systems, 36, 2024

work page 2024

[10] [10]

Intrinsically motivated reinforcement learning

Nuttapong Chentanez, Andrew Barto, and Satinder Singh. Intrinsically motivated reinforcement learning. Advances in neural information processing systems, 17, 2004

work page 2004

[11] [11]

Diffusion policy: Visuomotor policy learning via action diffusion

Cheng Chi, Zhenjia Xu, Siyuan Feng, Eric Cousineau, Yilun Du, Benjamin Burchfiel, Russ Tedrake, and Shuran Song. Diffusion policy: Visuomotor policy learning via action diffusion. The International Journal of Robotics Research, page 02783649241273668, 2023

work page 2023

[12] [12]

Accelerating robotic reinforcement learning via parameterized action primitives

Murtaza Dalal, Deepak Pathak, and Russ R Salakhutdinov. Accelerating robotic reinforcement learning via parameterized action primitives. Advances in Neural Information Processing Systems, 34:21847–21859, 2021

work page 2021

[13] [13]

Hierarchical relative entropy policy search

Christian Daniel, Gerhard Neumann, Oliver Kroemer, and Jan Peters. Hierarchical relative entropy policy search. Journal of Machine Learning Research, 17(93):1–50, 2016

work page 2016

[14] [14]

Feudal reinforcement learning

Peter Dayan and Geoffrey E Hinton. Feudal reinforcement learning. Advances in neural information processing systems, 5, 1992

work page 1992

[15] [15]

Learning transferable sub-goals by hypothesizing generalizing features

Anita de Mello Koch, Akhil Bagaria, Bingnan Huo, Zhiyuan Zhou, Cameron Allen, and George Konidaris. Learning transferable sub-goals by hypothesizing generalizing features. 2025

work page 2025

[16] [16]

Hierarchical reinforcement learning with the maxq value function decomposition

Thomas G Dietterich. Hierarchical reinforcement learning with the maxq value function decomposition. Journal of artificial intelligence research, 13:227–303, 2000

work page 2000

[17] [17]

Revisiting fundamentals of experience replay

William Fedus, Prajit Ramachandran, Rishabh Agarwal, Yoshua Bengio, Hugo Larochelle, Mark Rowland, and Will Dabney. Revisiting fundamentals of experience replay. In International conference on machine learning, pages 3061–3071. PMLR, 2020

work page 2020

[18] [18]

Multi-Level Discovery of Deep Options

Roy Fox, Sanjay Krishnan, Ion Stoica, and Ken Goldberg. Multi-level discovery of deep options. arXiv preprint arXiv:1703.08294, 2017

work page internal anchor Pith review Pith/arXiv arXiv 2017

[19] [19]

Unsupervised zero-shot reinforcement learning via functional reward encodings

Kevin Frans, Seohong Park, Pieter Abbeel, and Sergey Levine. Unsupervised zero-shot reinforcement learning via functional reward encodings. In Ruslan Salakhutdinov, Zico Kolter, Katherine Heller, Adrian Weller, Nuria Oliver, Jonathan Scarlett, and Felix Berkenkamp, editors, Proceedings of the 41st International Conference on Machine Learning, volume 2...

work page 2024

[20] [20]

D4RL: Datasets for Deep Data-Driven Reinforcement Learning

Justin Fu, Aviral Kumar, Ofir Nachum, George Tucker, and Sergey Levine. D4RL: Datasets for deep data-driven reinforcement learning. arXiv preprint arXiv:2004.07219, 2020

work page internal anchor Pith review Pith/arXiv arXiv 2020

[21] [21]

Hierarchical skills for efficient exploration

Jonas Gehring, Gabriel Synnaeve, Andreas Krause, and Nicolas Usunier. Hierarchical skills for efficient exploration. Advances in Neural Information Processing Systems, 34:11553–11564, 2021

work page 2021

[22] [22]

One act play: Single demonstration behavior cloning with action chunking transformers

Abraham George and Amir Barati Farimani. One act play: Single demonstration behavior cloning with action chunking transformers. arXiv preprint arXiv:2309.10175, 2023

work page arXiv 2023

[23] [23]

Emaq: Expected-max q-learning operator for simple yet effective offline and online rl

Seyed Kamyar Seyed Ghasemipour, Dale Schuurmans, and Shixiang Shane Gu. Emaq: Expected-max q-learning operator for simple yet effective offline and online rl. In International Conference on Machine Learning, pages 3682–3691. PMLR, 2021

work page 2021

[24] [24]

Soft actor-critic: Off- policy maximum entropy deep reinforcement learning with a stochastic actor

Tuomas Haarnoja, Aurick Zhou, Pieter Abbeel, and Sergey Levine. Soft actor-critic: Off- policy maximum entropy deep reinforcement learning with a stochastic actor. In International conference on machine learning, pages 1861–1870. PMLR, 2018

work page 2018

[25] [25]

Rainbow: Combining improve- ments in deep reinforcement learning

Matteo Hessel, Joseph Modayil, Hado Van Hasselt, Tom Schaul, Georg Ostrovski, Will Dabney, Dan Horgan, Bilal Piot, Mohammad Azar, and David Silver. Rainbow: Combining improve- ments in deep reinforcement learning. In Proceedings of the AAAI conference on artificial intelligence, volume 32, 2018

work page 2018

[26] [26]

Kl divergence of max-of-n, 2023

Jacob Hilton. Kl divergence of max-of-n, 2023. URL https://www.jacobh.co.uk/ bon_kl.pdf

work page 2023

[27] [27]

Unsupervised behavior extraction via random intent priors

Hao Hu, Yiqin Yang, Jianing Ye, Ziqing Mai, and Chongjie Zhang. Unsupervised behavior extraction via random intent priors. In Thirty-seventh Conference on Neural Information Pro- cessing Systems, 2023. URL https://openreview.net/forum?id=4vGVQVz5KG

work page 2023

[28] [28]

Recurrent experience replay in distributed reinforcement learning

Steven Kapturowski, Georg Ostrovski, John Quan, Remi Munos, and Will Dabney. Recurrent experience replay in distributed reinforcement learning. In International conference on learning representations, 2018

work page 2018

[29] [29]

Variational temporal abstraction

Taesup Kim, Sungjin Ahn, and Yoshua Bengio. Variational temporal abstraction. Advances in Neural Information Processing Systems, 32, 2019

work page 2019

[30] [30]

Policy search for motor primitives in robotics

Jens Kober and Jan Peters. Policy search for motor primitives in robotics. Advances in neural information processing systems, 21, 2008

work page 2008

[31] [31]

Autonomous robot skill acquisition

George Dimitri Konidaris. Autonomous robot skill acquisition. University of Massachusetts Amherst, 2011

work page 2011

[32] [32]

Offline Reinforcement Learning with Implicit Q-Learning

Ilya Kostrikov, Ashvin Nair, and Sergey Levine. Offline reinforcement learning with implicit Q-learning. arXiv preprint arXiv:2110.06169, 2021

work page internal anchor Pith review Pith/arXiv arXiv 2021

[33] [33]

Revisiting peng’s q (λ) for modern reinforcement learning

Tadashi Kozuno, Yunhao Tang, Mark Rowland, Rémi Munos, Steven Kapturowski, Will Dabney, Michal Valko, and David Abel. Revisiting peng’s q (λ) for modern reinforcement learning. In International Conference on Machine Learning, pages 5794–5804. PMLR, 2021

work page 2021

[34] [34]

Hierarchical deep reinforcement learning: Integrating temporal abstraction and intrinsic motivation

Tejas D Kulkarni, Karthik Narasimhan, Ardavan Saeedi, and Josh Tenenbaum. Hierarchical deep reinforcement learning: Integrating temporal abstraction and intrinsic motivation. Advances in neural information processing systems, 29, 2016

work page 2016

[35] [35]

Conservative Q-learning for offline reinforcement learning

Aviral Kumar, Aurick Zhou, George Tucker, and Sergey Levine. Conservative Q-learning for offline reinforcement learning. Advances in Neural Information Processing Systems, 33: 1179–1191, 2020

work page 2020

[36] [36]

Offline-to-online reinforcement learning via balanced replay and pessimistic Q-ensemble

Seunghyun Lee, Younggyo Seo, Kimin Lee, Pieter Abbeel, and Jinwoo Shin. Offline-to-online reinforcement learning via balanced replay and pessimistic Q-ensemble. In Conference on Robot Learning, pages 1702–1712. PMLR, 2022

work page 2022

[37] [37]

TOP-ERL: Transformer-based off-policy episodic reinforcement learning

Ge Li, Dong Tian, Hongyi Zhou, Xinkai Jiang, Rudolf Lioutikov, and Gerhard Neumann. TOP-ERL: Transformer-based off-policy episodic reinforcement learning. In The Thirteenth International Conference on Learning Representations, 2025. URL https://openreview. net/forum?id=N4NhVN30ph

work page 2025

[38] [38]

Accelerating ex- ploration with unlabeled prior data

Qiyang Li, Jason Zhang, Dibya Ghosh, Amy Zhang, and Sergey Levine. Accelerating ex- ploration with unlabeled prior data. Advances in Neural Information Processing Systems, 36, 2024

work page 2024

[39] [39]

Continuous control with deep reinforcement learning

Timothy P Lillicrap, Jonathan J Hunt, Alexander Pritzel, Nicolas Heess, Tom Erez, Yuval Tassa, David Silver, and Daan Wierstra. Continuous control with deep reinforcement learning. arXiv preprint arXiv:1509.02971, 2015. 13

work page internal anchor Pith review Pith/arXiv arXiv 2015

[40] [40]

Flow Straight and Fast: Learning to Generate and Transfer Data with Rectified Flow

Xingchao Liu, Chengyue Gong, and Qiang Liu. Flow straight and fast: Learning to generate and transfer data with rectified flow. arXiv preprint arXiv:2209.03003, 2022

work page internal anchor Pith review Pith/arXiv arXiv 2022

[41] [41]

Finetuning from offline reinforcement learning: Challenges, trade-offs and practical solutions

Yicheng Luo, Jackie Kay, Edward Grefenstette, and Marc Peter Deisenroth. Finetuning from offline reinforcement learning: Challenges, trade-offs and practical solutions. arXiv preprint arXiv:2303.17396, 2023

work page arXiv 2023

[42] [42]

What Matters in Learning from Offline Human Demonstrations for Robot Manipulation

Ajay Mandlekar, Danfei Xu, Josiah Wong, Soroush Nasiriany, Chen Wang, Rohun Kulkarni, Li Fei-Fei, Silvio Savarese, Yuke Zhu, and Roberto Martín-Martín. What matters in learning from offline human demonstrations for robot manipulation. In arXiv preprint arXiv:2108.03298, 2021

work page internal anchor Pith review Pith/arXiv arXiv 2021

[43] [43]

Dynamic abstraction in reinforcement learning via clustering

Shie Mannor, Ishai Menache, Amit Hoze, and Uri Klein. Dynamic abstraction in reinforcement learning via clustering. In Proceedings of the twenty-first international conference on Machine learning, page 71, 2004

work page 2004

[44] [44]

Q-cut—dynamic discovery of sub-goals in reinforcement learning

Ishai Menache, Shie Mannor, and Nahum Shimkin. Q-cut—dynamic discovery of sub-goals in reinforcement learning. In Machine Learning: ECML 2002: 13th European Conference on Machine Learning Helsinki, Finland, August 19–23, 2002 Proceedings 13, pages 295–306. Springer, 2002

work page 2002

[45] [45]

Neural probabilistic motor primitives for humanoid control

Josh Merel, Leonard Hasenclever, Alexandre Galashov, Arun Ahuja, Vu Pham, Greg Wayne, Yee Whye Teh, and Nicolas Heess. Neural probabilistic motor primitives for humanoid control. arXiv preprint arXiv:1811.11711, 2018

work page internal anchor Pith review Pith/arXiv arXiv 2018

[46] [46]

Asynchronous methods for deep reinforce- ment learning

V olodymyr Mnih, Adria Puigdomenech Badia, Mehdi Mirza, Alex Graves, Timothy Lillicrap, Tim Harley, David Silver, and Koray Kavukcuoglu. Asynchronous methods for deep reinforce- ment learning. In International conference on machine learning, pages 1928–1937. PmLR, 2016

work page 1928

[47] [47]

Data-efficient hierarchical reinforcement learning

Ofir Nachum, Shixiang Shane Gu, Honglak Lee, and Sergey Levine. Data-efficient hierarchical reinforcement learning. Advances in neural information processing systems, 31, 2018

work page 2018

[48] [48]

AWAC: Accelerating Online Reinforcement Learning with Offline Datasets

Ashvin Nair, Abhishek Gupta, Murtaza Dalal, and Sergey Levine. Awac: Accelerating online reinforcement learning with offline datasets. arXiv preprint arXiv:2006.09359, 2020

work page internal anchor Pith review Pith/arXiv arXiv 2006

[49] [49]

Cal-QL: Calibrated offline RL pre-training for efficient online fine-tuning

Mitsuhiko Nakamoto, Simon Zhai, Anikait Singh, Max Sobol Mark, Yi Ma, Chelsea Finn, Aviral Kumar, and Sergey Levine. Cal-QL: Calibrated offline RL pre-training for efficient online fine-tuning. Advances in Neural Information Processing Systems, 36, 2024

work page 2024

[50] [50]

Learning and retrieval from prior data for skill-based imitation learning

Soroush Nasiriany, Tian Gao, Ajay Mandlekar, and Yuke Zhu. Learning and retrieval from prior data for skill-based imitation learning. In Conference on Robot Learning, 2022

work page 2022

[51] [51]

Value prediction network

Junhyuk Oh, Satinder Singh, and Honglak Lee. Value prediction network. Advances in neural information processing systems, 30, 2017

work page 2017

[52] [52]

Probabilistic movement primitives

Alexandros Paraschos, Christian Daniel, Jan R Peters, and Gerhard Neumann. Probabilistic movement primitives. Advances in neural information processing systems, 26, 2013

work page 2013

[53] [53]

Ogbench: Benchmarking offline goal-conditioned rl

Seohong Park, Kevin Frans, Benjamin Eysenbach, and Sergey Levine. Ogbench: Benchmarking offline goal-conditioned rl. ArXiv, 2024

work page 2024

[54] [54]

OGBench: Bench- marking Offline Goal-Conditioned RL, February 2025

Seohong Park, Kevin Frans, Benjamin Eysenbach, and Sergey Levine. Ogbench: Benchmarking offline goal-conditioned rl. arXiv preprint arXiv:2410.20092, 2024

work page arXiv 2024

[55] [55]

Foundation policies with hilbert rep- resentations

Seohong Park, Tobias Kreiman, and Sergey Levine. Foundation policies with hilbert rep- resentations. In Forty-first International Conference on Machine Learning , 2024. URL https://openreview.net/forum?id=LhNsSaAKub

work page 2024

[56] [56]

Flow Q-learning

Seohong Park, Qiyang Li, and Sergey Levine. Flow Q-learning. arXiv preprint arXiv:2502.02538, 2025

work page arXiv 2025

[57] [57]

Deeploco: Dynamic locomotion skills using hierarchical deep reinforcement learning

Xue Bin Peng, Glen Berseth, KangKang Yin, and Michiel Van De Panne. Deeploco: Dynamic locomotion skills using hierarchical deep reinforcement learning. Acm transactions on graphics (tog), 36(4):1–13, 2017. 14

[58]

Karl Pertsch, Youngwoon Lee, and Joseph Lim. Accelerating reinforcement learning with learned skill priors. In Conference on robot learning, pages 188–204. PMLR, 2021

[59]

Allen Z Ren, Justin Lidard, Lars L Ankile, Anthony Simeonov, Pulkit Agrawal, Anirudha Majumdar, Benjamin Burchfiel, Hongkai Dai, and Max Simchowitz. Diffusion policy policy optimization. arXiv preprint arXiv:2409.00588, 2024

[60]

Martin Riedmiller, Roland Hafner, Thomas Lampe, Michael Neunert, Jonas Degrave, Tom Wiele, Vlad Mnih, Nicolas Heess, and Jost Tobias Springenberg. Learning by playing solving sparse reward tasks from scratch. In International conference on machine learning, pages 4344–4353. PMLR, 2018

[61]

Stefan Schaal. Dynamic movement primitives-a framework for motor control in humans and humanoid robotics. In Adaptive motion of animals and machines, pages 261–280. Springer, 2006

[62]

Julian Schrittwieser, Ioannis Antonoglou, Thomas Hubert, Karen Simonyan, Laurent Sifre, Simon Schmitt, Arthur Guez, Edward Lockhart, Demis Hassabis, Thore Graepel, et al. Mastering Atari, Go, chess and shogi by planning with a learned model. Nature, 588(7839):604–609, 2020

[63]

Younggyo Seo and Pieter Abbeel. Reinforcement learning with action sequence for data-efficient robot learning. arXiv preprint arXiv:2411.12155, 2024

[64]

Younggyo Seo, Jafar Uruç, and Stephen James. Continuous control with coarse-to-fine reinforcement learning. In 8th Annual Conference on Robot Learning, 2024. URL https://openreview.net/forum?id=WjDR48cL3O

[65]

Tanmay Shankar and Abhinav Gupta. Learning robot skills with temporal variational inference. In International Conference on Machine Learning, pages 8624–8633. PMLR, 2020

[66]

Özgür Şimşek and Andrew G Barto. Using relative novelty to identify useful temporal abstractions in reinforcement learning. In Proceedings of the twenty-first international conference on Machine learning, page 95, 2004

[67]

Özgür Şimşek and Andrew G. Barto. Betweenness centrality as a basis for forming skills. Working paper, University of Massachusetts Amherst, April 2007

[68]

Avi Singh, Huihan Liu, Gaoyue Zhou, Albert Yu, Nicholas Rhinehart, and Sergey Levine. Parrot: Data-driven behavioral priors for reinforcement learning. In International Conference on Learning Representations, 2021. URL https://openreview.net/forum?id=Ysuv-WOFeKR

[69]

Yuda Song, Yifei Zhou, Ayush Sekhari, Drew Bagnell, Akshay Krishnamurthy, and Wen Sun. Hybrid RL: Using both offline and online data can make RL efficient. In The Eleventh International Conference on Learning Representations, 2023. URL https://openreview.net/forum?id=yyBis80iUuU

[70]

Aravind Srinivas, Ramnandan Krishnamurthy, Peeyush Kumar, and Balaraman Ravindran. Option discovery in hierarchical reinforcement learning using spatio-temporal clustering. arXiv preprint arXiv:1605.05359, 2016

[71]

Nisan Stiennon, Long Ouyang, Jeffrey Wu, Daniel Ziegler, Ryan Lowe, Chelsea Voss, Alec Radford, Dario Amodei, and Paul F Christiano. Learning to summarize with human feedback. Advances in neural information processing systems, 33:3008–3021, 2020

[72]

Richard S Sutton, Andrew G Barto, et al. Reinforcement learning: An introduction, volume 1. MIT press Cambridge, 1998

[73]

Richard S Sutton, Doina Precup, and Satinder Singh. Between MDPs and semi-MDPs: A framework for temporal abstraction in reinforcement learning. Artificial intelligence, 112(1-2): 181–211, 1999

[74]

Denis Tarasov, Vladislav Kurenkov, Alexander Nikulin, and Sergey Kolesnikov. Revisiting the minimalist approach to offline reinforcement learning. Advances in Neural Information Processing Systems, 36, 2024.

[75]

Dong Tian, Ge Li, Hongyi Zhou, Onur Celik, and Gerhard Neumann. Chunking the critic: A transformer-based soft actor-critic with n-step returns. arXiv preprint arXiv:2503.03660, 2025

[76]

Ahmed Touati, Jérémy Rapin, and Yann Ollivier. Does zero-shot reinforcement learning exist? In The Eleventh International Conference on Learning Representations, 2022

[77]

Alexander Vezhnevets, Volodymyr Mnih, Simon Osindero, Alex Graves, Oriol Vinyals, John Agapiou, et al. Strategic attentive writer for learning macro-actions. Advances in neural information processing systems, 29, 2016

[78]

Alexander Sasha Vezhnevets, Simon Osindero, Tom Schaul, Nicolas Heess, Max Jaderberg, David Silver, and Koray Kavukcuoglu. Feudal networks for hierarchical reinforcement learning. In International conference on machine learning, pages 3540–3549. PMLR, 2017

[79]

Shenzhi Wang, Qisen Yang, Jiawei Gao, Matthieu Lin, Hao Chen, Liwei Wu, Ning Jia, Shiji Song, and Gao Huang. Train once, get a family: State-adaptive balances for offline-to-online reinforcement learning. Advances in Neural Information Processing Systems, 36:47081–47104, 2023

[80]

Christopher John Cornish Hellaby Watkins et al. Learning from delayed rewards. 1989
