arxiv: 2507.07969 · v4 · submitted 2025-07-10 · 💻 cs.LG · cs.AI · cs.RO · stat.ML

Reinforcement Learning with Action Chunking

Pith reviewed 2026-05-19 05:14 UTC · model grok-4.3

classification 💻 cs.LG · cs.AI · cs.RO · stat.ML
keywords reinforcement learning · action chunking · offline-to-online RL · Q-chunking · temporal difference learning · long-horizon tasks · sparse rewards · exploration

The pith

Running reinforcement learning in a chunked action space lets agents use consistent sequences from offline data for better exploration and more stable learning.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces Q-chunking, a method that applies action chunking to temporal difference reinforcement learning in the offline-to-online setting. By running the algorithm directly in a space of action sequences rather than single steps, the agent can draw on temporally consistent behaviors present in an offline dataset to guide exploration. This setup also permits unbiased n-step backups that stabilize value estimates and speed up learning. The approach targets long-horizon tasks with sparse rewards, where standard methods often fail due to poor exploration. Experiments indicate stronger offline performance and higher online sample efficiency than prior offline-to-online techniques on manipulation tasks.

Core claim

Q-chunking adopts action chunking by directly running RL in a 'chunked' action space, enabling the agent to leverage temporally consistent behaviors from offline data for more effective online exploration and to use unbiased n-step backups for more stable and efficient TD learning.
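To make the n-step backup in the core claim concrete, here is a minimal sketch of a chunk-level TD target. It is not the authors' implementation: the function name `chunked_td_target`, the scalar `next_q` bootstrap value, and the `done` flag are assumptions made for illustration. The idea is that when the critic scores a whole chunk of h actions, the h rewards observed while executing that chunk can be summed directly into the target.

```python
import numpy as np


def chunked_td_target(rewards, next_q, gamma, done=False):
    """Hypothetical h-step TD target for a critic that scores whole action chunks.

    rewards: the h per-step rewards observed while executing one chunk.
    next_q:  critic value of the next chunk at the state reached after the chunk.
    gamma:   per-step discount factor.
    done:    True if the episode ended inside the chunk (no bootstrap).
    """
    rewards = np.asarray(rewards, dtype=np.float64)
    h = rewards.shape[0]
    # Discounted sum of the rewards collected inside the chunk.
    discounts = gamma ** np.arange(h)
    n_step_return = float(np.dot(discounts, rewards))
    # Bootstrap from the critic h steps ahead unless the episode terminated.
    bootstrap = 0.0 if done else (gamma ** h) * float(next_q)
    return n_step_return + bootstrap
```

With a chunk of length 3 that receives a reward only on its last step, `chunked_td_target([0.0, 0.0, 1.0], next_q=2.0, gamma=0.5)` combines the discounted reward 0.25 with the discounted bootstrap 0.25.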

What carries the argument

The chunked action space, in which the policy selects sequences of future actions rather than one action at each timestep.

If this is right

  • Q-chunking achieves strong performance after offline training and high sample efficiency during the online phase.
  • The method outperforms prior best offline-to-online RL algorithms on long-horizon sparse-reward manipulation tasks.
  • Temporal difference learning becomes more stable and efficient through the use of unbiased n-step backups.
  • Online exploration improves because the agent can commit to temporally consistent action sequences drawn from the prior data.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Chunking might still help when offline data is somewhat noisy if paired with simple filtering of inconsistent sequences.
  • The same idea could link imitation-learning techniques more directly to value-based online optimization without extra machinery.
  • Applying the approach to navigation or locomotion tasks would test whether the benefits extend past the manipulation domains evaluated here.
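The filtering idea in the first extension could be sketched as below. This is purely illustrative and not something the paper proposes: the roughness score (mean squared change between consecutive actions), the function names, and the threshold are all assumptions, and other consistency measures may suit a given dataset better.

```python
import numpy as np


def chunk_roughness(chunks):
    """Hypothetical inconsistency score for each action chunk.

    chunks: array of shape (num_chunks, h, action_dim) with h >= 2.
    Returns the mean squared change between consecutive actions in each chunk.
    """
    chunks = np.asarray(chunks, dtype=np.float64)
    step_changes = np.diff(chunks, axis=1)            # (num_chunks, h - 1, action_dim)
    squared_norms = (step_changes ** 2).sum(axis=-1)  # (num_chunks, h - 1)
    return squared_norms.mean(axis=1)                 # (num_chunks,)


def filter_chunks(chunks, max_roughness):
    """Keep only chunks whose roughness score is at most max_roughness."""
    chunks = np.asarray(chunks, dtype=np.float64)
    keep = chunk_roughness(chunks) <= max_roughness
    return chunks[keep], keep
```

A chunk that holds the same action throughout scores zero, while one that flips sign at every step scores high and would be dropped for any small threshold.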

Load-bearing premise

The offline dataset contains temporally consistent action sequences that remain useful when the policy is optimized online inside the chunked action space.

What would settle it

Experiments on long-horizon sparse-reward tasks that use an offline dataset lacking consistent action chunks at the sequence level, showing no gains in online exploration or sample efficiency compared to standard non-chunked RL.

Figures

Figures reproduced from arXiv: 2507.07969 by Qiyang Li, Sergey Levine, Zhiyuan Zhou.

Figure 1: Q-chunking uses action chunking to enable fast value backups and effective exploration with temporally coherent actions. left: an overview of our approach: Q-chunking operates in a temporally extended action space that allows for (1) efficient value backups and (2) effective exploration via temporally coherent actions; right: Our method (QC) first pre-trains on an offline dataset for 1M steps (grey) and th…
Figure 2: Naïvely using action chunking for online RL with Gaussian policies leads to poor performance. (1) RLPD runs online RL on both offline data and online replay buffer [7]. (2) RLPD-AC is the same algorithm as RLPD but operates in a temporally extended action space (action chunk size of 5). (3) QC-RLPD additionally uses a behavior cloning loss on the actor (4 seeds). the temporally extended action space compar…
Figure 3: Aggregated performance per OGBench domain. Our method, QC, achieves strong performance across all five challenging OGBench domains. We also include an aggregation performance plot for all the domains at the bottom right. The first 1M steps are offline training and the next 1M steps are online training with one environment step per training step (4 seeds per task; 5 tasks per domain). Algorithm 1 QC Input: …
Figure 4: Robomimic results. QC achieves strong performance across all three robomimic tasks. The first 1M steps are offline and the next 1M steps are online with one environment step per training step (5 seeds). [Plot residue: panels "OGBench (25 tasks)" and "robomimic (3 …"; x-axis Steps (×10^6), y-axis Success Rate; legend QC, QC-FQL, BFN-n, FQL-n, BFN, FQL.]
Figure 5: QC-FQL and n-step return on OGBench and robomimic. QC-FQL obtains a similar performance compared to QC. QC is slightly better than QC-FQL on OGBench offline and robomimic online, and slightly worse than QC-FQL on robomimic offline (4 seeds for OGBench, 5 seeds for robomimic). See Appendix D,
Figure 6: Sensitivity analysis: action chunk size (h), critic ensemble size (K), and update-to-data ratio (UTD). Left: QC-FQL with different h on all 5 cube-triple tasks (5 seeds). QC-FQL with h = 1 is equivalent to FQL. Center: Increasing the ensemble size to K = 10 improves performance of both QC and BFN on cube-triple-task3 (5 seeds). Right: QC with UTD of 5 on cube-triple-task3 (5 seeds). We report only the onli…
Figure 7: End-effector movements early in the training and temporal coherency analysis on cube-triple-task3. Left: QC covers a more diverse set of states compared to BFN in the first 1000 environment steps. Right: QC exhibits a higher temporal coherency in end-effector compared to BFN (4 seeds). methods. In the online phase (in white), QC shows strong sample-efficiency, especially on the two hardest OGBench domains …
Figure 8: We experiment on several challenging long-horizon, sparse-reward domains. See detailed task description for each domain in Appendix A. The rendered images of the robomimic tasks above are taken from Mandlekar et al. [42]. • square: This task requires the robot arm to pick a square nut and place it on a rod. The nut is slightly bigger than the rod and requires the arm to move precisely to complete the task …
Figure 9: End-effector trajectory early in the training. Each subplot above shows the trajectory for a consecutive of 1000 time steps. We include up to Step 9000.
Figure 10: End-effector trajectory visualization late in the training. Each subplot above shows the trajectory for a consecutive of 1000 time steps. We include the trajectories from Step 900000 to Step 99000.
Figure 11: Full OGBench results by task. For each method on each task, we use 4 seeds.
Figure 12: Full OGBench results by task. For each method on each task, we use 4 seeds.
Figure 13: Full RLPD results by task. For each method on each task, we use 4 seeds. QC-RLPD is RLPD-AC (RLPD on the temporally extended action space) where we additionally add a fixed behavior cloning coefficient of 0.01.
Figure 14: Full robomimic ablation by task. For each method on each task, we use 5 seeds.
Figure 15: How long does each method take for one step in milliseconds. Left: offline. Right: online (one agent training step and an environment step). The runtime is measured using the default hyperparameters in our paper on cube-triple-task1 on a single RTX-A5000.
Original abstract

We present Q-chunking, a simple yet effective recipe for improving reinforcement learning (RL) algorithms for long-horizon, sparse-reward tasks. Our recipe is designed for the offline-to-online RL setting, where the goal is to leverage an offline prior dataset to maximize the sample-efficiency of online learning. Effective exploration and sample-efficient learning remain central challenges in this setting, as it is not obvious how the offline data should be utilized to acquire a good exploratory policy. Our key insight is that action chunking, a technique popularized in imitation learning where sequences of future actions are predicted rather than a single action at each timestep, can be applied to temporal difference (TD)-based RL methods to mitigate the exploration challenge. Q-chunking adopts action chunking by directly running RL in a 'chunked' action space, enabling the agent to (1) leverage temporally consistent behaviors from offline data for more effective online exploration and (2) use unbiased $n$-step backups for more stable and efficient TD learning. Our experimental results demonstrate that Q-chunking exhibits strong offline performance and online sample efficiency, outperforming prior best offline-to-online methods on a range of long-horizon, sparse-reward manipulation tasks.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces Q-chunking, a recipe for offline-to-online RL on long-horizon sparse-reward tasks. It runs TD-based RL directly inside a chunked action space so that the agent can (1) exploit temporally consistent action sequences present in the offline dataset for more effective online exploration and (2) perform unbiased n-step backups. Experiments are reported to show improved offline performance and online sample efficiency relative to prior offline-to-online baselines on manipulation tasks.

Significance. If the central claims hold, the work supplies a lightweight, algorithm-agnostic way to convert existing TD methods into more sample-efficient offline-to-online learners by borrowing the action-chunking idea from imitation learning. The small amount of added machinery (the main new design choice is the action chunk size) and the direct applicability to standard TD updates are practical strengths that could affect how practitioners initialize exploration from offline data.

major comments (2)
  1. The load-bearing assumption that offline trajectories contain reusable, temporally consistent chunks whose internal structure survives online policy optimization inside the coarser chunked action space is stated but not empirically tested. No chunk-level consistency metric, filtering step, or regularization term is introduced to enforce or recover this property when the original data policy varies within what becomes a single chunk.
  2. The claim of unbiased n-step backups (Abstract and method description) requires a formal argument showing that the chunked transition and reward definitions preserve the unbiasedness of the multi-step estimator; without this derivation or an explicit equation relating the chunked Bellman operator to the original one, it is unclear whether the reported stability gain is a consequence of chunking or of other implementation choices.
minor comments (2)
  1. Add error bars, number of seeds, and a clear statement of the full experimental protocol (including how chunks are formed from the offline dataset) so that the outperformance numbers can be reproduced and statistically evaluated.
  2. Clarify the precise definition of the chunked action space and the corresponding state-transition and reward functions; a short pseudocode block or equation would remove ambiguity about how standard TD updates are applied inside the new space.
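As an illustration of the kind of specification the minor comments ask for, the following sketch shows one plausible way to turn a single offline trajectory into chunk-level transitions. It is an assumption, not the paper's protocol: the sliding window with stride 1, the flattened chunk layout, the dictionary keys, and the function name are all hypothetical, and episode boundaries are ignored by treating the input as one trajectory.

```python
import numpy as np


def make_chunk_transitions(states, actions, rewards, h):
    """Hypothetical conversion of one trajectory into chunk-level transitions.

    states:  (T + 1, state_dim), including the final state.
    actions: (T, action_dim).
    rewards: (T,).
    h:       chunk length, 1 <= h <= T.
    Every start index t yields (s_t, a_{t:t+h}, r_{t:t+h}, s_{t+h}).
    """
    states = np.asarray(states)
    actions = np.asarray(actions)
    rewards = np.asarray(rewards)
    T = actions.shape[0]
    if h < 1 or h > T:
        raise ValueError("chunk length h must satisfy 1 <= h <= T")
    starts = np.arange(T - h + 1)
    # Row i holds the h time indices covered by the chunk starting at starts[i].
    idx = starts[:, None] + np.arange(h)[None, :]
    return {
        "state": states[starts],
        # Each chunk is flattened into one vector of size h * action_dim.
        "action_chunk": actions[idx].reshape(len(starts), -1),
        "rewards": rewards[idx],
        "next_state": states[starts + h],
    }
```

The flattened `action_chunk` is what a critic over the chunked action space would take as its action input, and `rewards` holds the per-step rewards that enter the multi-step target.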

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We address each major comment below and describe the revisions we will incorporate to improve the manuscript.

Point-by-point responses
  1. Referee: The load-bearing assumption that offline trajectories contain reusable, temporally consistent chunks whose internal structure survives online policy optimization inside the coarser chunked action space is stated but not empirically tested. No chunk-level consistency metric, filtering step, or regularization term is introduced to enforce or recover this property when the original data policy varies within what becomes a single chunk.

    Authors: We agree that a direct empirical test of the chunk-consistency assumption would strengthen the presentation. Although the reported performance gains on long-horizon manipulation tasks are consistent with the assumption that temporally coherent sequences in the offline data remain useful under chunked online optimization, the manuscript does not contain an explicit consistency metric or analysis. In the revised version we will add a short empirical subsection that quantifies intra-chunk action variance both in the offline dataset and throughout online training, thereby providing concrete evidence that the low-variance structure is present and largely preserved. revision: yes

  2. Referee: The claim of unbiased n-step backups (Abstract and method description) requires a formal argument showing that the chunked transition and reward definitions preserve the unbiasedness of the multi-step estimator; without this derivation or an explicit equation relating the chunked Bellman operator to the original one, it is unclear whether the reported stability gain is a consequence of chunking or of other implementation choices.

    Authors: We appreciate the request for a formal justification. In the chunked formulation the agent executes a fixed sequence of actions over the chunk horizon; the accumulated reward is the discounted sum of the per-step rewards and the successor state is the state reached after the entire chunk. Because the critic conditions on the entire action sequence, this construction yields an n-step return that is an unbiased sample of the return from executing the chosen chunk, and the corresponding Bellman operator remains a contraction over the chunked action space. We will insert a concise derivation (including the explicit relation between the chunked and standard n-step targets) into the methods section of the revision to clarify this point. revision: yes
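
The relation between the two targets discussed in the second response can be sketched as follows. The notation is an assumption of this sketch rather than the paper's: h is the chunk length, a_{t:t+h} denotes the h actions a_t, ..., a_{t+h-1}, and π is the current policy.

```latex
% Standard n-step target with a single-action critic Q(s, a):
y^{(n)}_t = \sum_{k=0}^{n-1} \gamma^{k} r_{t+k}
  + \gamma^{n}\, Q\big(s_{t+n}, a_{t+n}\big),
\qquad a_{t+n} \sim \pi(\cdot \mid s_{t+n}).
% The intermediate actions a_{t+1}, ..., a_{t+n-1} come from the data-collecting
% policy, so the reward sum is off-policy whenever that policy differs from \pi.

% Chunked target with a critic over action sequences Q(s, a_{t:t+h}):
y^{\mathrm{chunk}}_t = \sum_{k=0}^{h-1} \gamma^{k} r_{t+k}
  + \gamma^{h}\, Q\big(s_{t+h}, a_{t+h:t+2h}\big),
\qquad a_{t+h:t+2h} \sim \pi(\cdot \mid s_{t+h}).
```

In the chunked case every action that produced the summed rewards is an input to the critic being regressed, so the reward sum does not depend on which policy chose those actions. This is the sense in which the multi-step backup can be called unbiased; it says nothing about whether restricting the policy to open-loop chunks loses optimality.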

Circularity Check

0 steps flagged

No circularity: method is a direct recipe on existing TD algorithms

full rationale

The paper presents Q-chunking as applying the existing concept of action chunking directly to the action space of standard TD-based RL algorithms in the offline-to-online setting. The two claimed benefits (leveraging consistent behaviors from offline data and unbiased n-step backups) follow immediately from the definition of operating in a chunked action space; they are not derived via any fitted parameter, self-referential equation, or load-bearing self-citation that reduces the result to its own inputs. No equations or steps in the provided abstract or description collapse the claimed improvements back onto data used for evaluation. The derivation remains self-contained against external RL benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The approach rests on standard RL assumptions plus the domain assumption that offline data contains temporally consistent chunks; no new free parameters or invented entities are introduced in the abstract description.

axioms (2)
  • domain assumption Offline dataset contains temporally consistent action sequences usable for exploration
    Invoked in the key insight paragraph to justify effective online exploration.
  • domain assumption n-step backups remain unbiased when performed over action chunks
    Stated as part of the benefit of the chunked space.

pith-pipeline@v0.9.0 · 5747 in / 1283 out tokens · 33123 ms · 2026-05-19T05:14:27.284241+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

  • IndisputableMonolith/Cost/FunctionalEquation.lean washburn_uniqueness_aczel unclear

    Relation between the paper passage and the cited Recognition theorem.

    Q-chunking adopts action chunking by directly running RL in a 'chunked' action space, enabling the agent to (1) leverage temporally consistent behaviors from offline data for more effective online exploration and (2) use unbiased n-step backups for more stable and efficient TD learning.

What do these tags mean?

  • matches: The paper's claim is directly supported by a theorem in the formal canon.
  • supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
  • extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
  • uses: The paper appears to rely on the theorem as machinery.
  • contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
  • unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 12 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Aligning Flow Map Policies with Optimal Q-Guidance

    cs.LG 2026-05 unverdicted novelty 7.0

    Flow map policies enable fast one-step inference for flow-based RL policies, and FMQ provides an optimal closed-form Q-guided target for offline-to-online adaptation under trust-region constraints, achieving SOTA performance.

  2. Revisiting Policy Gradients for Restricted Policy Classes: Escaping Myopic Local Optima with $k$-step Policy Gradients

    cs.LG 2026-05 unverdicted novelty 6.0

    The k-step policy gradient converges exponentially close to the optimal deterministic policy in restricted classes, achieving O(1/T) rates under smoothness assumptions without distribution mismatch factors.

  3. Long-Horizon Q-Learning: Accurate Value Learning via n-Step Inequalities

    cs.AI 2026-05 unverdicted novelty 6.0

    LQL stabilizes Q-learning by penalizing violations of n-step action-sequence lower bounds with a hinge loss computed from standard network outputs.

  4. Long-Horizon Q-Learning: Accurate Value Learning via n-Step Inequalities

    cs.AI 2026-05 unverdicted novelty 6.0

    LQL turns n-step action-sequence lower bounds into a practical hinge-loss stabilizer for off-policy Q-learning without extra networks or forward passes.

  5. OGPO: Sample Efficient Full-Finetuning of Generative Control Policies

    cs.LG 2026-05 unverdicted novelty 6.0

    OGPO is a sample-efficient off-policy method for full finetuning of generative control policies that reaches SOTA on robotic manipulation tasks and can recover from poor behavior-cloning initializations without expert data.

  6. GSDrive: Reinforcing Driving Policies by Multi-mode Future Trajectory Probing with 3D Gaussian Splatting Environment

    cs.RO 2026-04 unverdicted novelty 6.0

    GSDrive combines IL priors with RL feedback by probing multi-mode futures inside a 3D Gaussian Splatting simulator to supply dense rewards for closed-loop driving policy improvement on nuScenes.

  7. GSDrive: Reinforcing Driving Policies by Multi-mode Future Trajectory Probing with 3D Gaussian Splatting Environment

    cs.RO 2026-04 unverdicted novelty 6.0

    GSDrive improves end-to-end driving policies through 3D Gaussian Splatting simulation and multi-mode trajectory probing that supplies dense, differentiable rewards for reinforcement learning.

  8. Empowering Multi-Robot Cooperation via Sequential World Models

    cs.RO 2025-09 unverdicted novelty 6.0

    SeqWM introduces sequential autoregressive agent-wise world models for multi-robot MBRL, outperforming baselines in performance and sample efficiency on Bi-DexHands and Multi-Quadruped tasks with physical robot deployment.

  9. COOPO: Cyclic Offline-Online Policy Optimization Algorithm

    cs.LG 2026-05 unverdicted novelty 5.0

    COOPO is a cyclic offline-online RL algorithm that repeatedly anchors the policy to a dataset via KL-regularized updates then fine-tunes online, claiming better sample efficiency and monotonic improvement under covera...

  10. DyGRO-VLA: Cross-Task Scaling of Vision-Language-Action Models via Dynamic Grouped Residual Optimization

    cs.RO 2026-05 unverdicted novelty 5.0

    DyGRO-VLA is a two-stage optimization framework for cross-task scaling of Vision-Language-Action models via dynamic grouped residual optimization in RL.

  11. ProcVLM: Learning Procedure-Grounded Progress Rewards for Robotic Manipulation

    cs.RO 2026-05 unverdicted novelty 5.0

    ProcVLM learns procedure-grounded dense progress rewards for robotic manipulation via a reasoning-before-estimation VLM trained on a 60M-frame synthesized corpus from 30 embodied datasets.

  12. RAD-2: Scaling Reinforcement Learning in a Generator-Discriminator Framework

    cs.CV 2026-04 unverdicted novelty 5.0

    RAD-2 uses a diffusion generator and RL discriminator to cut collision rates by 56% in closed-loop autonomous driving planning.

Reference graph

Works this paper leans on

93 extracted references · 93 canonical work pages · cited by 10 Pith papers · 10 internal anchors

  1. [1]

    Reincarnating reinforcement learning: Reusing prior computation to accelerate progress

    Rishabh Agarwal, Max Schwarzer, Pablo Samuel Castro, Aaron C Courville, and Marc Bellemare. Reincarnating reinforcement learning: Reusing prior computation to accelerate progress. In S. Koyejo, S. Mohamed, A. Agarwal, D. Belgrave, K. Cho, and A. Oh, editors, Advances in Neural Information Processing Systems, volume 35, pages 28955–28971. Curran Associat...

  2. [2]

    Reincarnating reinforcement learning: Reusing prior computation to accelerate progress

    Rishabh Agarwal, Max Schwarzer, Pablo Samuel Castro, Aaron C Courville, and Marc Bellemare. Reincarnating reinforcement learning: Reusing prior computation to accelerate progress. Advances in neural information processing systems, 35:28955–28971, 2022

  3. [3]

    OPAL: Offline primitive discovery for accelerating offline reinforcement learning

    Anurag Ajay, Aviral Kumar, Pulkit Agrawal, Sergey Levine, and Ofir Nachum. OPAL: Offline primitive discovery for accelerating offline reinforcement learning. In International Conference on Learning Representations, 2021. URL https://openreview.net/forum?id=V69LGwJ0lIN

  4. [4]

    The option-critic architecture

    Pierre-Luc Bacon, Jean Harb, and Doina Precup. The option-critic architecture. In Proceedings of the AAAI conference on artificial intelligence, volume 31, 2017

  5. [5]

    Option discovery using deep skill chaining

    Akhil Bagaria and George Konidaris. Option discovery using deep skill chaining. In International Conference on Learning Representations, 2019.

  6. [6]

    Effectively learning initiation sets in hierarchical reinforcement learning

    Akhil Bagaria, Ben Abbatematteo, Omer Gottesman, Matt Corsaro, Sreehari Rammohan, and George Konidaris. Effectively learning initiation sets in hierarchical reinforcement learning. Advances in Neural Information Processing Systems, 36, 2024

  7. [7]

    Efficient online reinforcement learning with offline data

    Philip J Ball, Laura Smith, Ilya Kostrikov, and Sergey Levine. Efficient online reinforcement learning with offline data. In International Conference on Machine Learning, pages 1577–1594. PMLR, 2023

  8. [8]

    Roboagent: Generalization and efficiency in robot manipulation via semantic augmentations and action chunking

    Homanga Bharadhwaj, Jay Vakil, Mohit Sharma, Abhinav Gupta, Shubham Tulsiani, and Vikash Kumar. Roboagent: Generalization and efficiency in robot manipulation via semantic augmentations and action chunking. In 2024 IEEE International Conference on Robotics and Automation (ICRA), pages 4788–4795. IEEE, 2024

  9. [9]

    Self-supervised reinforcement learning that transfers using random features

    Boyuan Chen, Chuning Zhu, Pulkit Agrawal, Kaiqing Zhang, and Abhishek Gupta. Self-supervised reinforcement learning that transfers using random features. Advances in Neural Information Processing Systems, 36, 2024

  10. [10]

    Intrinsically motivated reinforcement learning

    Nuttapong Chentanez, Andrew Barto, and Satinder Singh. Intrinsically motivated reinforcement learning. Advances in neural information processing systems, 17, 2004

  11. [11]

    Diffusion policy: Visuomotor policy learning via action diffusion

    Cheng Chi, Zhenjia Xu, Siyuan Feng, Eric Cousineau, Yilun Du, Benjamin Burchfiel, Russ Tedrake, and Shuran Song. Diffusion policy: Visuomotor policy learning via action diffusion. The International Journal of Robotics Research, page 02783649241273668, 2023

  12. [12]

    Accelerating robotic reinforcement learning via parameterized action primitives

    Murtaza Dalal, Deepak Pathak, and Russ R Salakhutdinov. Accelerating robotic reinforcement learning via parameterized action primitives. Advances in Neural Information Processing Systems, 34:21847–21859, 2021

  13. [13]

    Hierarchical relative entropy policy search

    Christian Daniel, Gerhard Neumann, Oliver Kroemer, and Jan Peters. Hierarchical relative entropy policy search. Journal of Machine Learning Research, 17(93):1–50, 2016

  14. [14]

    Feudal reinforcement learning

    Peter Dayan and Geoffrey E Hinton. Feudal reinforcement learning. Advances in neural information processing systems, 5, 1992

  15. [15]

    Learning transferable sub-goals by hypothesizing generalizing features

    Anita de Mello Koch, Akhil Bagaria, Bingnan Huo, Zhiyuan Zhou, Cameron Allen, and George Konidaris. Learning transferable sub-goals by hypothesizing generalizing features. 2025

  16. [16]

    Hierarchical reinforcement learning with the maxq value function decomposition

    Thomas G Dietterich. Hierarchical reinforcement learning with the maxq value function decomposition. Journal of artificial intelligence research, 13:227–303, 2000

  17. [17]

    Revisiting fundamentals of experience replay

    William Fedus, Prajit Ramachandran, Rishabh Agarwal, Yoshua Bengio, Hugo Larochelle, Mark Rowland, and Will Dabney. Revisiting fundamentals of experience replay. In International conference on machine learning, pages 3061–3071. PMLR, 2020

  18. [18]

    Multi-Level Discovery of Deep Options

    Roy Fox, Sanjay Krishnan, Ion Stoica, and Ken Goldberg. Multi-level discovery of deep options. arXiv preprint arXiv:1703.08294, 2017

  19. [19]

    Unsupervised zero-shot reinforcement learning via functional reward encodings

    Kevin Frans, Seohong Park, Pieter Abbeel, and Sergey Levine. Unsupervised zero-shot reinforcement learning via functional reward encodings. In Ruslan Salakhutdinov, Zico Kolter, Katherine Heller, Adrian Weller, Nuria Oliver, Jonathan Scarlett, and Felix Berkenkamp, editors, Proceedings of the 41st International Conference on Machine Learning, volume 2...

  20. [20]

    D4RL: Datasets for Deep Data-Driven Reinforcement Learning

    Justin Fu, Aviral Kumar, Ofir Nachum, George Tucker, and Sergey Levine. D4RL: Datasets for deep data-driven reinforcement learning. arXiv preprint arXiv:2004.07219, 2020

  21. [21]

    Hierarchical skills for efficient exploration

    Jonas Gehring, Gabriel Synnaeve, Andreas Krause, and Nicolas Usunier. Hierarchical skills for efficient exploration. Advances in Neural Information Processing Systems, 34:11553–11564, 2021

  22. [22]

    One act play: Single demonstration behavior cloning with action chunking transformers

    Abraham George and Amir Barati Farimani. One act play: Single demonstration behavior cloning with action chunking transformers. arXiv preprint arXiv:2309.10175, 2023.

  23. [23]

    Emaq: Expected-max q-learning operator for simple yet effective offline and online rl

    Seyed Kamyar Seyed Ghasemipour, Dale Schuurmans, and Shixiang Shane Gu. Emaq: Expected-max q-learning operator for simple yet effective offline and online rl. In International Conference on Machine Learning, pages 3682–3691. PMLR, 2021

  24. [24]

    Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor

    Tuomas Haarnoja, Aurick Zhou, Pieter Abbeel, and Sergey Levine. Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor. In International conference on machine learning, pages 1861–1870. PMLR, 2018

  25. [25]

    Rainbow: Combining improvements in deep reinforcement learning

    Matteo Hessel, Joseph Modayil, Hado Van Hasselt, Tom Schaul, Georg Ostrovski, Will Dabney, Dan Horgan, Bilal Piot, Mohammad Azar, and David Silver. Rainbow: Combining improvements in deep reinforcement learning. In Proceedings of the AAAI conference on artificial intelligence, volume 32, 2018

  26. [26]

    Kl divergence of max-of-n, 2023

    Jacob Hilton. Kl divergence of max-of-n, 2023. URL https://www.jacobh.co.uk/bon_kl.pdf

  27. [27]

    Unsupervised behavior extraction via random intent priors

    Hao Hu, Yiqin Yang, Jianing Ye, Ziqing Mai, and Chongjie Zhang. Unsupervised behavior extraction via random intent priors. In Thirty-seventh Conference on Neural Information Processing Systems, 2023. URL https://openreview.net/forum?id=4vGVQVz5KG

  28. [28]

    Recurrent experience replay in distributed reinforcement learning

    Steven Kapturowski, Georg Ostrovski, John Quan, Remi Munos, and Will Dabney. Recurrent experience replay in distributed reinforcement learning. In International conference on learning representations, 2018

  29. [29]

    Variational temporal abstraction

    Taesup Kim, Sungjin Ahn, and Yoshua Bengio. Variational temporal abstraction. Advances in Neural Information Processing Systems, 32, 2019

  30. [30]

    Policy search for motor primitives in robotics

    Jens Kober and Jan Peters. Policy search for motor primitives in robotics. Advances in neural information processing systems, 21, 2008

  31. [31]

    Autonomous robot skill acquisition

    George Dimitri Konidaris. Autonomous robot skill acquisition. University of Massachusetts Amherst, 2011

  32. [32]

    Offline Reinforcement Learning with Implicit Q-Learning

    Ilya Kostrikov, Ashvin Nair, and Sergey Levine. Offline reinforcement learning with implicit Q-learning. arXiv preprint arXiv:2110.06169, 2021

  33. [33]

    Revisiting peng’s q (λ) for modern reinforcement learning

    Tadashi Kozuno, Yunhao Tang, Mark Rowland, Rémi Munos, Steven Kapturowski, Will Dabney, Michal Valko, and David Abel. Revisiting peng’s q (λ) for modern reinforcement learning. In International Conference on Machine Learning, pages 5794–5804. PMLR, 2021

  34. [34]

    Hierarchical deep reinforcement learning: Integrating temporal abstraction and intrinsic motivation

    Tejas D Kulkarni, Karthik Narasimhan, Ardavan Saeedi, and Josh Tenenbaum. Hierarchical deep reinforcement learning: Integrating temporal abstraction and intrinsic motivation. Advances in neural information processing systems, 29, 2016

  35. [35]

    Conservative Q-learning for offline reinforcement learning

    Aviral Kumar, Aurick Zhou, George Tucker, and Sergey Levine. Conservative Q-learning for offline reinforcement learning. Advances in Neural Information Processing Systems, 33: 1179–1191, 2020

  36. [36]

    Offline-to-online reinforcement learning via balanced replay and pessimistic Q-ensemble

    Seunghyun Lee, Younggyo Seo, Kimin Lee, Pieter Abbeel, and Jinwoo Shin. Offline-to-online reinforcement learning via balanced replay and pessimistic Q-ensemble. In Conference on Robot Learning, pages 1702–1712. PMLR, 2022

  37. [37]

    TOP-ERL: Transformer-based off-policy episodic reinforcement learning

    Ge Li, Dong Tian, Hongyi Zhou, Xinkai Jiang, Rudolf Lioutikov, and Gerhard Neumann. TOP-ERL: Transformer-based off-policy episodic reinforcement learning. In The Thirteenth International Conference on Learning Representations, 2025. URL https://openreview.net/forum?id=N4NhVN30ph

  38. [38]

    Accelerating exploration with unlabeled prior data

    Qiyang Li, Jason Zhang, Dibya Ghosh, Amy Zhang, and Sergey Levine. Accelerating exploration with unlabeled prior data. Advances in Neural Information Processing Systems, 36, 2024

  39. [39]

    Continuous control with deep reinforcement learning

    Timothy P Lillicrap, Jonathan J Hunt, Alexander Pritzel, Nicolas Heess, Tom Erez, Yuval Tassa, David Silver, and Daan Wierstra. Continuous control with deep reinforcement learning. arXiv preprint arXiv:1509.02971, 2015

  40. [40]

    Flow Straight and Fast: Learning to Generate and Transfer Data with Rectified Flow

    Xingchao Liu, Chengyue Gong, and Qiang Liu. Flow straight and fast: Learning to generate and transfer data with rectified flow. arXiv preprint arXiv:2209.03003, 2022

  41. [41]

    Finetuning from offline reinforcement learning: Challenges, trade-offs and practical solutions

    Yicheng Luo, Jackie Kay, Edward Grefenstette, and Marc Peter Deisenroth. Finetuning from offline reinforcement learning: Challenges, trade-offs and practical solutions. arXiv preprint arXiv:2303.17396, 2023

  42. [42]

    What Matters in Learning from Offline Human Demonstrations for Robot Manipulation

    Ajay Mandlekar, Danfei Xu, Josiah Wong, Soroush Nasiriany, Chen Wang, Rohun Kulkarni, Li Fei-Fei, Silvio Savarese, Yuke Zhu, and Roberto Martín-Martín. What matters in learning from offline human demonstrations for robot manipulation. arXiv preprint arXiv:2108.03298, 2021

  43. [43]

    Dynamic abstraction in reinforcement learning via clustering

    Shie Mannor, Ishai Menache, Amit Hoze, and Uri Klein. Dynamic abstraction in reinforcement learning via clustering. In Proceedings of the twenty-first international conference on Machine learning, page 71, 2004

  44. [44]

    Q-cut—dynamic discovery of sub-goals in reinforcement learning

    Ishai Menache, Shie Mannor, and Nahum Shimkin. Q-cut—dynamic discovery of sub-goals in reinforcement learning. In Machine Learning: ECML 2002: 13th European Conference on Machine Learning Helsinki, Finland, August 19–23, 2002 Proceedings 13, pages 295–306. Springer, 2002

  45. [45]

    Neural probabilistic motor primitives for humanoid control

    Josh Merel, Leonard Hasenclever, Alexandre Galashov, Arun Ahuja, Vu Pham, Greg Wayne, Yee Whye Teh, and Nicolas Heess. Neural probabilistic motor primitives for humanoid control. arXiv preprint arXiv:1811.11711, 2018

  46. [46]

    Asynchronous methods for deep reinforcement learning

    Volodymyr Mnih, Adria Puigdomenech Badia, Mehdi Mirza, Alex Graves, Timothy Lillicrap, Tim Harley, David Silver, and Koray Kavukcuoglu. Asynchronous methods for deep reinforcement learning. In International conference on machine learning, pages 1928–1937. PMLR, 2016

  47. [47]

    Data-efficient hierarchical reinforcement learning

    Ofir Nachum, Shixiang Shane Gu, Honglak Lee, and Sergey Levine. Data-efficient hierarchical reinforcement learning. Advances in neural information processing systems, 31, 2018

  48. [48]

    AWAC: Accelerating Online Reinforcement Learning with Offline Datasets

    Ashvin Nair, Abhishek Gupta, Murtaza Dalal, and Sergey Levine. AWAC: Accelerating online reinforcement learning with offline datasets. arXiv preprint arXiv:2006.09359, 2020

  49. [49]

    Cal-QL: Calibrated offline RL pre-training for efficient online fine-tuning

    Mitsuhiko Nakamoto, Simon Zhai, Anikait Singh, Max Sobol Mark, Yi Ma, Chelsea Finn, Aviral Kumar, and Sergey Levine. Cal-QL: Calibrated offline RL pre-training for efficient online fine-tuning. Advances in Neural Information Processing Systems, 36, 2024

  50. [50]

    Learning and retrieval from prior data for skill-based imitation learning

    Soroush Nasiriany, Tian Gao, Ajay Mandlekar, and Yuke Zhu. Learning and retrieval from prior data for skill-based imitation learning. In Conference on Robot Learning, 2022

  51. [51]

    Value prediction network

    Junhyuk Oh, Satinder Singh, and Honglak Lee. Value prediction network. Advances in neural information processing systems, 30, 2017

  52. [52]

    Probabilistic movement primitives

    Alexandros Paraschos, Christian Daniel, Jan R Peters, and Gerhard Neumann. Probabilistic movement primitives. Advances in neural information processing systems, 26, 2013

  53. [53]

    Ogbench: Benchmarking offline goal-conditioned rl

    Seohong Park, Kevin Frans, Benjamin Eysenbach, and Sergey Levine. Ogbench: Benchmarking offline goal-conditioned rl. ArXiv, 2024

  54. [54]

    OGBench: Benchmarking Offline Goal-Conditioned RL, February 2025

    Seohong Park, Kevin Frans, Benjamin Eysenbach, and Sergey Levine. Ogbench: Benchmarking offline goal-conditioned rl. arXiv preprint arXiv:2410.20092, 2024

  55. [55]

    Foundation policies with Hilbert representations

    Seohong Park, Tobias Kreiman, and Sergey Levine. Foundation policies with Hilbert representations. In Forty-first International Conference on Machine Learning, 2024. URL https://openreview.net/forum?id=LhNsSaAKub

  56. [56]

    Flow Q-learning

    Seohong Park, Qiyang Li, and Sergey Levine. Flow Q-learning. arXiv preprint arXiv:2502.02538, 2025

  57. [57]

    Deeploco: Dynamic locomotion skills using hierarchical deep reinforcement learning

    Xue Bin Peng, Glen Berseth, KangKang Yin, and Michiel Van De Panne. Deeploco: Dynamic locomotion skills using hierarchical deep reinforcement learning. ACM Transactions on Graphics (TOG), 36(4):1–13, 2017

  58. [58]

    Accelerating reinforcement learning with learned skill priors

    Karl Pertsch, Youngwoon Lee, and Joseph Lim. Accelerating reinforcement learning with learned skill priors. In Conference on robot learning, pages 188–204. PMLR, 2021

  59. [59]

    Diffusion Policy Policy Optimization

    Allen Z Ren, Justin Lidard, Lars L Ankile, Anthony Simeonov, Pulkit Agrawal, Anirudha Majumdar, Benjamin Burchfiel, Hongkai Dai, and Max Simchowitz. Diffusion policy policy optimization. arXiv preprint arXiv:2409.00588, 2024

  60. [60]

    Learning by playing solving sparse reward tasks from scratch

    Martin Riedmiller, Roland Hafner, Thomas Lampe, Michael Neunert, Jonas Degrave, Tom Wiele, Vlad Mnih, Nicolas Heess, and Jost Tobias Springenberg. Learning by playing solving sparse reward tasks from scratch. In International conference on machine learning, pages 4344–4353. PMLR, 2018

  61. [61]

    Dynamic movement primitives-a framework for motor control in humans and humanoid robotics

    Stefan Schaal. Dynamic movement primitives-a framework for motor control in humans and humanoid robotics. In Adaptive motion of animals and machines, pages 261–280. Springer, 2006

  62. [62]

    Mastering atari, go, chess and shogi by planning with a learned model

    Julian Schrittwieser, Ioannis Antonoglou, Thomas Hubert, Karen Simonyan, Laurent Sifre, Simon Schmitt, Arthur Guez, Edward Lockhart, Demis Hassabis, Thore Graepel, et al. Mastering atari, go, chess and shogi by planning with a learned model. Nature, 588(7839):604–609, 2020

  63. [63]

    Reinforcement learning with action sequence for data-efficient robot learning

    Younggyo Seo and Pieter Abbeel. Reinforcement learning with action sequence for data-efficient robot learning. arXiv preprint arXiv:2411.12155, 2024

  64. [64]

    Continuous control with coarse-to-fine reinforcement learning

    Younggyo Seo, Jafar Uruç, and Stephen James. Continuous control with coarse-to-fine reinforcement learning. In 8th Annual Conference on Robot Learning, 2024. URL https://openreview.net/forum?id=WjDR48cL3O

  65. [65]

    Learning robot skills with temporal variational inference

    Tanmay Shankar and Abhinav Gupta. Learning robot skills with temporal variational inference. In International Conference on Machine Learning, pages 8624–8633. PMLR, 2020

  66. [66]

    Using relative novelty to identify useful temporal abstractions in reinforcement learning

    Özgür Şimşek and Andrew G Barto. Using relative novelty to identify useful temporal abstractions in reinforcement learning. In Proceedings of the twenty-first international conference on Machine learning, page 95, 2004

  67. [67]

    Özgür Şimşek and Andrew G. Barto. Betweenness centrality as a basis for forming skills. Working paper, University of Massachusetts Amherst, April 2007

  68. [68]

    Parrot: Data-driven behavioral priors for reinforcement learning

    Avi Singh, Huihan Liu, Gaoyue Zhou, Albert Yu, Nicholas Rhinehart, and Sergey Levine. Parrot: Data-driven behavioral priors for reinforcement learning. In International Conference on Learning Representations, 2021. URL https://openreview.net/forum?id=Ysuv-WOFeKR

  69. [69]

    Hybrid RL: Using both offline and online data can make RL efficient

    Yuda Song, Yifei Zhou, Ayush Sekhari, Drew Bagnell, Akshay Krishnamurthy, and Wen Sun. Hybrid RL: Using both offline and online data can make RL efficient. In The Eleventh International Conference on Learning Representations, 2023. URL https://openreview.net/forum?id=yyBis80iUuU

  70. [70]

    Option discovery in hierarchical reinforcement learning using spatio-temporal clustering

    Aravind Srinivas, Ramnandan Krishnamurthy, Peeyush Kumar, and Balaraman Ravindran. Option discovery in hierarchical reinforcement learning using spatio-temporal clustering. arXiv preprint arXiv:1605.05359, 2016

  71. [71]

    Learning to summarize with human feedback

    Nisan Stiennon, Long Ouyang, Jeffrey Wu, Daniel Ziegler, Ryan Lowe, Chelsea Voss, Alec Radford, Dario Amodei, and Paul F Christiano. Learning to summarize with human feedback. Advances in neural information processing systems, 33:3008–3021, 2020

  72. [72]

    Reinforcement learning: An introduction, volume 1

    Richard S Sutton, Andrew G Barto, et al. Reinforcement learning: An introduction, volume 1. MIT press Cambridge, 1998

  73. [73]

    Between MDPs and semi-MDPs: A framework for temporal abstraction in reinforcement learning

    Richard S Sutton, Doina Precup, and Satinder Singh. Between MDPs and semi-MDPs: A framework for temporal abstraction in reinforcement learning. Artificial intelligence, 112(1-2): 181–211, 1999

  74. [74]

    Revisiting the minimalist approach to offline reinforcement learning

    Denis Tarasov, Vladislav Kurenkov, Alexander Nikulin, and Sergey Kolesnikov. Revisiting the minimalist approach to offline reinforcement learning. Advances in Neural Information Processing Systems, 36, 2024

  75. [75]

    Chunking the critic: A transformer-based soft actor-critic with n-step returns

    Dong Tian, Ge Li, Hongyi Zhou, Onur Celik, and Gerhard Neumann. Chunking the critic: A transformer-based soft actor-critic with n-step returns. arXiv preprint arXiv:2503.03660, 2025

  76. [76]

    Does zero-shot reinforcement learning exist?

    Ahmed Touati, Jérémy Rapin, and Yann Ollivier. Does zero-shot reinforcement learning exist? In The Eleventh International Conference on Learning Representations, 2022

  77. [77]

    Strategic attentive writer for learning macro-actions

    Alexander Vezhnevets, Volodymyr Mnih, Simon Osindero, Alex Graves, Oriol Vinyals, John Agapiou, et al. Strategic attentive writer for learning macro-actions. Advances in neural information processing systems, 29, 2016

  78. [78]

    Feudal networks for hierarchical reinforcement learning

    Alexander Sasha Vezhnevets, Simon Osindero, Tom Schaul, Nicolas Heess, Max Jaderberg, David Silver, and Koray Kavukcuoglu. Feudal networks for hierarchical reinforcement learning. In International conference on machine learning, pages 3540–3549. PMLR, 2017

  79. [79]

    Train once, get a family: State-adaptive balances for offline-to-online reinforcement learning

    Shenzhi Wang, Qisen Yang, Jiawei Gao, Matthieu Lin, Hao Chen, Liwei Wu, Ning Jia, Shiji Song, and Gao Huang. Train once, get a family: State-adaptive balances for offline-to-online reinforcement learning. Advances in Neural Information Processing Systems, 36:47081–47104, 2023

  80. [80]

    Learning from delayed rewards

    Christopher John Cornish Hellaby Watkins et al. Learning from delayed rewards. 1989
