Direct Preference Optimization for Primitive-Enabled Hierarchical RL: A Bilevel Approach

Amrit Singh Bedi; Anit Kumar Sahu; Brian M. Sadler; Derrik E. Asher; Mubarak Shah; Souradip Chakraborty; Utsav Singh; Vinay P. Namboodiri; Wesley A. Suttle

arxiv: 2411.00361 · v4 · submitted 2024-11-01 · 💻 cs.LG

Direct Preference Optimization for Primitive-Enabled Hierarchical RL: A Bilevel Approach

Utsav Singh , Souradip Chakraborty , Wesley A. Suttle , Brian M. Sadler , Derrik E. Asher , Anit Kumar Sahu , Mubarak Shah , Vinay P. Namboodiri

show 1 more author

Amrit Singh Bedi

This is my paper

Pith reviewed 2026-05-23 18:27 UTC · model grok-4.3

classification 💻 cs.LG

keywords hierarchical reinforcement learningdirect preference optimizationbilevel optimizationnon-stationaritysubgoal feasibilityrobotic benchmarksgoal-conditioned policiespreference-based learning

0 comments

The pith

DIPPER applies direct preference optimization to higher-level policies in a bilevel formulation of hierarchical reinforcement learning to overcome non-stationarity and infeasible subgoals.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces DIPPER to solve two key problems in hierarchical reinforcement learning: the non-stationarity that arises when the lower-level policy changes during training, destabilizing the higher level, and the proposal of subgoals that the lower level cannot achieve. It does this by casting the problem as bilevel optimization and training the higher-level policy with direct preference optimization on comparisons between sequences of subgoals, which remain stationary even as the lower policy evolves. A regularization term based on the lower-level value function further pushes the higher level toward only feasible subgoals. If successful, this would let agents tackle longer, more complex tasks in robotics and elsewhere by making hierarchical decomposition more reliable without constant retraining interference between levels.

Core claim

DIPPER formulates goal-conditioned hierarchical reinforcement learning as a bi-level optimization problem where the higher-level policy is trained using direct preference optimization on stationary preference comparisons over subgoal sequences rather than non-stationary rewards, combined with lower-level value function regularization to promote achievable subgoals, resulting in improved performance on robotic navigation and manipulation tasks.

What carries the argument

Bilevel optimization formulation of goal-conditioned HRL that trains the higher-level policy via direct preference optimization on stationary subgoal-sequence preferences, augmented by lower-level value function regularization.

If this is right

Stationary preference comparisons allow higher-level learning to proceed independently of lower-level policy changes.
Lower-level value function regularization reduces generation of infeasible subgoals by the higher level.
Two new quantitative metrics can verify mitigation of non-stationarity and infeasible subgoal problems.
Empirical gains reach up to 40 percent over prior baselines on robotic navigation and manipulation benchmarks.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The bilevel DPO structure may transfer to other settings where one policy level adapts faster than another, such as options frameworks with changing primitives.
Collecting the required preference labels over subgoal sequences will determine practical cost, since the method assumes such comparisons are easier to obtain than stationary rewards.
If value regularization reliably signals feasibility, it opens a route to replace hand-crafted subgoal constraints with learned value estimates in other hierarchical methods.

Load-bearing premise

Preference comparisons over subgoal sequences remain stationary and independent of lower-level policy evolution, and lower-level value regularization alone suffices to keep proposed subgoals feasible without new instabilities.

What would settle it

Training curves in which higher-level policy updates continue to track lower-level policy changes despite the use of preferences, or in which the two new metrics show no reduction in non-stationarity or infeasible subgoals when the value regularization term is ablated.

Figures

Figures reproduced from arXiv: 2411.00361 by Amrit Singh Bedi, Anit Kumar Sahu, Brian M. Sadler, Derrik E. Asher, Mubarak Shah, Souradip Chakraborty, Utsav Singh, Vinay P. Namboodiri, Wesley A. Suttle.

**Figure 1.** Figure 1: DIPPER Overview: (left) In vanilla HRL, the higher level predicts subgoals gt and gets the environment reward that depend on the lower primitive behavior, which causes non-stationarity in HRL. Also, the higher level may predict infeasible subgoals that are too hard for lower primitive. (middle) In DIPPER, the lower level value function VπL is leveraged to condition higher level policy into predicting feasi… view at source ↗

**Figure 2.** Figure 2: Success Rate plots. This figure illustrates the success rates across four sparse-reward maze navigation and robotic manipulation tasks, where the solid lines represent the mean, and the shaded areas denote the standard deviation across 5 different seeds. We evaluate DIPPER against several baselines. Although HAC, SAGA and RAPS outperform DIPPER in the easier maze task, they fail to perform well in other ch… view at source ↗

**Figure 3.** Figure 3: Subgoal Distance Metric. This figure compares DIPPER with DIPPER-No-V, HAC, RAPS, HIER baselines, based on average distance between subgoals predicted by the higher level policy and subgoals achieved by the lower level primitive. DIPPER consistently generates low average distance values, which implies that in DIPPER, the higher level policy generates achievable subgoals that induce optimal lower primitive … view at source ↗

**Figure 4.** Figure 4: Lower Q-Function Metric. This figure compares DIPPER with DIPPER-No-V, HAC, RAPS, HIER baselines, based on average lower level Q function values for the subgoals predicted by the higher level policy. DIPPER consistently leads to large Q-function values, thus inducing optimal lower policy behavior and predicting feasible subgoals. Thus, DIPPER is able to mitigate non-stationary in HRL, while generating feas… view at source ↗

**Figure 5.** Figure 5: Regularization weight ablation. This figure depicts the success rate performance for varying values of the primitive regularization weight λ. When λ is too small, we loose the benefits of primitive-informed regularization resulting in poor performance, whereas too large λ values can lead to degenerate solutions. Hence, selecting appropriate λ value is essential for accurate subgoal prediction and enhancing… view at source ↗

**Figure 6.** Figure 6: Max-ent parameter ablation. This figure illustrates the success rate performance for different values of the max-ent parameter β hyper-parameter. This parameter controls the exploration in maximum-entropy formulation. If β is too large, the higher-level policy may perform extensive exploration but stay away from optimal subgoal prediction, whereas if β is too small, the higher-level might not explore and p… view at source ↗

**Figure 7.** Figure 7: Maze navigation task visualization: The visualization is a successful attempt at performing maze navigation task [PITH_FULL_IMAGE:figures/full_fig_p017_7.png] view at source ↗

**Figure 8.** Figure 8: Pick and place task visualization: This figure provides visualization of a successful attempt at performing pick and place task 17 [PITH_FULL_IMAGE:figures/full_fig_p017_8.png] view at source ↗

**Figure 9.** Figure 9: Push task visualization: The visualization is a successful attempt at performing push task [PITH_FULL_IMAGE:figures/full_fig_p018_9.png] view at source ↗

**Figure 10.** Figure 10: Kitchen task visualization: The visualization is a successful attempt at performing kitchen task 18 [PITH_FULL_IMAGE:figures/full_fig_p018_10.png] view at source ↗

read the original abstract

Hierarchical reinforcement learning (HRL) enables agents to solve complex, long-horizon tasks by decomposing them into manageable sub-tasks. However, HRL methods face two fundamental challenges: (i) non-stationarity caused by the evolving lower-level policy during training, which destabilizes higher-level learning, and (ii) the generation of infeasible subgoals that lower-level policies cannot achieve. To address these challenges, we introduce DIPPER, a novel HRL framework that formulates goal-conditioned HRL as a bi-level optimization problem and leverages direct preference optimization (DPO) to train the higher-level policy. By learning from stationary preference comparisons over subgoal sequences rather than rewards that depend on the evolving lower-level policy, DIPPER mitigates the impact of non-stationarity on hierarchical learning. To address infeasible subgoals, DIPPER incorporates lower-level value function regularization that encourages the higher-level policy to propose achievable subgoals. We also introduce two novel metrics to quantitatively verify that DIPPER mitigates non-stationarity and infeasible subgoal generation issues in HRL. We perform empirical evaluations on challenging robotic navigation and manipulation benchmarks and show that DIPPER achieves upto 40% improvements over state-of-the-art baselines, demonstrating that preference-based methods can effectively alleviate persistent challenges in hierarchical

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

DIPPER puts DPO on subgoal-sequence preferences inside a bi-level HRL loop to cut non-stationarity and adds value regularization for feasible subgoals, but the stationarity claim rests on an assumption that needs explicit checking.

read the letter

DIPPER frames goal-conditioned HRL as bi-level optimization and trains the higher-level policy with DPO on preference comparisons over subgoal sequences instead of non-stationary rewards. It adds lower-level value regularization to discourage infeasible subgoals and introduces two metrics to track whether those problems are reduced. On robotic navigation and manipulation tasks it reports gains up to 40 percent over baselines. That combination and the new metrics are the concrete additions relative to prior separate uses of DPO and bi-level HRL. The empirical results are the clearest positive signal; they suggest the approach can be made to work on the targeted benchmarks. The stationarity argument is the soft spot. Preference labels over subgoal sequences are generated by evaluating whether a sequence supports task success, and that evaluation depends on how well the current lower-level policy can execute each subgoal. When the lower-level policy is updated concurrently, the same sequence can receive different labels, so the claimed independence from lower-level evolution is not automatic. The value regularization addresses feasibility but does not obviously freeze the preference data. The abstract supplies no equations for the bi-level objective, no description of how the preference dataset is built or refreshed, and no error bars or statistical tests on the reported gains. Those gaps make the central claims hard to assess from the given text alone. The paper is written for researchers already working on hierarchical RL or preference-based methods for long-horizon control. A reader who needs concrete ways to stabilize high-level learning or wants the new diagnostic metrics would find material to test. It is coherent enough on its own terms to merit referee time; the stationarity issue and the missing implementation details are the natural points for review rather than reasons to desk-reject.

Referee Report

2 major / 2 minor

Summary. The paper proposes DIPPER, a bi-level optimization framework for goal-conditioned hierarchical RL that applies direct preference optimization (DPO) to the higher-level policy using preference comparisons over subgoal sequences. This is intended to mitigate non-stationarity arising from concurrent lower-level policy updates, while lower-level value function regularization encourages feasible subgoals. The work introduces two new metrics to quantify mitigation of non-stationarity and infeasible subgoal issues, and reports empirical gains of up to 40% over baselines on robotic navigation and manipulation benchmarks.

Significance. If the stationarity of the preference dataset and the effectiveness of the regularization can be rigorously established, the approach would offer a practical way to stabilize higher-level learning in HRL without relying on non-stationary rewards. The empirical evaluation on standard robotic benchmarks provides concrete evidence of improvement, and the introduction of quantitative metrics for the two core HRL challenges is a useful contribution for future comparisons. The work applies an existing preference optimization technique to a standard bi-level HRL skeleton rather than deriving new theoretical primitives.

major comments (2)

[Abstract, §3] Abstract and §3 (bi-level formulation): The central claim that 'preference comparisons over subgoal sequences' are stationary and independent of lower-level policy evolution is load-bearing for the non-stationarity mitigation argument. However, generating such preferences requires evaluating whether a subgoal sequence leads to task success, which depends on the lower-level policy's success rate for individual subgoals. If the lower-level policy is updated concurrently (standard in HRL), these success rates change, so the preference labels are not guaranteed to remain stationary. The lower-level value regularization addresses feasibility but does not decouple the labels from lower-level dynamics; a concrete description of the preference dataset construction and whether it is held fixed or regenerated is required to verify the claim.
[§4] §4 (empirical evaluation): The reported 'up to 40% improvements' and the two new metrics for non-stationarity and infeasibility are central to the contribution, yet the abstract and available description supply no error bars, statistical tests, or details on how the metrics are computed from the learned policies. Without these, it is impossible to assess whether the gains are robust or whether the metrics actually isolate the claimed effects versus other factors such as hyperparameter tuning.

minor comments (2)

[Abstract] The abstract sentence is truncated at 'hierarchical'; this should be completed in the final version.
[§3] Notation for the bi-level objective and the DPO loss applied to the higher-level policy should be introduced with explicit equations early in §3 to allow readers to trace how the preference loss replaces the usual non-stationary reward signal.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We address the two major comments point-by-point below. Both points identify areas where additional detail will strengthen the manuscript, and we will revise accordingly.

read point-by-point responses

Referee: [Abstract, §3] Abstract and §3 (bi-level formulation): The central claim that 'preference comparisons over subgoal sequences' are stationary and independent of lower-level policy evolution is load-bearing for the non-stationarity mitigation argument. However, generating such preferences requires evaluating whether a subgoal sequence leads to task success, which depends on the lower-level policy's success rate for individual subgoals. If the lower-level policy is updated concurrently (standard in HRL), these success rates change, so the preference labels are not guaranteed to remain stationary. The lower-level value regularization addresses feasibility but does not decouple the labels from lower-level dynamics; a concrete description of the preference dataset construction and whether it is held fixed or regenerated is required to verify the claim.

Authors: We agree that a precise account of dataset construction is necessary to substantiate the stationarity claim. In DIPPER the preference dataset is generated once, offline, by rolling out subgoal sequences with a fixed snapshot of the lower-level policy and labeling each sequence according to whether it produces task success under that snapshot; the resulting preference pairs are then held fixed for the entire higher-level DPO phase. Because the labels are never regenerated during concurrent lower-level updates, the preference comparisons remain stationary by construction. The value-function regularization operates only on the higher-level objective and does not alter the fixed labels. We will add an explicit subsection in §3 (with pseudocode) describing this offline collection and freezing procedure. revision: yes
Referee: [§4] §4 (empirical evaluation): The reported 'up to 40% improvements' and the two new metrics for non-stationarity and infeasibility are central to the contribution, yet the abstract and available description supply no error bars, statistical tests, or details on how the metrics are computed from the learned policies. Without these, it is impossible to assess whether the gains are robust or whether the metrics actually isolate the claimed effects versus other factors such as hyperparameter tuning.

Authors: We concur that error bars, statistical tests, and explicit metric definitions are required for rigorous assessment. In the revised §4 we will report means and standard errors over at least five independent random seeds, include paired t-test p-values for the performance deltas, and provide the exact formulas used to compute the non-stationarity and infeasibility metrics from policy rollouts (including the precise window sizes and success thresholds employed). These additions will allow readers to evaluate both robustness and the metrics' specificity. revision: yes

Circularity Check

0 steps flagged

No circularity: standard application of DPO inside bi-level HRL with external stationarity assumption

full rationale

The provided abstract and reader summary describe DIPPER as a bi-level formulation that applies existing DPO (imported from prior non-self work) to preference comparisons over subgoal sequences, plus lower-level value regularization. No equations, derivations, or self-citations are quoted that reduce a claimed prediction or uniqueness result to a fitted input or prior self-result by construction. The stationarity claim is presented as an assumption rather than derived from the method itself, and no load-bearing step renames a fit as a prediction or smuggles an ansatz via self-citation. The central claims therefore remain independent of the paper's own fitted quantities.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

Based solely on the abstract, the central claim rests on the domain assumption that preference comparisons can be made stationary with respect to lower-level policy changes and that value regularization will enforce feasible subgoals. No explicit free parameters or invented physical entities are identifiable from the provided text.

axioms (1)

domain assumption Preference comparisons over subgoal sequences remain stationary and independent of the evolving lower-level policy
Invoked to justify replacing reward signals with DPO comparisons for the higher-level policy

pith-pipeline@v0.9.0 · 5798 in / 1378 out tokens · 64535 ms · 2026-05-23T18:27:26.582047+00:00 · methodology

discussion (0)

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

HiPO: Hierarchical Preference Optimization for Adaptive Reasoning in LLMs
cs.AI 2026-04 unverdicted novelty 7.0

HiPO improves LLM reasoning performance by optimizing preferences separately on response segments rather than entire outputs.

Reference graph

Works this paper leans on

38 extracted references · 38 canonical work pages · cited by 1 Pith paper · 5 internal anchors

[1]

Hindsight Experience Replay

Marcin Andrychowicz, Filip Wolski, Alex Ray, Jonas Schneider, Rachel Fong, Peter Welinder, Bob McGrew, Josh Tobin, Pieter Abbeel, and Wojciech Zaremba. Hindsight experience replay. CoRR, abs/1707.01495, 2017. URL http://arxiv.org/abs/1707.01495

work page internal anchor Pith review Pith/arXiv arXiv 2017
[2]

Barto and Sridhar Mahadevan

Andrew G. Barto and Sridhar Mahadevan. Recent advances in hierarchical reinforcement learning. Discrete Event Dynamic Systems, 13:341–379, 2003. 10

work page 2003
[3]

Ralph Allan Bradley and Milton E. Terry. Rank analysis of incomplete block designs: I. the method of paired comparisons. Biometrika, 39:324, 1952. URL https://api. semanticscholar.org/CorpusID:125209808

work page 1952
[4]

Human preference scaling with demonstrations for deep reinforcement learning

Zehong Cao, Kaichiu Wong, and Chin-Teng Lin. Human preference scaling with demonstrations for deep reinforcement learning. arXiv preprint arXiv:2007.12904, 2020

work page arXiv 2007
[5]

Goal-conditioned reinforcement learning with imagined subgoals

Elliot Chane-Sane, Cordelia Schmid, and Ivan Laptev. Goal-conditioned reinforcement learning with imagined subgoals. In International Conference on Machine Learning, pages 1430–1440. PMLR, 2021

work page 2021
[6]

Deep reinforcement learning from human preferences

Paul F Christiano, Jan Leike, Tom Brown, Miljan Martic, Shane Legg, and Dario Amodei. Deep reinforcement learning from human preferences. Advances in neural information processing systems, 30, 2017

work page 2017
[7]

Accelerating robotic reinforcement learning via parameterized action primitives

Murtaza Dalal, Deepak Pathak, and Russ R Salakhutdinov. Accelerating robotic reinforcement learning via parameterized action primitives. Advances in Neural Information Processing Systems, 34:21847–21859, 2021

work page 2021
[8]

Active reward learning with a novel acquisition function

Christian Daniel, Oliver Kroemer, Malte Viering, Jan Metz, and Jan Peters. Active reward learning with a novel acquisition function. Autonomous Robots, 39:389–405, 2015

work page 2015
[9]

Feudal reinforcement learning

Peter Dayan and Geoffrey E Hinton. Feudal reinforcement learning. Advances in neural information processing systems, 5, 1992

work page 1992
[10]

Dietterich

Thomas G. Dietterich. Hierarchical reinforcement learning with the MAXQ value function decomposition. CoRR, cs.LG/9905014, 1999. URL https://arxiv.org/abs/cs/9905014

work page arXiv 1999
[11]

Iq-learn: Inverse soft-q learning for imitation

Divyansh Garg, Shuvam Chakraborty, Chris Cundy, Jiaming Song, and Stefano Ermon. Iq-learn: Inverse soft-q learning for imitation. Advances in Neural Information Processing Systems, 34: 4028–4039, 2021

work page 2021
[12]

Relay policy learning: Solving long-horizon tasks via imitation and reinforcement learning

Abhishek Gupta, Vikash Kumar, Corey Lynch, Sergey Levine, and Karol Hausman. Relay policy learning: Solving long-horizon tasks via imitation and reinforcement learning. arXiv preprint arXiv:1910.11956, 2019

work page arXiv 1910
[14]

URL http://arxiv.org/abs/1801.01290

work page internal anchor Pith review Pith/arXiv arXiv
[15]

When waiting is not an option: Learning options with a deliberation cost

Jean Harb, Pierre-Luc Bacon, Martin Klissarov, and Doina Precup. When waiting is not an option: Learning options with a deliberation cost. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 32, 2018

work page 2018
[16]

Contrastive prefence learning: Learning from human feedback without rl

Joey Hejna, Rafael Rafailov, Harshit Sikchi, Chelsea Finn, Scott Niekum, W Bradley Knox, and Dorsa Sadigh. Contrastive prefence learning: Learning from human feedback without rl. arXiv preprint arXiv:2310.13639, 2023

work page arXiv 2023
[17]

Reward learning from human preferences and demonstrations in atari, 2018

Borja Ibarz, Jan Leike, Tobias Pohlen, Geoffrey Irving, Shane Legg, and Dario Amodei. Reward learning from human preferences and demonstrations in atari, 2018

work page 2018
[18]

Adam: A Method for Stochastic Optimization

Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014

work page internal anchor Pith review Pith/arXiv arXiv 2014
[19]

Interactively shaping agents via human reinforcement: The tamer framework

W Bradley Knox and Peter Stone. Interactively shaping agents via human reinforcement: The tamer framework. In Proceedings of the fifth international conference on Knowledge capture, pages 9–16, 2009

work page 2009
[20]

Discriminator-Actor-Critic: Addressing Sample Inefficiency and Reward Bias in Adversarial Imitation Learning

Ilya Kostrikov, Kumar Krishna Agrawal, Debidatta Dwibedi, Sergey Levine, and Jonathan Tompson. Discriminator-actor-critic: Addressing sample inefficiency and reward bias in adversarial imitation learning. arXiv preprint arXiv:1809.02925, 2018

work page internal anchor Pith review Pith/arXiv arXiv 2018
[21]

Pebble: Feedback-efficient interactive reinforce- ment learning via relabeling experience and unsupervised pre-training, 2021

Kimin Lee, Laura Smith, and Pieter Abbeel. Pebble: Feedback-efficient interactive reinforce- ment learning via relabeling experience and unsupervised pre-training, 2021. 11

work page 2021
[22]

Reinforcement Learning and Control as Probabilistic Inference: Tutorial and Review

Sergey Levine. Reinforcement learning and control as probabilistic inference: Tutorial and review. arXiv preprint arXiv:1805.00909, 2018

work page internal anchor Pith review Pith/arXiv arXiv 2018
[23]

Learning multi-level hierar- chies with hindsight

Andrew Levy, George Konidaris, Robert Platt, and Kate Saenko. Learning multi-level hierar- chies with hindsight. In International Conference on Learning Representations, 2018

work page 2018
[24]

Bome! bilevel optimization made easy: A simple first-order approach

Bo Liu, Mao Ye, Stephen Wright, Peter Stone, and Qiang Liu. Bome! bilevel optimization made easy: A simple first-order approach. Advances in neural information processing systems, 35:17248–17262, 2022

work page 2022
[25]

Data-efficient hierarchical reinforcement learning

Ofir Nachum, Shixiang Shane Gu, Honglak Lee, and Sergey Levine. Data-efficient hierarchical reinforcement learning. Advances in neural information processing systems, 31, 2018

work page 2018
[26]

Why does hierarchy (sometimes) work so well in reinforcement learning? arXiv preprint arXiv:1909.10618, 2019

Ofir Nachum, Haoran Tang, Xingyu Lu, Shixiang Gu, Honglak Lee, and Sergey Levine. Why does hierarchy (sometimes) work so well in reinforcement learning? arXiv preprint arXiv:1909.10618, 2019

work page arXiv 1909
[27]

Augmenting reinforcement learning with behavior primitives for diverse manipulation tasks

Soroush Nasiriany, Huihan Liu, and Yuke Zhu. Augmenting reinforcement learning with behavior primitives for diverse manipulation tasks. CoRR, abs/2110.03655, 2021. URL https://arxiv.org/abs/2110.03655

work page arXiv 2021
[28]

Reinforcement learning with hierarchies of machines

Ronald Parr and Stuart Russell. Reinforcement learning with hierarchies of machines. In M. Jordan, M. Kearns, and S. Solla, editors, Advances in Neural Information Processing Systems, volume 10. MIT Press, 1998

work page 1998
[29]

Online human training of a myoelectric prosthesis controller via actor-critic reinforcement learning

Patrick M Pilarski, Michael R Dawson, Thomas Degris, Farbod Fahimi, Jason P Carey, and Richard S Sutton. Online human training of a myoelectric prosthesis controller via actor-critic reinforcement learning. In 2011 IEEE international conference on rehabilitation robotics, pages 1–7. IEEE, 2011

work page 2011
[30]

David Rein, Betty Li Hou, Asa Cooper Stickland, Jackson Petty, Richard Yuanzhe Pang, Julien Dirani, Julian Michael, and Samuel R

Rafael Rafailov, Joey Hejna, Ryan Park, and Chelsea Finn. From r to qˆ* : Your language model is secretly a q-function. arXiv preprint arXiv:2404.12358, 2024

work page arXiv 2024
[31]

Direct preference optimization: Your language model is secretly a reward model

Rafael Rafailov, Archit Sharma, Eric Mitchell, Christopher D Manning, Stefano Ermon, and Chelsea Finn. Direct preference optimization: Your language model is secretly a reward model. Advances in Neural Information Processing Systems, 36, 2024

work page 2024
[32]

Crisp: Curriculum inducing primitive informed subgoal prediction

Utsav Singh and Vinay P Namboodiri. Crisp: Curriculum inducing primitive informed subgoal prediction. arXiv preprint arXiv:2304.03535, 2023

work page arXiv 2023
[33]

Pear: Primitive enabled adaptive relabeling for boosting hierarchical reinforcement learning

Utsav Singh and Vinay P Namboodiri. Pear: Primitive enabled adaptive relabeling for boosting hierarchical reinforcement learning. arXiv preprint arXiv:2306.06394, 2023

work page arXiv 2023
[34]

Piper: Primitive-informed preference-based hierarchical reinforcement learning via hindsight relabeling

Utsav Singh, Wesley A Suttle, Brian M Sadler, Vinay P Namboodiri, and Amrit Singh Bedi. Piper: Primitive-informed preference-based hierarchical reinforcement learning via hindsight relabeling. arXiv preprint arXiv:2404.13423, 2024

work page arXiv 2024
[35]

Between mdps and semi-mdps: A framework for temporal abstraction in reinforcement learning

Richard S Sutton, Doina Precup, and Satinder Singh. Between mdps and semi-mdps: A framework for temporal abstraction in reinforcement learning. Artificial intelligence, 112(1-2): 181–211, 1999

work page 1999
[36]

Feudal networks for hierarchical reinforcement learning

Alexander Sasha Vezhnevets, Simon Osindero, Tom Schaul, Nicolas Heess, Max Jaderberg, David Silver, and Koray Kavukcuoglu. Feudal networks for hierarchical reinforcement learning. In International Conference on Machine Learning, pages 3540–3549. PMLR, 2017

work page 2017
[37]

State- conditioned adversarial subgoal generation

Vivienne Huiling Wang, Joni Pajarinen, Tinghuai Wang, and Joni-Kristian Kämäräinen. State- conditioned adversarial subgoal generation. In Proceedings of the AAAI conference on artificial intelligence, volume 37, pages 10184–10191, 2023

work page 2023
[38]

A bayesian approach for policy learning from trajectory preference queries

Aaron Wilson, Alan Fern, and Prasad Tadepalli. A bayesian approach for policy learning from trajectory preference queries. Advances in neural information processing systems, 25, 2012

work page 2012
[39]

Modeling purposeful adaptive behavior with the principle of maximum causal entropy

Brian D Ziebart. Modeling purposeful adaptive behavior with the principle of maximum causal entropy. Carnegie Mellon University, 2010. 12 Contents 1 Introduction 1 2 Related Work 3 3 Problem Formulation 3 3.1 Hierarchical Reinforcement Learning ( HRL) . . . . . . . . . . . . . . . . . . . . . 3 3.1.1 Hierarchical Setup . . . . . . . . . . . . . . . . . . ...

work page 2010

[1] [1]

Hindsight Experience Replay

Marcin Andrychowicz, Filip Wolski, Alex Ray, Jonas Schneider, Rachel Fong, Peter Welinder, Bob McGrew, Josh Tobin, Pieter Abbeel, and Wojciech Zaremba. Hindsight experience replay. CoRR, abs/1707.01495, 2017. URL http://arxiv.org/abs/1707.01495

work page internal anchor Pith review Pith/arXiv arXiv 2017

[2] [2]

Barto and Sridhar Mahadevan

Andrew G. Barto and Sridhar Mahadevan. Recent advances in hierarchical reinforcement learning. Discrete Event Dynamic Systems, 13:341–379, 2003. 10

work page 2003

[3] [3]

Ralph Allan Bradley and Milton E. Terry. Rank analysis of incomplete block designs: I. the method of paired comparisons. Biometrika, 39:324, 1952. URL https://api. semanticscholar.org/CorpusID:125209808

work page 1952

[4] [4]

Human preference scaling with demonstrations for deep reinforcement learning

Zehong Cao, Kaichiu Wong, and Chin-Teng Lin. Human preference scaling with demonstrations for deep reinforcement learning. arXiv preprint arXiv:2007.12904, 2020

work page arXiv 2007

[5] [5]

Goal-conditioned reinforcement learning with imagined subgoals

Elliot Chane-Sane, Cordelia Schmid, and Ivan Laptev. Goal-conditioned reinforcement learning with imagined subgoals. In International Conference on Machine Learning, pages 1430–1440. PMLR, 2021

work page 2021

[6] [6]

Deep reinforcement learning from human preferences

Paul F Christiano, Jan Leike, Tom Brown, Miljan Martic, Shane Legg, and Dario Amodei. Deep reinforcement learning from human preferences. Advances in neural information processing systems, 30, 2017

work page 2017

[7] [7]

Accelerating robotic reinforcement learning via parameterized action primitives

Murtaza Dalal, Deepak Pathak, and Russ R Salakhutdinov. Accelerating robotic reinforcement learning via parameterized action primitives. Advances in Neural Information Processing Systems, 34:21847–21859, 2021

work page 2021

[8] [8]

Active reward learning with a novel acquisition function

Christian Daniel, Oliver Kroemer, Malte Viering, Jan Metz, and Jan Peters. Active reward learning with a novel acquisition function. Autonomous Robots, 39:389–405, 2015

work page 2015

[9] [9]

Feudal reinforcement learning

Peter Dayan and Geoffrey E Hinton. Feudal reinforcement learning. Advances in neural information processing systems, 5, 1992

work page 1992

[10] [10]

Dietterich

Thomas G. Dietterich. Hierarchical reinforcement learning with the MAXQ value function decomposition. CoRR, cs.LG/9905014, 1999. URL https://arxiv.org/abs/cs/9905014

work page arXiv 1999

[11] [11]

Iq-learn: Inverse soft-q learning for imitation

Divyansh Garg, Shuvam Chakraborty, Chris Cundy, Jiaming Song, and Stefano Ermon. Iq-learn: Inverse soft-q learning for imitation. Advances in Neural Information Processing Systems, 34: 4028–4039, 2021

work page 2021

[12] [12]

Relay policy learning: Solving long-horizon tasks via imitation and reinforcement learning

Abhishek Gupta, Vikash Kumar, Corey Lynch, Sergey Levine, and Karol Hausman. Relay policy learning: Solving long-horizon tasks via imitation and reinforcement learning. arXiv preprint arXiv:1910.11956, 2019

work page arXiv 1910

[13] [14]

URL http://arxiv.org/abs/1801.01290

work page internal anchor Pith review Pith/arXiv arXiv

[14] [15]

When waiting is not an option: Learning options with a deliberation cost

Jean Harb, Pierre-Luc Bacon, Martin Klissarov, and Doina Precup. When waiting is not an option: Learning options with a deliberation cost. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 32, 2018

work page 2018

[15] [16]

Contrastive prefence learning: Learning from human feedback without rl

Joey Hejna, Rafael Rafailov, Harshit Sikchi, Chelsea Finn, Scott Niekum, W Bradley Knox, and Dorsa Sadigh. Contrastive prefence learning: Learning from human feedback without rl. arXiv preprint arXiv:2310.13639, 2023

work page arXiv 2023

[16] [17]

Reward learning from human preferences and demonstrations in atari, 2018

Borja Ibarz, Jan Leike, Tobias Pohlen, Geoffrey Irving, Shane Legg, and Dario Amodei. Reward learning from human preferences and demonstrations in atari, 2018

work page 2018

[17] [18]

Adam: A Method for Stochastic Optimization

Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014

work page internal anchor Pith review Pith/arXiv arXiv 2014

[18] [19]

Interactively shaping agents via human reinforcement: The tamer framework

W Bradley Knox and Peter Stone. Interactively shaping agents via human reinforcement: The tamer framework. In Proceedings of the fifth international conference on Knowledge capture, pages 9–16, 2009

work page 2009

[19] [20]

Discriminator-Actor-Critic: Addressing Sample Inefficiency and Reward Bias in Adversarial Imitation Learning

Ilya Kostrikov, Kumar Krishna Agrawal, Debidatta Dwibedi, Sergey Levine, and Jonathan Tompson. Discriminator-actor-critic: Addressing sample inefficiency and reward bias in adversarial imitation learning. arXiv preprint arXiv:1809.02925, 2018

work page internal anchor Pith review Pith/arXiv arXiv 2018

[20] [21]

Pebble: Feedback-efficient interactive reinforce- ment learning via relabeling experience and unsupervised pre-training, 2021

Kimin Lee, Laura Smith, and Pieter Abbeel. Pebble: Feedback-efficient interactive reinforce- ment learning via relabeling experience and unsupervised pre-training, 2021. 11

work page 2021

[21] [22]

Reinforcement Learning and Control as Probabilistic Inference: Tutorial and Review

Sergey Levine. Reinforcement learning and control as probabilistic inference: Tutorial and review. arXiv preprint arXiv:1805.00909, 2018

work page internal anchor Pith review Pith/arXiv arXiv 2018

[22] [23]

Learning multi-level hierar- chies with hindsight

Andrew Levy, George Konidaris, Robert Platt, and Kate Saenko. Learning multi-level hierar- chies with hindsight. In International Conference on Learning Representations, 2018

work page 2018

[23] [24]

Bome! bilevel optimization made easy: A simple first-order approach

Bo Liu, Mao Ye, Stephen Wright, Peter Stone, and Qiang Liu. Bome! bilevel optimization made easy: A simple first-order approach. Advances in neural information processing systems, 35:17248–17262, 2022

work page 2022

[24] [25]

Data-efficient hierarchical reinforcement learning

Ofir Nachum, Shixiang Shane Gu, Honglak Lee, and Sergey Levine. Data-efficient hierarchical reinforcement learning. Advances in neural information processing systems, 31, 2018

work page 2018

[25] [26]

Why does hierarchy (sometimes) work so well in reinforcement learning? arXiv preprint arXiv:1909.10618, 2019

Ofir Nachum, Haoran Tang, Xingyu Lu, Shixiang Gu, Honglak Lee, and Sergey Levine. Why does hierarchy (sometimes) work so well in reinforcement learning? arXiv preprint arXiv:1909.10618, 2019

work page arXiv 1909

[26] [27]

Augmenting reinforcement learning with behavior primitives for diverse manipulation tasks

Soroush Nasiriany, Huihan Liu, and Yuke Zhu. Augmenting reinforcement learning with behavior primitives for diverse manipulation tasks. CoRR, abs/2110.03655, 2021. URL https://arxiv.org/abs/2110.03655

work page arXiv 2021

[27] [28]

Reinforcement learning with hierarchies of machines

Ronald Parr and Stuart Russell. Reinforcement learning with hierarchies of machines. In M. Jordan, M. Kearns, and S. Solla, editors, Advances in Neural Information Processing Systems, volume 10. MIT Press, 1998

work page 1998

[28] [29]

Online human training of a myoelectric prosthesis controller via actor-critic reinforcement learning

Patrick M Pilarski, Michael R Dawson, Thomas Degris, Farbod Fahimi, Jason P Carey, and Richard S Sutton. Online human training of a myoelectric prosthesis controller via actor-critic reinforcement learning. In 2011 IEEE international conference on rehabilitation robotics, pages 1–7. IEEE, 2011

work page 2011

[29] [30]

David Rein, Betty Li Hou, Asa Cooper Stickland, Jackson Petty, Richard Yuanzhe Pang, Julien Dirani, Julian Michael, and Samuel R

Rafael Rafailov, Joey Hejna, Ryan Park, and Chelsea Finn. From r to qˆ* : Your language model is secretly a q-function. arXiv preprint arXiv:2404.12358, 2024

work page arXiv 2024

[30] [31]

Direct preference optimization: Your language model is secretly a reward model

Rafael Rafailov, Archit Sharma, Eric Mitchell, Christopher D Manning, Stefano Ermon, and Chelsea Finn. Direct preference optimization: Your language model is secretly a reward model. Advances in Neural Information Processing Systems, 36, 2024

work page 2024

[31] [32]

Crisp: Curriculum inducing primitive informed subgoal prediction

Utsav Singh and Vinay P Namboodiri. Crisp: Curriculum inducing primitive informed subgoal prediction. arXiv preprint arXiv:2304.03535, 2023

work page arXiv 2023

[32] [33]

Pear: Primitive enabled adaptive relabeling for boosting hierarchical reinforcement learning

Utsav Singh and Vinay P Namboodiri. Pear: Primitive enabled adaptive relabeling for boosting hierarchical reinforcement learning. arXiv preprint arXiv:2306.06394, 2023

work page arXiv 2023

[33] [34]

Piper: Primitive-informed preference-based hierarchical reinforcement learning via hindsight relabeling

Utsav Singh, Wesley A Suttle, Brian M Sadler, Vinay P Namboodiri, and Amrit Singh Bedi. Piper: Primitive-informed preference-based hierarchical reinforcement learning via hindsight relabeling. arXiv preprint arXiv:2404.13423, 2024

work page arXiv 2024

[34] [35]

Between mdps and semi-mdps: A framework for temporal abstraction in reinforcement learning

Richard S Sutton, Doina Precup, and Satinder Singh. Between mdps and semi-mdps: A framework for temporal abstraction in reinforcement learning. Artificial intelligence, 112(1-2): 181–211, 1999

work page 1999

[35] [36]

Feudal networks for hierarchical reinforcement learning

Alexander Sasha Vezhnevets, Simon Osindero, Tom Schaul, Nicolas Heess, Max Jaderberg, David Silver, and Koray Kavukcuoglu. Feudal networks for hierarchical reinforcement learning. In International Conference on Machine Learning, pages 3540–3549. PMLR, 2017

work page 2017

[36] [37]

State- conditioned adversarial subgoal generation

Vivienne Huiling Wang, Joni Pajarinen, Tinghuai Wang, and Joni-Kristian Kämäräinen. State- conditioned adversarial subgoal generation. In Proceedings of the AAAI conference on artificial intelligence, volume 37, pages 10184–10191, 2023

work page 2023

[37] [38]

A bayesian approach for policy learning from trajectory preference queries

Aaron Wilson, Alan Fern, and Prasad Tadepalli. A bayesian approach for policy learning from trajectory preference queries. Advances in neural information processing systems, 25, 2012

work page 2012

[38] [39]

Modeling purposeful adaptive behavior with the principle of maximum causal entropy

Brian D Ziebart. Modeling purposeful adaptive behavior with the principle of maximum causal entropy. Carnegie Mellon University, 2010. 12 Contents 1 Introduction 1 2 Related Work 3 3 Problem Formulation 3 3.1 Hierarchical Reinforcement Learning ( HRL) . . . . . . . . . . . . . . . . . . . . . 3 3.1.1 Hierarchical Setup . . . . . . . . . . . . . . . . . . ...

work page 2010