RMTL: Reinforced Micro-task Learning for Long-Horizon Manipulation with VLM Rewards

Anıl Can Ateş; Cihan Topal; Orhan Kahraman

arxiv: 2606.26175 · v1 · pith:GZ6GYSFJnew · submitted 2026-06-24 · 💻 cs.RO

Pith reviewed 2026-06-26 02:05 UTC · model grok-4.3

classification 💻 cs.RO

keywords reinforcement learning · robotic manipulation · vision-language models · micro-task decomposition · hierarchical policies · reward design · Fetch environment

The pith

Decomposing long-horizon robotic tasks into micro-tasks with separate VLM prompts produces non-flat rewards that speed up reinforcement learning.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper argues that one global text prompt fed to a vision-language model yields reward signals that stay nearly flat for most of a long manipulation sequence with random starts. Breaking the same task into three micro-tasks, each with its own short prompt, lets the model score progress at every stage and supplies an averaged multi-view signal. The agent first learns under a simple rule that picks the active micro-task, then a learned manager replaces the rule. Experiments on the Fetch arm indicate faster policy improvement than the single-prompt baseline. If the result holds, this decomposition could reduce the need for dense manual rewards or large demonstration sets when training language-guided robot controllers.

Core claim

RMTL decomposes a manipulation task into a small set of language-described micro-tasks and trains the agent to switch between them. At each step the agent receives a multi-view VLM reward computed using the prompt of the currently active micro-task. A reverse curriculum gradually hardens the initial conditions while a PPO worker is first trained with a fixed distance-based rule that selects the active micro-task; this rule is later replaced by a learned hierarchical manager. The method uses three short stage-specific prompts without further tuning and produces more informative reward signals than a single global prompt.
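
The fixed rule that selects the active micro-task in the first training stage can be pictured with a minimal sketch. The paper says only that the rule is distance-based and that there are three micro-tasks; the stage names, the `cube_grasped` flag, the threshold value, and the function name below are all assumptions made for illustration, not the authors' implementation.

```python
import math

# Hypothetical stage names: the paper states only that there are three micro-tasks.
STAGES = ("reach", "grasp", "place")


def select_micro_task(gripper_pos, cube_pos, cube_grasped, reach_threshold=0.05):
    """Pick the active micro-task from simple geometric state.

    reach_threshold (metres) and the cube_grasped flag are assumed inputs,
    not values or signals taken from the paper.
    """
    gripper_to_cube = math.dist(gripper_pos, cube_pos)
    if cube_grasped:
        return "place"   # cube is held: the remaining job is to move it to the goal
    if gripper_to_cube > reach_threshold:
        return "reach"   # gripper is still far from the cube
    return "grasp"       # gripper is at the cube but not yet holding it


if __name__ == "__main__":
    print(select_micro_task((0.0, 0.0, 0.0), (0.3, 0.0, 0.0), cube_grasped=False))
```

Whatever the real thresholds are, the point of such a rule is that it needs no learning, so the worker can be trained against a stable stage signal before the learned manager takes over.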

What carries the argument

Micro-task decomposition that assigns a distinct language prompt to each stage so the VLM can compute a non-flat, view-averaged reward for the active stage only.
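
A minimal sketch of a view-averaged reward for the active stage, assuming a CLIP-style cosine similarity between an image embedding and a prompt embedding. The embeddings here are toy vectors standing in for encoder outputs, and the function names and stage keys are hypothetical; the paper's actual encoder and scoring details are not reproduced.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def multi_view_reward(view_embeddings, prompt_embeddings, active_stage):
    """Mean image-text similarity over camera views, using the active prompt only."""
    text = prompt_embeddings[active_stage]
    scores = [cosine_similarity(img, text) for img in view_embeddings]
    return sum(scores) / len(scores)


if __name__ == "__main__":
    # Toy 2-D embeddings: one view matches the prompt, one is orthogonal to it.
    views = [(1.0, 0.0), (0.0, 1.0)]
    prompts = {"reach": (1.0, 0.0), "grasp": (0.0, 1.0), "place": (1.0, 1.0)}
    print(multi_view_reward(views, prompts, "reach"))
```

Only the prompt of the active stage enters the score, so a view that is uninformative for that stage lowers the mean rather than flattening the whole signal.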

If this is right

VLM rewards become usable for early phases of long sequences instead of remaining flat.
A learned manager can replace an initial rule-based phase selector while preserving performance.
The approach requires no prompt retuning once the three stage prompts are chosen.
Single-prompt VLM rewards remain too coarse for long-horizon tasks with varied starts.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same micro-task split could be tested with other vision-language reward models to check whether the gain is specific to the particular VLM used.
Adding more micro-tasks might extend the method to sequences longer than those tested in Fetch.
The hierarchical manager learned here could be reused as a starting point for other manipulation problems that share similar stage structure.

Load-bearing premise

Three short stage-specific prompts can be written once, without further tuning, such that the VLM produces sufficiently informative and non-flat rewards for each micro-task across randomized initial conditions.

What would settle it

Training curves on the Fetch environment with randomized starts that show identical learning speed and final success rate for RMTL and the single-prompt VLM baseline.

Figures

Figures reproduced from arXiv: 2606.26175 by Anıl Can Ateş, Cihan Topal, Orhan Kahraman.

**Figure 1.** RMTL at a glance. A single task-level VLM prompt can produce a nearly flat reward over early parts of a long-horizon manipulation trajectory. RMTL instead decomposes the task into stage-specific language prompts and evaluates the agent using only the currently active micro-task prompt. This produces more progress-aligned reward variation within each stage.

**Figure 2.** RMTL data flow. Solid arrows are data; dashed arrows are selection/control. The phase …

**Figure 3.** Per-view vs. aggregate reward on one frame. Six viewpoints of the same scene yield different VLM scores; their mean is the training signal.

**Figure 4.** Visual-domain modifications applied to the FetchPickAndPlace-v4 renderer. The manipulated cube is rendered in saturated red, the table is given a wooden material, and the background is set to neutral grey, with lighting tuned for contrast.

**Figure 5.** Micro-task prompts vs. single-prompt ablation. Success rate for full RMTL, multi-task prompts + multi-view, and single global prompt + multi-view, with all other RL settings held fixed.

**Figure 6.** Reverse-curriculum progression: env steps per level and rolling per-level success rate during Stage-1 training.

**Figure 7.** Behavioural comparison on FetchPickAndPlace-v4 (1000 paired-seed episodes, random init). (a) Grasp-detection success rate; (b–c) behavioural breakdown over the same episodes. Manager–rule agreement: ~73%.

**Figure 8.** Multi-view reward over two episodes. Each sub-figure shows, top to bottom: multi-view mean vs. single wide_view baseline; episode-rescaled VLM similarity vs. a dense −∥g − o∥ reward; raw VLM similarity.

**Figure 9.** Stage-2 HRL data flow with a frozen worker. Solid arrows are online data; dashed arrows are offline / regularisation signals. ManagerNet (MLP [128, 128], REINFORCE-trained) is the only new learned component; the worker is loaded from the Stage-1 PPO 560k checkpoint and its actor is frozen for the first 50,000 env steps (critic remains trainable). The distance-based rule selector is used only offline (to la…

**Figure 10.** Effect of α on the temporal reward profile, one representative episode. (a) Raw active-micro-task VLM reward for α ∈ [0, 1]: higher α stretches the reward to a larger absolute range (better signal-to-noise) but does not change the shape of the trajectory. (b) Each curve rescaled to [−1, 0] via per-α min–max: the rescaled trajectories cluster tightly, confirming that α primarily controls magnitude rather t…

**Figure 11.** Effect of α on the state-conditioned reward, aggregated over 10 analysis episodes. Each curve is the per-bin mean of the rescaled active-micro-task VLM reward; one curve per α ∈ {0, 0.1, . . . , 1} (colourbar). Left: monotone climb in gripper–cube distance — the dominant state dimension — with the α family tightly bundled. Right: object-lift height. The sharp drop at h ≈ 0 is the table–air boundary; the o…
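
The min–max rescaling to [−1, 0] mentioned in the Figure 10 caption can be sketched as follows. This is an illustrative reading of that caption, not the authors' code: the function name is hypothetical, and returning zeros for a perfectly flat curve is an assumed convention.

```python
def rescale_min_max(rewards):
    """Map a reward trajectory onto [-1, 0]: its minimum goes to -1, its maximum to 0."""
    lo, hi = min(rewards), max(rewards)
    if hi == lo:
        # Assumed convention: a flat curve carries no shape information.
        return [0.0 for _ in rewards]
    return [(r - hi) / (hi - lo) for r in rewards]


if __name__ == "__main__":
    small = [1.0, 2.0, 3.0]
    large = [2.0, 4.0, 6.0]  # same shape, twice the magnitude
    print(rescale_min_max(small))
    print(rescale_min_max(large))
```

Two curves that differ only in magnitude map to the same rescaled trajectory, which is why tightly clustered rescaled curves suggest a parameter that changes scale rather than shape.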

Abstract

Reinforcement learning (RL) for robotic manipulation often requires manually designing a dense reward function, which is difficult to tune and often fragile, or learning a reward from human demonstrations or preferences, which can be expensive. A recent line of work uses pretrained vision-language models (VLMs) as zero-shot reward models, replacing these costs with a single text prompt. However, we argue that a single global prompt is too coarse for long-horizon manipulation tasks with randomized initial conditions. The single-prompt VLM reward is near-flat for much of the trajectory, making early progress hard for the agent to detect. We propose Reinforced Micro-Task Learning (RMTL), an approach that decomposes a manipulation task into a small set of language-described micro-tasks and trains the agent to switch between them. At each step, the agent receives a multi-view VLM reward computed using the prompt of the currently active micro-task and averaged across multiple camera views to reduce the effect of view-specific occlusions. A reverse curriculum gradually exposes the agent to harder initial conditions, while a PPO worker is first trained with a fixed distance-based rule that selects the active micro-task. We then replace this rule with a learned hierarchical manager, turning rule-based phase selection into a fully learned hierarchical policy. We instantiate RMTL on the Fetch manipulation environment using three short stage-specific prompts and without additional prompt tuning. Experiments show that RMTL provides more informative reward signals than single-prompt VLM rewards, enabling faster learning. These results suggest that decomposing VLM rewards into micro-task-specific language prompts can substantially improve the scalability of language-guided reinforcement learning for robotic manipulation.
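
One way to picture a reverse curriculum driven by a rolling success rate is the sketch below. The number of levels, the window length, the advancement threshold, and the class name are all assumptions chosen for illustration; the abstract does not state how the authors decide when to harden the initial conditions.

```python
from collections import deque


class ReverseCurriculum:
    """Level 0 is the easiest start; the last level is assumed to be fully randomized."""

    def __init__(self, num_levels=5, window=20, advance_at=0.8):
        self.level = 0
        self.num_levels = num_levels
        self.advance_at = advance_at          # assumed success-rate threshold
        self.recent = deque(maxlen=window)    # rolling record of episode outcomes

    def record(self, success):
        """Log one episode outcome and return the level for the next episode."""
        self.recent.append(bool(success))
        window_full = len(self.recent) == self.recent.maxlen
        rate = sum(self.recent) / len(self.recent)
        can_advance = self.level < self.num_levels - 1
        if window_full and rate >= self.advance_at and can_advance:
            self.level += 1
            self.recent.clear()               # measure the new level from scratch
        return self.level


if __name__ == "__main__":
    curriculum = ReverseCurriculum(num_levels=3, window=4, advance_at=0.75)
    for outcome in [True, True, True, True]:
        curriculum.record(outcome)
    print(curriculum.level)
```

Clearing the window after each promotion keeps success on an easier level from counting toward the next one.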

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note

Private letter to a colleague

RMTL shows micro-task decomposition with stage-specific VLM prompts can yield denser rewards than single-prompt baselines on Fetch, plus a clean switch from rule-based to learned manager.

The main thing here is that breaking a long manipulation task into three language-described micro-tasks, each with its own short VLM prompt, produces more informative rewards than one global prompt. They add multi-view averaging to cut view-specific noise, a reverse curriculum, and they start with a distance-based rule for phase selection before replacing it with a learned hierarchical manager.

The combination is new relative to the single-prompt VLM baselines they cite. The staged prompting plus the explicit rule-to-learned transition is a practical incremental step that directly targets the flat-reward problem for randomized initial conditions. The paper does a clean job laying out why a single prompt fails on long horizons and showing a low-cost way to get denser signals without demonstrations or extra tuning.

The soft spots are in the evidence. The abstract asserts faster learning and non-flat rewards, but the strength of that claim rests on whether the three fixed prompts actually stay informative across randomized Fetch starts; if any prompt still produces near-constant output, the decomposition does not solve the identified problem. Starting with a rule-based manager also means the VLM component is only tested after the manager is already trained, so clean ablations would be needed to attribute gains to the prompt decomposition rather than the curriculum or averaging. The weakest assumption is that three short prompts written once will work without further adjustment under randomization.

This is for robotic RL researchers who are already using VLMs for rewards and want a concrete way to scale to longer tasks. A reader focused on practical reward engineering would find the method useful.

It deserves peer review because the problem is real, the method is straightforward to implement, and the experiments are on a standard environment even if the results need close scrutiny.

Referee Report

2 major / 1 minor

Summary. The paper proposes Reinforced Micro-Task Learning (RMTL) to address flat VLM rewards in long-horizon robotic manipulation. It decomposes tasks into three language-described micro-tasks, computes multi-view VLM rewards using stage-specific prompts, employs a reverse curriculum and an initial distance-based rule for micro-task selection before replacing it with a learned hierarchical manager, and reports faster learning than single-prompt baselines on the Fetch environment without additional prompt tuning.

Significance. If the empirical claims hold, the decomposition approach could improve the practicality of zero-shot VLM rewards for complex manipulation by providing denser signals, potentially scaling language-guided RL beyond short-horizon tasks.

major comments (2)

[Abstract] The central claim that 'decomposing VLM rewards into micro-task-specific language prompts' yields 'more informative reward signals' and 'faster learning' rests on the unverified assumption that three fixed prompts produce non-flat rewards under randomized initial conditions; no reward density statistics, histograms, or ablation tables are provided to confirm this.

[Abstract] The initial training uses a distance-based rule rather than VLM rewards, so the contribution of the prompt decomposition to learning is only tested after the manager is learned; without separate curves isolating the VLM component or controls for curriculum effects, attribution to the micro-task decomposition is not established.

minor comments (1)

[Abstract] The Fetch environment and number of views used for averaging are not specified, which limits reproducibility assessment.

Simulated Author's Rebuttal

2 responses · 0 unresolved

Thank you for the constructive comments on our manuscript. We address each major comment point by point below.

Referee: [Abstract] The central claim that 'decomposing VLM rewards into micro-task-specific language prompts' yields 'more informative reward signals' and 'faster learning' rests on the unverified assumption that three fixed prompts produce non-flat rewards under randomized initial conditions; no reward density statistics, histograms, or ablation tables are provided to confirm this.

Authors: We agree that the manuscript does not provide direct reward density statistics, histograms, or ablation tables to verify non-flat rewards from the micro-task prompts. The claim of more informative signals is supported indirectly by the faster learning curves relative to the single-prompt baseline. To address this, we will add reward signal analysis or visualizations in the revised manuscript.

Revision: yes

Referee: [Abstract] The initial training uses a distance-based rule rather than VLM rewards, so the contribution of the prompt decomposition to learning is only tested after the manager is learned; without separate curves isolating the VLM component or controls for curriculum effects, attribution to the micro-task decomposition is not established.

Authors: We clarify that VLM rewards (computed with the active micro-task prompt) are used from the start of training; the distance-based rule governs only the selection of the active micro-task in the initial phase, prior to training the learned hierarchical manager. The reverse curriculum is applied consistently. We acknowledge that this setup leaves potential confounding from the rule-based selection and curriculum, and that isolating curves or additional controls would better establish attribution to the decomposition. We will incorporate further discussion or ablations in the revision.

Revision: partial

Circularity Check

0 steps flagged

No circularity: empirical method with external VLM evaluation

The paper presents an empirical RL approach that decomposes tasks into micro-tasks and uses fixed VLM prompts for rewards, validated via experiments on the Fetch environment. No equations, fitted parameters, or self-citations reduce any performance claim to a quantity defined by construction within the paper. The central results rely on observed learning curves rather than any self-referential derivation.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review supplies no concrete free parameters, axioms, or invented entities; all such elements would require the full methods and experimental sections.

pith-pipeline@v0.9.1-grok · 5839 in / 1118 out tokens · 25755 ms · 2026-06-26T02:05:36.959479+00:00 · methodology

