
arxiv: 2606.26175 · v1 · pith:GZ6GYSFJnew · submitted 2026-06-24 · 💻 cs.RO

RMTL: Reinforced Micro-task Learning for Long-Horizon Manipulation with VLM Rewards

Pith reviewed 2026-06-26 02:05 UTC · model grok-4.3

classification 💻 cs.RO
keywords reinforcement learning · robotic manipulation · vision-language models · micro-task decomposition · hierarchical policies · reward design · Fetch environment

The pith

Decomposing long-horizon robotic tasks into micro-tasks with separate VLM prompts produces non-flat rewards that speed up reinforcement learning.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that one global text prompt fed to a vision-language model yields reward signals that stay near zero for most of a long manipulation sequence with random starts. Breaking the same task into three micro-tasks, each with its own short prompt, lets the model score progress at every stage and supplies an averaged multi-view signal. The agent first learns under a simple rule that picks the active micro-task, then a learned manager replaces the rule. Experiments on the Fetch arm indicate faster policy improvement than the single-prompt baseline. If correct, this decomposition removes the need for dense manual rewards or large demonstration sets when training language-guided robot controllers.

Core claim

RMTL decomposes a manipulation task into a small set of language-described micro-tasks and trains the agent to switch between them. At each step the agent receives a multi-view VLM reward computed using the prompt of the currently active micro-task. A reverse curriculum gradually hardens the initial conditions while a PPO worker is first trained with a fixed distance-based rule that selects the active micro-task; this rule is later replaced by a learned hierarchical manager. The method uses three short stage-specific prompts without further tuning and produces more informative reward signals than a single global prompt.
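The rule-based phase selection described above can be sketched as follows. This is a minimal illustration, not the authors' rule: the stage names (`reach`, `grasp`, `place`), the prompt texts, the thresholds, and the table height are all hypothetical, since the source only states that a fixed distance-based rule selects among three micro-tasks.

```python
import math

# Hypothetical stage prompts; the source does not list the actual three prompts.
PROMPTS = {
    "reach": "the robot gripper is next to the red cube",
    "grasp": "the robot gripper is holding the red cube",
    "place": "the red cube is at the target location",
}


def select_micro_task(gripper, cube, table_height=0.42, near=0.05, lift=0.02):
    """Return the active micro-task name from simple geometric tests.

    gripper, cube: (x, y, z) positions in metres.
    table_height, near, lift: assumed values chosen for illustration only.
    """
    if math.dist(gripper, cube) > near:
        return "reach"  # gripper is still far from the cube
    if cube[2] - table_height < lift:
        return "grasp"  # gripper is at the cube, cube is still on the table
    return "place"      # cube is lifted: carry it to the goal


def active_prompt(gripper, cube):
    """Prompt that the VLM reward would be computed against at this step."""
    return PROMPTS[select_micro_task(gripper, cube)]
```

Under this reading, swapping the rule for a learned manager would only change how the stage name is chosen; the reward would still be keyed on the prompt of whichever micro-task is active.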

What carries the argument

Micro-task decomposition that assigns a distinct language prompt to each stage so the VLM can compute a non-flat, view-averaged reward for the active stage only.
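A view-averaged reward for the active stage might look like the sketch below. It assumes the per-view score is a cosine similarity between an image embedding and the embedding of the active prompt (a common CLIP-style choice, not confirmed by the source), and it takes the embeddings as plain lists of floats; the encoders that would produce them are not shown. The function names are hypothetical.

```python
import math
from statistics import fmean


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))


def multi_view_reward(view_embeddings, prompt_embedding):
    """Average the per-view similarity to the active micro-task prompt.

    view_embeddings: one image embedding per camera view of the same frame.
    prompt_embedding: text embedding of the currently active stage prompt.
    """
    scores = [cosine_similarity(v, prompt_embedding) for v in view_embeddings]
    return fmean(scores)


# Toy 2-D embeddings purely for illustration; real embeddings are high-dimensional.
prompt = [1.0, 0.0]
views = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
reward = multi_view_reward(views, prompt)
```

Averaging means a single occluded view lowers the score without zeroing it, which matches the stated motivation of reducing view-specific occlusion effects.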

If this is right

  • VLM rewards become usable for early phases of long sequences instead of remaining flat.
  • A learned manager can replace an initial rule-based phase selector while preserving performance.
  • The approach requires no prompt retuning once the three stage prompts are chosen.
  • Single-prompt VLM rewards remain too coarse for long-horizon tasks with varied starts.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same micro-task split could be tested with other vision-language reward models to check whether the gain depends on the particular VLM used for scoring.
  • Adding more micro-tasks might extend the method to sequences longer than those tested in Fetch.
  • The hierarchical manager learned here could be reused as a starting point for other manipulation problems that share similar stage structure.

Load-bearing premise

Three short stage-specific prompts can be written once, without further tuning, such that the VLM produces sufficiently informative and non-flat rewards for each micro-task across randomized initial conditions.

What would settle it

Training curves on the Fetch environment with randomized starts that show identical learning speed and final success rate for RMTL and the single-prompt VLM baseline.
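One way to make this test concrete is sketched below. The criterion is an assumption, since the source does not define what counts as "identical": learning speed is taken as the first env-step count at which the success rate reaches a threshold, and the threshold and tolerances are hypothetical parameters.

```python
def steps_to_threshold(env_steps, success_rate, threshold):
    """First env-step count at which success_rate reaches threshold, else None."""
    for step, rate in zip(env_steps, success_rate):
        if rate >= threshold:
            return step
    return None


def curves_match(curve_a, curve_b, threshold=0.8, step_tol=0, rate_tol=0.02):
    """Compare two training curves, each given as (env_steps, success_rate).

    Returns True when both reach the threshold at the same step (within
    step_tol) and their final success rates differ by at most rate_tol.
    """
    t_a = steps_to_threshold(curve_a[0], curve_a[1], threshold)
    t_b = steps_to_threshold(curve_b[0], curve_b[1], threshold)
    if t_a is None or t_b is None:
        same_speed = t_a is t_b  # equal only if neither curve reaches the threshold
    else:
        same_speed = abs(t_a - t_b) <= step_tol
    same_final = abs(curve_a[1][-1] - curve_b[1][-1]) <= rate_tol
    return same_speed and same_final
```

In practice the comparison would be run over several seeds per method, with the tolerances set from the seed-to-seed spread.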

Figures

Figures reproduced from arXiv: 2606.26175 by Anıl Can Ateş, Cihan Topal, Orhan Kahraman.

Figure 1: RMTL at a glance. A single task-level VLM prompt can produce a nearly flat reward over early parts of a long-horizon manipulation trajectory. RMTL instead decomposes the task into stage-specific language prompts and evaluates the agent using only the currently active micro-task prompt. This produces more progress-aligned reward variation within each stage.
Figure 2: RMTL data flow. Solid arrows are data; dashed arrows are selection/control. The phase …
Figure 3: Per-view vs. aggregate reward on one frame. Six viewpoints of the same scene yield different VLM scores; their mean is the training signal.
Figure 4: Visual-domain modifications applied to the FetchPickAndPlace-v4 renderer. The manipulated cube is rendered in saturated red, the table is given a wooden material, and the background is set to neutral grey, with lighting tuned for contrast.
Figure 5: Micro-task prompts vs. single-prompt ablation. Success rate for full RMTL, multi-task prompts + multi-view, and single global prompt + multi-view, with all other RL settings held fixed.
Figure 6: Reverse-curriculum progression: env steps per level and rolling per-level success rate during Stage-1 training.
Figure 7: Behavioural comparison on FetchPickAndPlace-v4 (1000 paired-seed episodes, random init). (a) Grasp-detection success rate; (b–c) behavioural breakdown over the same episodes. Manager–rule agreement: ~73%.
Figure 8: Multi-view reward over two episodes. Each sub-figure shows, top to bottom: multi-view mean vs. single wide_view baseline; episode-rescaled VLM similarity vs. a dense −∥g − o∥ reward; raw VLM similarity.
Figure 9: Stage-2 HRL data flow with a frozen worker. Solid arrows are online data; dashed arrows are offline / regularisation signals. ManagerNet (MLP [128, 128], REINFORCE-trained) is the only new learned component; the worker is loaded from the Stage-1 PPO 560k checkpoint and its actor is frozen for the first 50,000 env steps (critic remains trainable). The distance-based rule selector is used only offline (to la…
Figure 10: Effect of α on the temporal reward profile, one representative episode. (a) Raw active-micro-task VLM reward for α ∈ [0, 1]: higher α stretches the reward to a larger absolute range (better signal-to-noise) but does not change the shape of the trajectory. (b) Each curve rescaled to [−1, 0] via per-α min–max: the rescaled trajectories cluster tightly, confirming that α primarily controls magnitude rather t…
Figure 11: Effect of α on the state-conditioned reward, aggregated over 10 analysis episodes. Each curve is the per-bin mean of the rescaled active-micro-task VLM reward; one curve per α ∈ {0, 0.1, . . . , 1} (colourbar). Left: monotone climb in gripper–cube distance — the dominant state dimension — with the α family tightly bundled. Right: object-lift height. The sharp drop at h ≈ 0 is the table–air boundary; the o…
Original abstract

Reinforcement learning (RL) for robotic manipulation often requires manually designing a dense reward function, which is difficult to tune and often fragile, or learning a reward from human demonstrations or preferences, which can be expensive. A recent line of work uses pretrained vision-language models (VLMs) as zero-shot reward models, replacing these costs with a single text prompt. However, we argue that a single global prompt is too coarse for long-horizon manipulation tasks with randomized initial conditions. The single-prompt VLM reward is near-flat for much of the trajectory, making early progress hard for the agent to detect. We propose Reinforced Micro-Task Learning (RMTL), an approach that decomposes a manipulation task into a small set of language-described micro-tasks and trains the agent to switch between them. At each step, the agent receives a multi-view VLM reward computed using the prompt of the currently active micro-task and averaged across multiple camera views to reduce the effect of view-specific occlusions. A reverse curriculum gradually exposes the agent to harder initial conditions, while a PPO worker is first trained with a fixed distance-based rule that selects the active micro-task. We then replace this rule with a learned hierarchical manager, turning rule-based phase selection into a fully learned hierarchical policy. We instantiate RMTL on the Fetch manipulation environment using three short stage-specific prompts and without additional prompt tuning. Experiments show that RMTL provides more informative reward signals than single-prompt VLM rewards, enabling faster learning. These results suggest that decomposing VLM rewards into micro-task-specific language prompts can substantially improve the scalability of language-guided reinforcement learning for robotic manipulation.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper proposes Reinforced Micro-Task Learning (RMTL) to address flat VLM rewards in long-horizon robotic manipulation. It decomposes tasks into three language-described micro-tasks, computes multi-view VLM rewards using stage-specific prompts, employs a reverse curriculum and an initial distance-based rule for micro-task selection before replacing it with a learned hierarchical manager, and reports faster learning than single-prompt baselines on the Fetch environment without additional prompt tuning.

Significance. If the empirical claims hold, the decomposition approach could improve the practicality of zero-shot VLM rewards for complex manipulation by providing denser signals, potentially scaling language-guided RL beyond short-horizon tasks.

major comments (2)
  1. [Abstract] The central claim that 'decomposing VLM rewards into micro-task-specific language prompts' yields 'more informative reward signals' and 'faster learning' rests on the unverified assumption that three fixed prompts produce non-flat rewards under randomized initial conditions; no reward density statistics, histograms, or ablation tables are provided to confirm this.
  2. [Abstract] The initial training uses a distance-based rule rather than VLM rewards, so the contribution of the prompt decomposition to learning is only tested after the manager is learned; without separate curves isolating the VLM component or controls for curriculum effects, attribution to the micro-task decomposition is not established.
minor comments (1)
  1. [Abstract] The Fetch environment and number of views used for averaging are not specified, which limits reproducibility assessment.

Simulated Author's Rebuttal

2 responses · 0 unresolved

Thank you for the constructive comments on our manuscript. We address each major comment point by point below.

  1. Referee: [Abstract] The central claim that 'decomposing VLM rewards into micro-task-specific language prompts' yields 'more informative reward signals' and 'faster learning' rests on the unverified assumption that three fixed prompts produce non-flat rewards under randomized initial conditions; no reward density statistics, histograms, or ablation tables are provided to confirm this.

    Authors: We agree that the manuscript does not provide direct reward density statistics, histograms, or ablation tables to verify non-flat rewards from the micro-task prompts. The claim of more informative signals is supported indirectly by the faster learning curves relative to the single-prompt baseline. To address this, we will add reward signal analysis or visualizations in the revised manuscript.

    Revision: yes

  2. Referee: [Abstract] The initial training uses a distance-based rule rather than VLM rewards, so the contribution of the prompt decomposition to learning is only tested after the manager is learned; without separate curves isolating the VLM component or controls for curriculum effects, attribution to the micro-task decomposition is not established.

    Authors: We clarify that VLM rewards (computed with the active micro-task prompt) are used from the start of training; the distance-based rule governs only the selection of the active micro-task in the initial phase, prior to training the learned hierarchical manager. The reverse curriculum is applied consistently. We acknowledge that this setup leaves potential confounding from the rule-based selection and curriculum, and that isolating curves or additional controls would better establish attribution to the decomposition. We will incorporate further discussion or ablations in the revision.

    Revision: partial

Circularity Check

0 steps flagged

No circularity: empirical method with external VLM evaluation


The paper presents an empirical RL approach that decomposes tasks into micro-tasks and uses fixed VLM prompts for rewards, validated via experiments on the Fetch environment. No equations, fitted parameters, or self-citations reduce any performance claim to a quantity defined by construction within the paper. The central results rely on observed learning curves rather than any self-referential derivation.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review supplies no concrete free parameters, axioms, or invented entities; all such elements would require the full methods and experimental sections.


