arxiv: 2604.13733 · v1 · submitted 2026-04-15 · 💻 cs.LG · cs.AI· cs.RO

Recognition: unknown

Jump-Start Reinforcement Learning with Vision-Language-Action Regularization

Angelo Moroncelli , Roberto Zanetti , Marco Maccarini , Loris Roveda

Authors on Pith no claims yet

Pith reviewed 2026-05-10 12:59 UTC · model grok-4.3

classification 💻 cs.LG cs.AIcs.RO

keywords reinforcement learningvision-language-actionrobotic manipulationsample efficiencyregularizationsim-to-real transferPPO

0 comments

The pith

Vision-Language-Action models jump-start RL for robots by providing sparse high-level action suggestions that improve early exploration.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper proposes VLAJS to combine vision-language-action models with on-policy reinforcement learning for long-horizon robotic manipulation tasks. VLAs supply transient high-level action suggestions that bias the agent's exploration and credit assignment through a directional consistency regularization added to PPO. Guidance is applied sparsely and annealed over time so the RL agent can adapt and ultimately exceed the VLA policy while retaining high-frequency state-based control. This yields better sample efficiency than standard PPO or distillation baselines across six manipulation tasks in simulation, with successful zero-shot transfer to a real Franka Panda robot under varied conditions.

Core claim

VLAJS treats VLAs as transient sources of high-level action suggestions that bias early exploration and improve credit assignment, while preserving the high-frequency, state-based control of RL. The approach augments PPO with a directional action-consistency regularization that softly aligns the RL agent's actions with VLA guidance during early training without enforcing strict imitation, requiring demonstrations, or relying on continuous teacher queries. VLA guidance is applied sparsely and annealed over time, allowing the agent to adapt online and ultimately surpass the guiding policy.

What carries the argument

Directional action-consistency regularization, which softly aligns the RL agent's actions with sparse VLA suggestions during early training and is annealed to allow the policy to exceed the guide.

If this is right

VLAJS reduces required environment interactions by over 50 percent compared with PPO and distillation baselines on several manipulation tasks.
The learned policies transfer zero-shot from simulation to a real Franka Panda robot.
Execution remains robust under clutter, object variation, and external perturbations.
The RL agent surpasses the VLA policy once guidance is removed after annealing.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The sparse and annealed nature of the regularization could lower the computational cost of querying large VLAs throughout training.
Similar directional regularization might transfer to other sparse-reward domains where a generalist model provides initial high-level bias.
The method opens a route for hybrid systems in which any high-level reasoner, not just VLAs, supplies transient guidance to on-policy RL.

Load-bearing premise

That VLA suggestions stay useful and non-conflicting early in training so the directional regularization can be annealed without causing instability or negative transfer.

What would settle it

Running VLAJS on one of the six tasks and finding that the number of environment steps needed to reach a given success rate is not lower than plain PPO or that performance drops sharply when the regularization is annealed.

Figures

Figures reproduced from arXiv: 2604.13733 by Angelo Moroncelli, Loris Roveda, Marco Maccarini, Roberto Zanetti.

**Figure 1.** Figure 1: Overview of Vision-Language-Action Jump-Starting (VLAJS). The figure illustrates the motivation, method, and outcomes of VLAJS. Left: We highlight suboptimal credit assignment in state-based, on-policy RL, focusing on: long-horizon tasks with extended action sequences and environments with imperfect reward design. Center: VLAJS leverages large-scale VLA pretraining from both real-world and simulation data.… view at source ↗

**Figure 2.** Figure 2: Comparison of guidance strategies in RL. Methods are categorized by guidance type (behavioral vs. auxiliary) and imitation persistence (none, transient, and persistent). Vanilla RL uses no guidance, DAgger-like methods apply persistent behavioral imitation, and policy distillation/RPD rely on persistent auxiliary losses. JSRL provides transient behavioral guidance, while VLAJS introduces transient auxiliar… view at source ↗

**Figure 3.** Figure 3: Guidance mechanisms for exploration in RL. (a) Relies on random exploration. (b) Executes an imitation-learned policy for an initial phase (solid path). (c) Continuously biases learning via a teacher-provided signal (dashed red path) without directly executing actions. A. Preliminaries: PPO for High-Frequency State Control All methods build on Proximal Policy Optimization (PPO) with Generalized Advantage E… view at source ↗

**Figure 4.** Figure 4: Auxiliary guidance during rollouts. (a) The policy generates actions solely through on-policy exploration at a fixed control frequency, learning both direction and action scale incrementally from reward. (b) A teacher provides continuous action targets throughout the rollout, constraining both direction and magnitude and forcing the policy to match the teacher’s action scale (distillation/RPD style). (c) G… view at source ↗

**Figure 5.** Figure 5: Auxiliary losses for VLA-guided RL. (a) Distillation-based methods (e.g., RPD) use an MSE loss that penalizes the full Euclidean distance between policy and teacher actions, constraining both action direction and magnitude. (b) VLAJS instead employs a directional action-consistency loss that penalizes angular misalignment between policy and VLA actions, while remaining invariant to action scale. (c–d) Plot… view at source ↗

**Figure 6.** Figure 6: Simulation and real-world manipulation tasks used in our evaluation. Left: six ManiSkill simulation tasks (PickCube, PickPlaceCube, LiftPegUpright, [PITH_FULL_IMAGE:figures/full_fig_p005_6.png] view at source ↗

**Figure 7.** Figure 7: Learning curves for long-horizon tasks. Sparse RPD makes distillation [PITH_FULL_IMAGE:figures/full_fig_p006_7.png] view at source ↗

**Figure 8.** Figure 8: Learning curves and sample-efficiency comparison for suboptimal reward tasks. VLAJS consistently outperforms PPO and distillation-based baselines— [PITH_FULL_IMAGE:figures/full_fig_p007_8.png] view at source ↗

**Figure 10.** Figure 10: Comparisons on VLA teachers. learning, surprisingly suggesting that VLA performance is not critically important in VLAJS (Fig. 10a). The framework also remains robust to changes in the observation setup (Fig. 10b). VIII. LIMITATIONS While VLAJS improves sample efficiency in difficult creditassignment regimes, it still relies on a VLA teacher that provides at least minimally reliable directional cues. Alt… view at source ↗

**Figure 9.** Figure 9: Policy robustness under external perturbations and clutter. VLAJS [PITH_FULL_IMAGE:figures/full_fig_p008_9.png] view at source ↗

read the original abstract

Reinforcement learning (RL) enables high-frequency, closed-loop control for robotic manipulation, but scaling to long-horizon tasks with sparse or imperfect rewards remains difficult due to inefficient exploration and poor credit assignment. Vision-Language-Action (VLA) models leverage large-scale multimodal pretraining to provide generalist, task-level reasoning, but current limitations hinder their direct use in fast and precise manipulation. In this paper, we propose Vision-Language-Action Jump-Starting (VLAJS), a method that bridges sparse VLA guidance with on-policy RL to improve exploration and learning efficiency. VLAJS treats VLAs as transient sources of high-level action suggestions that bias early exploration and improve credit assignment, while preserving the high-frequency, state-based control of RL. Our approach augments Proximal Policy Optimization (PPO) with a directional action-consistency regularization that softly aligns the RL agent's actions with VLA guidance during early training, without enforcing strict imitation, requiring demonstrations, or relying on continuous teacher queries. VLA guidance is applied sparsely and annealed over time, allowing the agent to adapt online and ultimately surpass the guiding policy. We evaluate VLAJS on six challenging manipulation tasks: lifting, pick-and-place, peg reorientation, peg insertion, poking, and pushing in simulation, and validate a subset on a real Franka Panda robot. VLAJS consistently outperforms PPO and distillation-style baselines in sample efficiency, reducing required environment interactions by over 50% in several tasks. Real-world experiments demonstrate zero-shot sim-to-real transfer and robust execution under clutter, object variation, and external perturbations.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

VLAJS adds a sparse annealed consistency term to PPO using VLA suggestions for early exploration in manipulation, but the gains rest on thin evidence with no ablations or stats shown.

read the letter

The core move is straightforward: take a pretrained VLA, feed it occasional high-level action hints into PPO, and add a soft directional penalty that fades over time so the RL policy can pull ahead. This avoids full imitation or constant teacher queries while trying to fix the usual exploration problems in sparse-reward robot tasks. The approach is new enough in its specific sparse-annealed form, and it correctly targets the gap between VLA's task-level reasoning and RL's need for precise, high-frequency control. They run it on six sim tasks and a few real Franka trials, claiming big drops in interactions and decent zero-shot transfer under clutter and perturbations. That part is useful to see even at a high level, since it shows the idea can move to hardware without extra fine-tuning in the reported cases. The main weakness is that the abstract gives no protocol details, no error bars, no ablation on the annealing rate or sparsity level, and no direct comparison to stronger VLA-RL hybrids already in the literature. The stress-test worry about negative transfer or the agent failing to exceed the VLA prior looks plausible from what's written, because nothing in the description shows the regularization term stays harmless once PPO starts optimizing its own objective. Without those checks, the 50% efficiency claim is hard to trust. This is the kind of paper that would interest people already building hybrid VLA-RL pipelines for manipulation, but only as an early idea rather than a settled method. A serious editor should send it out for review once the authors add the missing controls and code, because the basic regularization trick is simple to test and the problem it attacks is real.

Referee Report

4 major / 3 minor

Summary. The manuscript proposes Vision-Language-Action Jump-Starting (VLAJS), a hybrid method that augments on-policy PPO with a directional action-consistency regularization term derived from sparse, transient queries to a pretrained VLA model. VLA guidance is applied sparsely and annealed over training to bias early exploration and credit assignment in long-horizon sparse-reward robotic manipulation tasks without requiring demonstrations or continuous teacher access. The central empirical claim is that VLAJS consistently outperforms PPO and distillation-style baselines, reducing required environment interactions by over 50% across six simulated tasks (lifting, pick-and-place, peg reorientation, peg insertion, poking, pushing) while enabling zero-shot sim-to-real transfer and robustness on a real Franka Panda robot under clutter and perturbations.

Significance. If the performance and annealing claims hold under rigorous verification, the work provides a concrete, low-overhead mechanism for injecting high-level VLA priors into sample-efficient RL without sacrificing the high-frequency closed-loop control that pure VLA policies currently lack. The real-robot validation and emphasis on sparse guidance are practical strengths that could influence hybrid VLA-RL pipelines for manipulation.

major comments (4)

[§4] §4 (Method), directional action-consistency regularization: the precise mathematical form of the added regularization term (e.g., cosine similarity, KL, or L2 on actions) and its weighting relative to the PPO clipped surrogate are not stated as an equation; without this, it is impossible to evaluate whether the term can conflict with PPO's objective or induce negative transfer once annealing begins.
[§5] §5 (Experiments): the abstract and results claim 'over 50% reduction in required environment interactions' and 'consistent outperformance,' yet no learning curves, success-rate tables, number of random seeds, error bars, or statistical tests (e.g., Welch t-test) are referenced; this directly undermines the sample-efficiency claim that is load-bearing for the paper's contribution.
[§4.2] §4.2 (Annealing schedule): the description states guidance is 'applied sparsely and annealed over time' but supplies neither the functional form of the annealing schedule, the hyperparameter values, nor any ablation on annealing speed or removal timing; this is the exact point raised by the stress-test and is required to substantiate that the RL policy reliably surpasses the VLA prior rather than converging to a suboptimal local regime.
[§5.3] §5.3 (Real-world transfer): zero-shot sim-to-real success is asserted for a subset of tasks under clutter and perturbations, but no quantitative metrics (success rate, number of trials, failure modes) or comparison to a pure VLA baseline on the physical robot are provided, weakening the transfer claim.

minor comments (3)

[Figures] Figure captions and axis labels in the learning-curve plots should explicitly state the performance metric (e.g., success rate vs. environment steps) and whether shaded regions represent standard error or min/max.
[§2] The related-work section should cite the specific VLA models used (e.g., RT-1, OpenVLA) and recent hybrid VLA-RL papers to clarify the precise novelty of the sparse-regularization approach.
[§4] Notation for the regularization coefficient and annealing parameter should be introduced once and used consistently rather than described only in prose.

Simulated Author's Rebuttal

4 responses · 0 unresolved

We thank the referee for the detailed and constructive feedback. We address each major comment point-by-point below. Where the manuscript was incomplete, we will revise accordingly to strengthen the presentation.

read point-by-point responses

Referee: [§4] §4 (Method), directional action-consistency regularization: the precise mathematical form of the added regularization term (e.g., cosine similarity, KL, or L2 on actions) and its weighting relative to the PPO clipped surrogate are not stated as an equation; without this, it is impossible to evaluate whether the term can conflict with PPO's objective or induce negative transfer once annealing begins.

Authors: We agree that an explicit equation was omitted. The directional action-consistency term is a soft regularization L_reg = - (a_π · a_VLA) / (||a_π|| ||a_VLA||) added to the PPO objective as L = L_PPO + λ(t) L_reg, where λ(t) anneals from an initial value to zero. This formulation is compatible with the clipped surrogate and avoids negative transfer by design, as it provides only directional bias rather than hard imitation. We will insert this as Equation (3) in the revised Section 4 with a short compatibility discussion. revision: yes
Referee: [§5] §5 (Experiments): the abstract and results claim 'over 50% reduction in required environment interactions' and 'consistent outperformance,' yet no learning curves, success-rate tables, number of random seeds, error bars, or statistical tests (e.g., Welch t-test) are referenced; this directly undermines the sample-efficiency claim that is load-bearing for the paper's contribution.

Authors: The learning curves (with shaded error bars), success-rate tables, and per-task interaction counts appear in Figure 3 and Table 1, each averaged over 5 random seeds. We will add explicit in-text references to these figures/tables, report the seed count, and include Welch t-test p-values confirming statistical significance of the >50% reduction versus PPO baselines in the revised Section 5. revision: yes
Referee: [§4.2] §4.2 (Annealing schedule): the description states guidance is 'applied sparsely and annealed over time' but supplies neither the functional form of the annealing schedule, the hyperparameter values, nor any ablation on annealing speed or removal timing; this is the exact point raised by the stress-test and is required to substantiate that the RL policy reliably surpasses the VLA prior rather than converging to a suboptimal local regime.

Authors: We will add the precise schedule λ(t) = max(0, 1 - t/T) with T = 50% of total steps, sparsity interval of 10 environment steps, and all hyperparameter values to Section 4.2. An ablation on annealing speed and early removal will also be included to show that the final policy exceeds VLA performance rather than remaining in a local regime. revision: yes
Referee: [§5.3] §5.3 (Real-world transfer): zero-shot sim-to-real success is asserted for a subset of tasks under clutter and perturbations, but no quantitative metrics (success rate, number of trials, failure modes) or comparison to a pure VLA baseline on the physical robot are provided, weakening the transfer claim.

Authors: We will expand Section 5.3 with quantitative success rates (e.g., 18/20 trials for pick-and-place under clutter), trial counts, categorized failure modes, and direct comparison against the pure VLA policy executed on the Franka Panda to substantiate the zero-shot transfer claim. revision: yes

Circularity Check

0 steps flagged

No significant circularity in VLAJS method

full rationale

The paper proposes VLAJS as an empirical augmentation to standard PPO using sparse annealed directional regularization drawn from external pretrained VLA models. No equations, derivations, or claims in the abstract reduce a result to a quantity defined by parameters fitted inside the paper, nor do they rely on self-citation chains or uniqueness theorems that loop back to the authors' prior work. The central performance claims rest on experimental comparisons against PPO and distillation baselines rather than any first-principles derivation that is equivalent to its inputs by construction. This is a self-contained method paper whose load-bearing elements are independent of internal fits or self-referential definitions.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract supplies no explicit free parameters, axioms, or invented entities; the approach extends existing PPO and external VLA models without introducing new postulated components.

pith-pipeline@v0.9.0 · 5598 in / 1210 out tokens · 61574 ms · 2026-05-10T12:59:43.544444+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

46 extracted references · 17 canonical work pages · 13 internal anchors

[1]

Reincarnating reinforcement learning: Reusing prior computation to accelerate progress.Advances in neural information processing systems, 35:28955–28971, 2022

Rishabh Agarwal, Max Schwarzer, Pablo Samuel Castro, Aaron C Courville, and Marc Bellemare. Reincarnating reinforcement learning: Reusing prior computation to accelerate progress.Advances in neural information processing systems, 35:28955–28971, 2022

2022
[2]

Do As I Can, Not As I Say: Grounding Language in Robotic Affordances

Michael Ahn, Anthony Brohan, Noah Brown, Yevgen Chebotar, Omar Cortes, Byron David, Chelsea Finn, Chuyuan Fu, Keerthana Gopalakrishnan, Karol Haus- man, et al. Do as I can, not as I say: Grounding language in robotic affordances.arXiv preprint arXiv:2204.01691, 2022

work page internal anchor Pith review arXiv 2022
[3]

What matters in on-policy reinforcement learning? a large-scale empirical study

Marcin Andrychowicz, Anton Raichuk, Piotr Sta ´nczyk, Manu Orsini, Sertan Girgin, Rapha ¨el Marinier, L ´eonard Hussenot, Matthieu Geist, Olivier Pietquin, Marcin Michalski, et al. What matters in on-policy reinforcement learning? a large-scale empirical study. InICLR 2021- Ninth International Conference on Learning Representa- tions, 2021

2021
[4]

Kevin Black, Noah Brown, Danny Driess, Adnan Esmail, Michael Equi, Chelsea Finn, and Niccolo Fusai et al.π0: A vision-language-action flow model for general robot control.Robotics: Science and Systems XXI, 2025

2025
[5]

RT-1: Robotics Transformer for Real-World Control at Scale

Anthony Brohan, Noah Brown, Justice Carbajal, Yev- gen Chebotar, Joseph Dabis, Chelsea Finn, Keerthana Gopalakrishnan, Karol Hausman, Alex Herzog, Jasmine Hsu, et al. RT-1: Robotics transformer for real-world control at scale.arXiv preprint arXiv:2212.06817, 2022

work page internal anchor Pith review arXiv 2022
[6]

RT-2: Vision-Language-Action Models Transfer Web Knowledge to Robotic Control

Anthony Brohan, Noah Brown, Justice Carbajal, Yevgen Chebotar, Xi Chen, Krzysztof Choromanski, Tianli Ding, Danny Driess, Avinava Dubey, Chelsea Finn, et al. RT-2: Vision-language-action models transfer web knowledge to robotic control.arXiv preprint arXiv:2307.15818, 2023

work page internal anchor Pith review arXiv 2023
[7]

Distilling policy distillation

Wojciech M Czarnecki, Razvan Pascanu, Simon Osin- dero, Siddhant Jayakumar, Grzegorz Swirszcz, and Max Jaderberg. Distilling policy distillation. InThe 22nd international conference on artificial intelligence and statistics, pages 1331–1340. PMLR, 2019

2019
[8]

Danny Driess, Fei Xia, Mehdi S. M. Sajjadi, Corey Lynch, Aakanksha Chowdhery, Brian Ichter, Ayzaan Wahid, Jonathan Tompson, Quan Vuong, Tianhe Yu, et al. PaLM-E: An embodied multimodal language model.arXiv preprint arXiv:2303.03378, 2023

work page internal anchor Pith review arXiv 2023
[9]

2312.07843 , archiveprefix =

Roya Firoozi, Johnathan Tucker, Stephen Tian, Anirudha Majumdar, Jiankai Sun, Weiyu Liu, Yuke Zhu, Shuran Song, Ashish Kapoor, Karol Hausman, Brian Ichter, Danny Driess, Jiajun Wu, Cewu Lu, and Mac Schwager. Foundation models in robotics: Applications, challenges, and the future.arXiv preprint arXiv:2312.07843, 2023

work page arXiv 2023
[10]

Distillation strategies for proximal policy optimization

Sam Green, Craig M Vineyard, and Cetin Kaya Koc ¸. Distillation strategies for proximal policy optimization. arXiv preprint arXiv:1901.08128, 2019

work page arXiv 1901
[11]

Policy shaping: Integrating human feedback with reinforcement learning

Shane Griffith, Kaushik Subramanian, Jonathan Scholz, Charles L Isbell, and Andrea L Thomaz. Policy shaping: Integrating human feedback with reinforcement learning. In C.J. Burges, L. Bottou, M. Welling, Z. Ghahramani, and K.Q. Weinberger, editors,Advances in Neural In- formation Processing Systems, volume 26. Curran Asso- ciates, Inc., 2013

2013
[12]

Deep q-learning from demonstrations

Todd Hester, Matej Vecerik, Olivier Pietquin, Marc Lanc- tot, Tom Schaul, Bilal Piot, Dan Horgan, John Quan, An- drew Sendonaris, Ian Osband, et al. Deep q-learning from demonstrations. InProceedings of the AAAI conference on artificial intelligence, volume 32, 2018

2018
[13]

Toward general-purpose robots via foundation models: A survey & meta-analysis.CoRR, 2023

Yafei Hu et al. Toward general-purpose robots via foundation models: A survey & meta-analysis.CoRR, 2023

2023
[14]

$\pi_{0.5}$: a Vision-Language-Action Model with Open-World Generalization

Physical Intelligence, Kevin Black, Noah Brown, James Darpinian, Karan Dhabalia, Danny Driess, Adnan Es- mail, Michael Equi, Chelsea Finn, Niccolo Fusai, et al. π0.5: a vision-language-action model with open-world generalization.arXiv preprint arXiv:2504.16054, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[15]

Re- fined Policy Distillation: From VLA generalists to RL experts

Tobias J ¨ulg, Wolfram Burgard, and Florian Walter. Re- fined Policy Distillation: From VLA generalists to RL experts. InProc. of the IEEE/RSJ Int. Conf. on Intel- ligent Robots and Systems (IROS), 2025. Accepted for publication

2025
[16]

DROID: A Large-Scale In-The-Wild Robot Manipulation Dataset

Alexander Khazatsky, Karl Pertsch, Suraj Nair, Ash- win Balakrishna, Sudeep Dasari, Siddharth Karam- cheti, Soroush Nasiriany, Mohan Kumar Srirama, Lawrence Yunliang Chen, et al. DROID: A large-scale in-the-wild robot manipulation dataset.arXiv preprint arXiv:2403.12945, 2024

work page internal anchor Pith review arXiv 2024
[17]

OpenVLA: An Open-Source Vision-Language-Action Model

Moo Jin Kim, Karl Pertsch, et al. OpenVLA: An open- source vision-language-action model.arXiv preprint arXiv:2406.09246, 2024

work page internal anchor Pith review arXiv 2024
[18]

Fine- tuning vision-language-action models: Optimizing speed and success.Robotics: Science and Systems XXI, 2025

Moo Jin Kim, Chelsea Finn, and Percy Liang. Fine- tuning vision-language-action models: Optimizing speed and success.Robotics: Science and Systems XXI, 2025

2025
[19]

Imitation and reinforcement learning.IEEE Robotics & Automation Magazine, 17(2): 55–62, 2010

Jens Kober and Jan Peters. Imitation and reinforcement learning.IEEE Robotics & Automation Magazine, 17(2): 55–62, 2010

2010
[20]

SimpleVLA-RL: Scaling VLA Training via Reinforcement Learning

Haozhan Li, Yuxin Zuo, Jiale Yu, Yuhao Zhang, Zhao- hui Yang, Kaiyan Zhang, Xuekai Zhu, Yuchen Zhang, Tianxing Chen, Ganqu Cui, et al. Simplevla-rl: Scaling vla training via reinforcement learning.arXiv preprint arXiv:2509.09674, 2025

work page internal anchor Pith review arXiv 2025
[21]

Gpu- accelerated robotic simulation for distributed reinforce- ment learning

Jacky Liang, Viktor Makoviychuk, Ankur Handa, Nut- tapong Chentanez, Miles Macklin, and Dieter Fox. Gpu- accelerated robotic simulation for distributed reinforce- ment learning. InConference on Robot Learning, pages 270–282. PMLR, 2018

2018
[22]

Guided exploration with proximal policy opti- mization using a single demonstration

Gabriele Libardi, Gianni De Fabritiis, and Sebastian Dittert. Guided exploration with proximal policy opti- mization using a single demonstration. InInternational Conference on Machine Learning, pages 6611–6620. PMLR, 2021

2021
[23]

RDT-1B: a Diffusion Foundation Model for Bimanual Manipulation

Songming Liu, Lingxuan Wu, Bangguo Li, Hengkai Tan, Huayu Chen, Zhengyi Wang, Ke Xu, Hang Su, and Jun Zhu. RDT-1b: a diffusion foundation model for bimanual manipulation.arXiv preprint arXiv:2410.07864, 2024

work page internal anchor Pith review arXiv 2024
[24]

Precise and dexterous robotic manipulation via human- in-the-loop reinforcement learning.Science Robotics, 10 (105):eads5033, 2025

Jianlan Luo, Charles Xu, Jeffrey Wu, and Sergey Levine. Precise and dexterous robotic manipulation via human- in-the-loop reinforcement learning.Science Robotics, 10 (105):eads5033, 2025

2025
[25]

Eureka: Human- level reward design via coding large language models

Yecheng Jason Ma, William Liang, Guanzhi Wang, De- An Huang, Osbert Bastani, Dinesh Jayaraman, Yuke Zhu, Linxi Fan, and Anima Anandkumar. Eureka: Human- level reward design via coding large language models. InThe Twelfth International Conference on Learning Representations, 2024

2024
[26]

The duality of generative ai and reinforcement learning in robotics: A review.Inf

Angelo Moroncelli, Vishal Soni, Marco Forgione, Dario Piga, Blerina Spahiu, and Loris Roveda. The duality of generative ai and reinforcement learning in robotics: A review.Inf. Fusion, 129:104003, 2024

2024
[27]

Learning language-conditioned robot behavior from offline data and crowd-sourced annotation

Suraj Nair, Eric Mitchell, Kevin Chen, Silvio Savarese, Chelsea Finn, et al. Learning language-conditioned robot behavior from offline data and crowd-sourced annotation. InConference on Robot Learning, pages 1303–1315. PMLR, 2022

2022
[28]

Cal-ql: Calibrated offline rl pre-training for efficient online fine-tuning.Advances in Neural Information Processing Systems, 36:62244–62269, 2023

Mitsuhiko Nakamoto, Simon Zhai, Anikait Singh, Max Sobol Mark, Yi Ma, Chelsea Finn, Aviral Kumar, and Sergey Levine. Cal-ql: Calibrated offline rl pre-training for efficient online fine-tuning.Advances in Neural Information Processing Systems, 36:62244–62269, 2023

2023
[29]

Octo: An Open-Source Generalist Robot Policy

Octo Model Team et al. Octo: An open-source generalist robot policy.arXiv preprint arXiv:2405.12213, 2024

work page internal anchor Pith review arXiv 2024
[30]

Open X-Embodiment: Robotic Learning Datasets and RT-X Models

Open X-Embodiment Collaboration, Abigail O’Neill, Amir Rehman, Agrim Gupta, Abhishek Padalkar, Abra- ham Lee, Acorn Pooley, Ajay Mandlekar, Arhan Jain, et al. Open x-embodiment: Robotic learning datasets and RT-X models.arXiv preprint arXiv:2310.08864, 2023

work page internal anchor Pith review arXiv 2023
[31]

A survey of temporal credit assignment in deep reinforcement learning.Transactions on Machine Learning Research, 2024

Eduardo Pignatelli, Johan Ferret, Matthieu Geist, Thomas Mesnard, Hado van Hasselt, and Laura Toni. A survey of temporal credit assignment in deep reinforcement learning.Transactions on Machine Learning Research, 2024

2024
[32]

Learning complex dexterous manipula- tion with deep reinforcement learning and demonstra- tions.Robotics: Science and Systems XIV, 2018

Aravind Rajeswaran, Vikash Kumar, Abhishek Gupta, Giulia Vezzani, John Schulman, Emanuel Todorov, and Sergey Levine. Learning complex dexterous manipula- tion with deep reinforcement learning and demonstra- tions.Robotics: Science and Systems XIV, 2018

2018
[33]

You only look once: Unified, real-time object detection

J Redmon. You only look once: Unified, real-time object detection. InProceedings of the IEEE conference on computer vision and pattern recognition, 2016

2016
[34]

A generalist agent.TMLR, 2022

Scott Reed et al. A generalist agent.TMLR, 2022

2022
[35]

Reinforcement and Imitation Learning via Interactive No-Regret Learning

Stephane Ross and J Andrew Bagnell. Reinforcement and imitation learning via interactive no-regret learning. arXiv preprint arXiv:1406.5979, 2014

work page Pith review arXiv 2014
[36]

A reduction of imitation learning and structured prediction to no-regret online learning

St ´ephane Ross, Geoffrey Gordon, and Drew Bagnell. A reduction of imitation learning and structured prediction to no-regret online learning. InProceedings of the fourteenth international conference on artificial intelli- gence and statistics, pages 627–635. JMLR Workshop and Conference Proceedings, 2011

2011
[37]

Proximal Policy Optimization Algorithms

John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms.arXiv preprint arXiv:1707.06347, 2017

work page internal anchor Pith review Pith/arXiv arXiv 2017
[38]

Animating rotation with quaternion curves.Proceedings of the 12th annual conference on Computer graphics and interactive techniques, 1985

Ken Shoemake. Animating rotation with quaternion curves.Proceedings of the 12th annual conference on Computer graphics and interactive techniques, 1985

1985
[39]

Proximal policy distillation.arXiv preprint arXiv:2407.15134, 2024

Giacomo Spigler. Proximal policy distillation.arXiv preprint arXiv:2407.15134, 2024

work page internal anchor Pith review arXiv 2024
[40]

Sutton and A.G

R.S. Sutton and A.G. Barto. Reinforcement learning: An introduction.IEEE TNN, 9(5):1054–1054, 1998

1998
[41]

Maniskill3: Gpu paral- lelized robot simulation and rendering for generalizable embodied ai

Stone Tao, Fanbo Xiang, Arth Shukla, Yuzhe Qin, Xan- der Hinrichsen, Xiaodi Yuan, Chen Bao, Xinsong Lin, Yulin Liu, Tse-Kai Chan, et al. Maniskill3: Gpu paral- lelized robot simulation and rendering for generalizable embodied ai. In7th Robot Learning Workshop: Towards Robots with Human-Level Abilities, 2025

2025
[42]

Jump-start reinforcement learning

Ikechukwu Uchendu, Ted Xiao, Yao Lu, Banghua Zhu, Mengyuan Yan, Jos ´ephine Simon, Matthew Bennice, Chuyuan Fu, Cong Ma, Jiantao Jiao, et al. Jump-start reinforcement learning. InInternational Conference on Machine Learning, pages 34556–34583. PMLR, 2023

2023
[43]

Leveraging Demonstrations for Deep Reinforcement Learning on Robotics Problems with Sparse Rewards

Mel Vecerik, Todd Hester, Jonathan Scholz, Fumin Wang, Olivier Pietquin, Bilal Piot, Nicolas Heess, Thomas Roth¨orl, Thomas Lampe, and Martin Riedmiller. Leveraging demonstrations for deep reinforcement learn- ing on robotics problems with sparse rewards.arXiv preprint arXiv:1707.08817, 2017

work page Pith review arXiv 2017
[44]

Sapien: A simulated part-based interactive environment

Fanbo Xiang, Yuzhe Qin, Kaichun Mo, Yikuan Xia, Hao Zhu, Fangchen Liu, Minghua Liu, Hanxiao Jiang, Yifu Yuan, He Wang, et al. Sapien: A simulated part-based interactive environment. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 11097–11107, 2020

2020
[45]

Rldg: Robotic generalist policy distillation via reinforce- ment learning.Robotics: Science and Systems XXI, 2025

Charles Xu, Qiyang Li, Jianlan Luo, and Sergey Levine. Rldg: Robotic generalist policy distillation via reinforce- ment learning.Robotics: Science and Systems XXI, 2025

2025
[46]

Learning fine-grained bimanual manipulation with low-cost hardware.Robotics: Science and Systems XIX, 2023

Tony Zhao, Vikash Kumar, Sergey Levine, and Chelsea Finn. Learning fine-grained bimanual manipulation with low-cost hardware.Robotics: Science and Systems XIX, 2023

2023