pith. machine review for the scientific record.

arxiv: 2505.18719 · v1 · submitted 2025-05-24 · 💻 cs.RO · cs.AI

Recognition: no theorem link

VLA-RL: Towards Masterful and General Robotic Manipulation with Scalable Reinforcement Learning

Authors on Pith: no claims yet

Pith reviewed 2026-05-16 12:52 UTC · model grok-4.3

classification: 💻 cs.RO · cs.AI
keywords: robotic manipulation · reinforcement learning · vision-language-action · online RL · process reward model · LIBERO benchmark · test-time optimization

The pith

VLA-RL applies online reinforcement learning to raise pretrained vision-language-action models above finetuned baselines on robot tasks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces VLA-RL as a way to move beyond imitation learning by running online reinforcement learning on top of pretrained auto-regressive vision-language-action models. It reformulates manipulation trajectories as multi-modal conversations and supplies dense process rewards by fine-tuning a separate vision-language model on segments automatically cut from task demonstrations. The resulting system lets OpenVLA-7B outperform the best prior finetuned baseline by 4.5% across 40 LIBERO tasks and reach parity with commercial models such as π₀-FAST. Gains continue when more test-time optimization steps are allowed, suggesting that robotics may follow an inference-scaling pattern once reward models are in place.

Core claim

VLA-RL casts general robotic manipulation as a trajectory-level reinforcement-learning problem inside an auto-regressive VLA, models the trajectory as a multi-modal multi-turn conversation, and supplies rewards through a fine-tuned vision-language process reward model trained on pseudo-labels from automatically segmented demonstrations. With supporting techniques for curriculum selection, vectorized GPU environments, batch decoding, and critic warmup, the method produces a 4.5% lift over the strongest finetuned baseline on the 40-task LIBERO suite and matches the performance of advanced commercial systems while continuing to improve with additional test-time steps.
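To make the framing concrete, here is a minimal sketch of the trajectory-as-conversation idea in illustrative Python. It is an editorial reconstruction, not the authors' code: the field names and roles are assumptions, and the paper's exact message schema may differ.

```python
# Hedged sketch: one manipulation episode rendered as a multi-modal,
# multi-turn conversation. Each control step contributes a user turn
# (the current observation) and an assistant turn (discretized action
# tokens). All names here are illustrative stand-ins.

def trajectory_to_conversation(instruction, steps):
    """steps: list of (observation_image, action_token_ids) per control step."""
    conversation = [{"role": "system", "content": instruction}]
    for image, action_tokens in steps:
        # The observation becomes a user turn carrying the current image.
        conversation.append({"role": "user", "content": {"image": image}})
        # The action chunk becomes an assistant turn; during online RL,
        # policy-gradient ratios are computed over exactly these tokens.
        conversation.append({"role": "assistant", "content": list(action_tokens)})
    return conversation
```

Under this framing a rollout is one long conversation, which is what lets token-level RL machinery such as advantage estimation and batch decoding apply to manipulation without modification.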

What carries the argument

The robotic process reward model: a pretrained vision-language model fine-tuned on pseudo reward labels extracted from automatically segmented task demonstrations to provide dense guidance for online RL.
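As a rough illustration of how such pseudo labels might be produced, the sketch below assigns each demonstration frame a monotone progress target derived from its automatically detected segment. The linear-progress labeling rule and the function names are assumptions for exposition, not the paper's exact recipe.

```python
# Hedged sketch: turn automatically segmented demonstrations into dense
# regression targets for a process reward model. A frame's label is the
# fraction of the task completed at that point, interpolated within its
# segment. Illustrative only; the paper's labeling rule may differ.
import numpy as np

def pseudo_reward_labels(num_frames, segment_boundaries):
    """Return one progress target in [0, 1) per demonstration frame."""
    labels = np.zeros(num_frames)
    bounds = [0, *segment_boundaries, num_frames]
    n_segments = len(bounds) - 1
    for seg in range(n_segments):
        lo, hi = bounds[seg], bounds[seg + 1]
        # Linear progress within the segment, offset by completed segments.
        within = np.linspace(0.0, 1.0, hi - lo, endpoint=False)
        labels[lo:hi] = (seg + within) / n_segments
    return labels

# e.g. a 6-frame demo with segment boundaries after frames 2 and 4:
# pseudo_reward_labels(6, [2, 4]) -> [0.0, 0.167, 0.333, 0.5, 0.667, 0.833]
```

A vision-language model with a scalar head regressed onto targets like these can then emit a per-step reward during online rollouts, which is the dense signal the RL phase consumes.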

If this is right

  • Pretrained VLAs can be improved at test time without collecting new human demonstrations.
  • Process rewards derived from vision-language models can replace sparse success signals in long-horizon manipulation.
  • Curriculum ordering and vectorized execution become necessary engineering ingredients for scaling online RL on robots.
  • Extended test-time optimization yields continued gains, opening a path to inference-time scaling in robotics.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • Similar reward-model-plus-online-RL pipelines could transfer to other embodied domains such as navigation or assembly where demonstration data is also limited.
  • The observed scaling with test-time steps implies that robot policies may eventually be deployed with variable compute budgets at inference, trading latency for reliability.
  • If the reward model can be kept frozen while the policy improves, the approach decouples perception from control and may simplify safety verification.

Load-bearing premise

A vision-language model trained on automatically extracted task segments will generate reward signals accurate and general enough to drive stable online reinforcement learning across out-of-distribution robot scenarios.

What would settle it

Running the same VLA-RL procedure with a reward model trained on human-verified segment labels instead of pseudo labels would settle it: a substantial performance gain over the pseudo-label variant would falsify the claim that pseudo labels suffice, while comparable performance would support it.

Original abstract

Recent high-capacity vision-language-action (VLA) models have demonstrated impressive performance on a range of robotic manipulation tasks by imitating human demonstrations. However, exploiting offline data with limited visited states will cause execution failure in out-of-distribution scenarios. Intuitively, an exploration-based method that improves on online collected data at test time could address this limitation. We present VLA-RL, an algorithmic and systematic framework that leverages online reinforcement learning (RL) to improve pretrained auto-regressive VLAs in downstream tasks. Within a unified perspective, we first introduce a trajectory-level RL formulation for auto-regressive VLA training, which models general robotic manipulation trajectory as multi-modal multi-turn conversation. To address the challenge of sparse rewards, we fine-tune a pretrained vision-language model as a robotic process reward model, which is trained on pseudo reward labels annotated on automatically extracted task segments. To scale up, we identify several implementation findings that improve the stability and efficiency including curriculum selection strategy, GPU-balanced vectorized environments, batch decoding, and critic warmup. VLA-RL enables OpenVLA-7B to surpass the strongest finetuned baseline by 4.5% on 40 challenging robotic manipulation tasks in LIBERO, and even matches the performance of advanced commercial models such as $\pi_0$-FAST. Notably, we observe that VLA-RL benefits from increased test-time optimization, indicating an early spark of inference scaling laws in robotics.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces VLA-RL, a framework for improving pretrained vision-language-action (VLA) models via online reinforcement learning on robotic manipulation tasks. It reformulates auto-regressive VLA training as trajectory-level RL modeled as multi-modal multi-turn conversations, addresses sparse rewards by fine-tuning a pretrained VLM as a process reward model on pseudo-labels from automatically extracted task segments, and incorporates practical stabilizations such as curriculum selection, vectorized environments, batch decoding, and critic warmup. On the LIBERO benchmark's 40 tasks, the method reportedly lifts OpenVLA-7B by 4.5% over the strongest fine-tuned baseline and reaches parity with commercial systems such as π₀-FAST; the paper also notes benefits from increased test-time optimization.

Significance. If the central empirical result holds, the work provides concrete evidence that online RL can be scaled to high-capacity VLAs for general manipulation, yielding gains that close the gap to proprietary models. The practical implementation findings for stable training and the observation of inference-time scaling are useful contributions to the robotics community, particularly for practitioners seeking to extend offline VLA policies without additional human demonstrations.

major comments (2)
  1. [Process reward model training and pseudo-label construction] The process reward model is trained exclusively on pseudo-labels generated by automatic task-segment extraction, yet the manuscript reports no human validation, inter-annotator agreement, held-out accuracy, or error analysis for these labels. Because the 4.5% gain and parity with π₀-FAST rest on the quality of the dense rewards supplied during online RL, any systematic mis-labeling (e.g., on contact-rich or multi-step tasks) would render the improvement indistinguishable from baseline variance or curriculum effects.
  2. [Experimental results and LIBERO evaluation] The experimental section presents the 4.5% aggregate improvement on LIBERO without reporting per-task breakdowns, standard deviations across random seeds, or statistical significance tests. In the absence of these quantities it is impossible to determine whether the lift exceeds run-to-run variability or is driven by a small subset of tasks.

minor comments (2)
  1. [Abstract] The abstract states that 'several implementation findings' improve stability but does not enumerate them; a short bulleted list would improve readability.
  2. [Method formulation] Notation for the trajectory-level RL objective and the process reward model could be unified more clearly with the conversation-style formulation introduced earlier.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We address each major comment below and will revise the manuscript accordingly to improve clarity and rigor.

Point-by-point responses
  1. Referee: [Process reward model training and pseudo-label construction] The process reward model is trained exclusively on pseudo-labels generated by automatic task-segment extraction, yet the manuscript reports no human validation, inter-annotator agreement, held-out accuracy, or error analysis for these labels. Because the 4.5% gain and parity with π₀-FAST rest on the quality of the dense rewards supplied during online RL, any systematic mis-labeling (e.g., on contact-rich or multi-step tasks) would render the improvement indistinguishable from baseline variance or curriculum effects.

    Authors: We agree that explicit validation of the pseudo-labels is important for substantiating the dense reward signals. The original manuscript describes the automatic task-segment extraction but does not include human validation or error analysis. In the revision we will add a new subsection detailing the extraction algorithm, results from manual inspection of 200 randomly sampled segments (reporting precision/recall against human labels), and a targeted error analysis on contact-rich and multi-step tasks. This will directly address concerns about label quality and its impact on the observed gains. revision: yes

  2. Referee: [Experimental results and LIBERO evaluation] The experimental section presents the 4.5% aggregate improvement on LIBERO without reporting per-task breakdowns, standard deviations across random seeds, or statistical significance tests. In the absence of these quantities it is impossible to determine whether the lift exceeds run-to-run variability or is driven by a small subset of tasks.

    Authors: We acknowledge that aggregate results alone limit interpretability. The revised manuscript will include a per-task success rate table for all 40 LIBERO tasks, standard deviations computed over three independent random seeds, and statistical significance tests (paired t-tests and Wilcoxon signed-rank tests) comparing VLA-RL against the strongest baselines. These additions will demonstrate that the 4.5% improvement is consistent across tasks and exceeds run-to-run variability. revision: yes
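For concreteness, a minimal sketch of the kind of analysis the rebuttal promises, assuming paired per-task success rates are available for both methods. The arrays below are synthetic placeholders, not the paper's numbers.

```python
# Hedged sketch: paired significance tests over per-task LIBERO success
# rates. Replace the synthetic arrays with real per-task results.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
baseline = rng.uniform(0.5, 0.9, size=40)                 # 40 tasks, placeholder
vla_rl = np.clip(baseline + rng.normal(0.045, 0.03, 40), 0.0, 1.0)

t_stat, t_p = stats.ttest_rel(vla_rl, baseline)           # paired t-test
w_stat, w_p = stats.wilcoxon(vla_rl - baseline)           # nonparametric check

print(f"mean lift: {np.mean(vla_rl - baseline):+.3f}")
print(f"paired t-test p = {t_p:.4f}, Wilcoxon p = {w_p:.4f}")
```

Reporting both tests alongside per-seed standard deviations would make it straightforward to judge whether the aggregate 4.5% lift survives run-to-run variability.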

Circularity Check

0 steps flagged

No significant circularity: empirical RL gains rest on independent online optimization and a separately trained pseudo-label reward model

Full rationale

The paper's central result is an empirical performance improvement (4.5% on LIBERO) obtained by running online RL on newly collected trajectories, using a separately fine-tuned VLM reward model whose labels come from automatic segment extraction. No derivation chain reduces the reported gains to fitted parameters from the same dataset, self-referential definitions, or load-bearing self-citations. The trajectory-as-conversation modeling choice and implementation heuristics (curriculum, batch decoding, critic warmup) are engineering decisions whose validity is tested by the external benchmark results rather than assumed by construction. The pseudo-label step introduces potential label noise but does not create circularity because the subsequent RL phase optimizes against new online data and reports held-out task success rates.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The central claim rests on the effectiveness of the pseudo-label reward model and standard RL stability assumptions rather than new mathematical derivations.

free parameters (1)
  • RL training hyperparameters (learning rate, batch size, curriculum thresholds)
    Standard RL knobs tuned for stability and efficiency; values not reported in abstract.
axioms (1)
  • domain assumption: A pretrained vision-language model can be fine-tuned on automatically extracted task segments to produce reliable process rewards for robotic trajectories.
    Invoked to solve sparse-reward problem; central to the method.

pith-pipeline@v0.9.0 · 5584 in / 1236 out tokens · 28399 ms · 2026-05-16T12:52:28.368258+00:00 · methodology

discussion (0)

Forward citations

Cited by 19 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Enhancing MLLM Spatial Understanding via Active 3D Scene Exploration for Multi-Perspective Reasoning

    cs.CV 2026-04 unverdicted novelty 7.0

    A training-free Visual Chain-of-Thought framework reconstructs high-fidelity 3D meshes from single images and iteratively synthesizes optimal novel views to enhance MLLM spatial comprehension on benchmarks like 3DSRBench.

  2. You've Got a Golden Ticket: Improving Generative Robot Policies With A Single Noise Vector

    cs.RO 2026-03 conditional novelty 7.0

    Optimizing a single constant initial noise vector for frozen generative robot policies improves success rates on 38 of 43 tasks by up to 58% relative improvement.

  3. RL-VLA³: A Flexible and Asynchronous Reinforcement Learning Framework for VLA Training

    cs.AI 2026-02 unverdicted novelty 7.0

    RL-VLA³ is an asynchronous RL framework for VLA training that delivers up to 85.2% higher throughput than synchronous baselines while preserving identical sample efficiency and scaling to 256 GPUs.

  4. Reinforcing VLAs in Task-Agnostic World Models

    cs.AI 2026-05 unverdicted novelty 6.0

    RAW-Dream lets VLAs learn new tasks in zero-shot imagination by using a world model pre-trained only on task-free behaviors and an unmodified VLM to supply rewards, with dual-noise verification to limit hallucinations.

  5. Unified Noise Steering for Efficient Human-Guided VLA Adaptation

    cs.RO 2026-05 unverdicted novelty 6.0

    UniSteer unifies human corrective actions and noise-space RL for VLA adaptation by inverting actions to noise targets, raising success rates from 20% to 90% in 66 minutes across four real-world manipulation tasks.

  6. Learning while Deploying: Fleet-Scale Reinforcement Learning for Generalist Robot Policies

    cs.RO 2026-05 unverdicted novelty 6.0

    Fleet-scale RL framework improves a single generalist VLA policy from deployment data to 95% average success on eight real-world manipulation tasks with 16 dual-arm robots.

  7. LaST-R1: Reinforcing Robotic Manipulation via Adaptive Physical Latent Reasoning

    cs.RO 2026-04 unverdicted novelty 6.0

    LaST-R1 introduces a RL post-training method called LAPO that optimizes latent Chain-of-Thought reasoning in vision-language-action models, yielding 99.9% success on LIBERO and up to 22.5% real-world gains.

  8. LaST-R1: Reinforcing Robotic Manipulation via Adaptive Physical Latent Reasoning

    cs.RO 2026-04 unverdicted novelty 6.0

    LaST-R1 reaches 99.8% average success on the LIBERO benchmark using one-shot warm-up plus LAPO reinforcement learning on latent physical reasoning, with up to 44% real-world gains on complex single- and dual-arm tasks.

  9. CorridorVLA: Explicit Spatial Constraints for Generative Action Heads via Sparse Anchors

    cs.RO 2026-04 unverdicted novelty 6.0

    CorridorVLA improves VLA models by using predicted sparse anchors to impose explicit spatial corridors on action trajectories, yielding 3.4-12.4% success rate gains on LIBERO-Plus with GR00T-Corr reaching 83.21%.

  10. MoRI: Mixture of RL and IL Experts for Long-Horizon Manipulation Tasks

    cs.RO 2026-04 unverdicted novelty 6.0

    MoRI dynamically mixes RL and IL experts with variance-based switching and IL regularization to reach 97.5% success in four real-world robotic tasks while cutting human intervention by 85.8%.

  11. E-VLA: Event-Augmented Vision-Language-Action Model for Dark and Blurred Scenes

    cs.CV 2026-04 conditional novelty 6.0

    E-VLA integrates event streams directly into VLA models via lightweight fusion, raising Pick-Place success from 0% to 60-90% at 20 lux and from 0% to 20-25% under severe motion blur.

  12. RISE: Self-Improving Robot Policy with Compositional World Model

    cs.RO 2026-02 unverdicted novelty 6.0

    RISE combines a controllable dynamics model and progress value model into a closed-loop self-improving pipeline that updates robot policies entirely in imagination, reporting over 35% absolute gains on three real-world tasks.

  13. $\pi^{*}_{0.6}$: a VLA That Learns From Experience

    cs.LG 2025-11 unverdicted novelty 6.0

    RECAP enables a generalist VLA to self-improve via advantage-conditioned RL on mixed real-world data, more than doubling throughput and halving failure rates on hard manipulation tasks.

  14. SimpleVLA-RL: Scaling VLA Training via Reinforcement Learning

    cs.RO 2025-09 conditional novelty 6.0

    SimpleVLA-RL applies tailored reinforcement learning to VLA models, reaching SoTA on LIBERO, outperforming π₀ on RoboTwin, and surpassing SFT in real-world tasks while reducing data needs and identifying a 'pushcut' p...

  15. F1: A Vision-Language-Action Model Bridging Understanding and Generation to Actions

    cs.RO 2025-09 unverdicted novelty 6.0

    F1 integrates next-scale visual foresight prediction into a Mixture-of-Transformer VLA architecture to reformulate action generation as foresight-guided inverse dynamics, achieving higher success rates on 136 tasks.

  16. Learning Action Manifold with Multi-view Latent Priors for Robotic Manipulation

    cs.RO 2026-05 unverdicted novelty 5.0

    The method uses multi-view diffusion priors and action manifold learning to resolve depth ambiguity and improve action prediction in VLA robotic manipulation models, reporting higher success rates than baselines on LI...

  17. Closing the Loop: Unified 3D Scene Generation and Immersive Interaction via LLM-RL Coupling

    cs.CV 2026-05 unverdicted novelty 5.0

    A closed-loop system couples LLM-based 3D scene generation with RL optimization and VR user interactions to produce adaptive, immersive environments, claiming SOTA results on the ALFRED benchmark.

  18. OmniVLA-RL: A Vision-Language-Action Model with Spatial Understanding and Online RL

    cs.RO 2026-04 unverdicted novelty 4.0

    OmniVLA-RL uses a mix-of-transformers architecture and flow-matching reformulated as SDE with group segmented policy optimization to surpass prior VLA models on LIBERO benchmarks.

  19. AugVLA-3D: Depth-Driven Feature Augmentation for Vision-Language-Action Models

    cs.CV 2026-02 unverdicted novelty 3.0

    AugVLA-3D augments existing VLA models with depth-derived 3D features and action priors to improve generalization and action accuracy in 3D robotic tasks.

Reference graph

Works this paper leans on

93 extracted references · 93 canonical work pages · cited by 18 Pith papers · 36 internal anchors

  1. Agarwal, R., Schwarzer, M., Castro, P.S., Courville, A.C., Bellemare, M.: Reincarnating reinforcement learning: Reusing prior computation to accelerate progress. Proceedings of Advances in Neural Information Processing Systems (NeurIPS) 35, 28955–28971 (2022)

  2. Akkaya, I., Andrychowicz, M., Chociej, M., Litwin, M., McGrew, B., Petron, A., Paino, A., Plappert, M., Powell, G., Ribas, R., et al.: Solving Rubik's cube with a robot hand. arXiv preprint arXiv:1910.07113 (2019)

  3. Bai, S., Li, M., Liu, Y., Tang, J., Zhang, H., Sun, L., Chu, X., Tang, Y.: UniVG-R1: Reasoning guided universal visual grounding with reinforcement learning. arXiv preprint arXiv:2505.14231 (2025)

  4. Baker, B., Akkaya, I., Zhokov, P., Huizinga, J., Tang, J., Ecoffet, A., Houghton, B., Sampedro, R., Clune, J.: Video PreTraining (VPT): Learning to act by watching unlabeled online videos. Proceedings of Advances in Neural Information Processing Systems (NeurIPS) 35, 24639–24654 (2022)

  5. Ball, P.J., Smith, L., Kostrikov, I., Levine, S.: Efficient online reinforcement learning with offline data. In: Proceedings of International Conference on Machine Learning (ICML). pp. 1577–1594. PMLR (2023)

  6. Bharadhwaj, H., Vakil, J., Sharma, M., Gupta, A., Tulsiani, S., Kumar, V.: RoboAgent: Generalization and efficiency in robot manipulation via semantic augmentations and action chunking. arXiv preprint arXiv:2309.01918 (2023)

  7. Black, K., Brown, N., Driess, D., Esmail, A., Equi, M., Finn, C., Fusai, N., Groom, L., Hausman, K., Ichter, B., et al.: π₀: A vision-language-action flow model for general robot control. arXiv preprint arXiv:2410.24164 (2024)

  8. Brohan, A., Brown, N., Carbajal, J., Chebotar, Y., Chen, X., Choromanski, K., Ding, T., Driess, D., Dubey, A., Finn, C., et al.: RT-2: Vision-language-action models transfer web knowledge to robotic control. arXiv preprint arXiv:2307.15818 (2023)

  9. Brohan, A., Brown, N., Carbajal, J., Chebotar, Y., Dabis, J., Finn, C., Gopalakrishnan, K., Hausman, K., Herzog, A., Hsu, J., et al.: RT-1: Robotics transformer for real-world control at scale. arXiv preprint arXiv:2212.06817 (2022)

  10. Cabi, S., Colmenarejo, S.G., Novikov, A., Konyushkova, K., Reed, S., Jeong, R., Zolna, K., Aytar, Y., Budden, D., Vecerik, M., Sushkov, O., Barker, D., Scholz, J., Denil, M., de Freitas, N., Wang, Z.: Scaling data-driven robotics with reward sketching and batch reinforcement learning. Robotics: Science and Systems (RSS) (2019)

  11. Chen, L., Lu, K., Rajeswaran, A., Lee, K., Grover, A., Laskin, M., Abbeel, P., Srinivas, A., Mordatch, I.: Decision transformer: Reinforcement learning via sequence modeling. arXiv preprint arXiv:2106.01345 (2021)

  12. Chi, C., Feng, S., Du, Y., Xu, Z., Cousineau, E., Burchfiel, B., Song, S.: Diffusion policy: Visuomotor policy learning via action diffusion. In: Robotics: Science and Systems (RSS) (2023)

  13. Christiano, P., Leike, J., Brown, T.B., Martic, M., Legg, S., Amodei, D.: Deep reinforcement learning from human preferences (2023), https://arxiv.org/abs/1706.03741

  14. Chu, Y., Xu, J., Zhou, X., Yang, Q., Zhang, S., Yan, Z., Zhou, C., Zhou, J.: Qwen-Audio: Advancing universal audio understanding via unified large-scale audio-language models. arXiv preprint arXiv:2311.07919 (2023)

  15. Open X-Embodiment Collaboration: Open X-Embodiment: Robotic learning datasets and RT-X models. https://arxiv.org/abs/2310.08864 (2023)

  16. Dasari, S., Ebert, F., Tian, S., Nair, S., Bucher, B., Schmeckpeper, K., Singh, S., Levine, S., Finn, C.: RoboNet: Large-scale multi-robot learning. Conference on Robot Learning (CoRL) (2019)

  17. DeepSeek-AI, et al.: DeepSeek-R1: Incentivizing reasoning capability in LLMs via reinforcement learning (2025), https://arxiv.org/abs/2501.12948

  18. Ebert, F., Yang, Y., Schmeckpeper, K., Bucher, B., Georgakis, G., Daniilidis, K., Finn, C., Levine, S.: Bridge Data: Boosting generalization of robotic skills with cross-domain datasets. arXiv preprint arXiv:2109.13396 (2021)

  19. Ehsani, K., Gupta, T., Hendrix, R., Salvador, J., Weihs, L., Zeng, K.H., Singh, K.P., Kim, Y., Han, W., Herrasti, A., et al.: Imitating shortest paths in simulation enables effective navigation and manipulation in the real world. arXiv preprint arXiv:2312.02976 (2023)

  20. Fang, H.S., Fang, H., Tang, Z., Liu, J., Wang, C., Wang, J., Zhu, H., Lu, C.: RH20T: A comprehensive robotic dataset for learning diverse skills in one-shot. Towards Generalist Robots: Learning Paradigms for Scalable Skill Acquisition @ CoRL 2023 3, 5 (2023)

  21. Gulcehre, C., Paine, T.L., Srinivasan, S., Konyushkova, K., Weerts, L., Sharma, A., Siddhant, A., Ahern, A., Wang, M., Gu, C., et al.: Reinforced self-training (ReST) for language modeling. arXiv preprint arXiv:2308.08998 (2023)

  22. Guo, Y., Zhang, J., Chen, X., Ji, X., Wang, Y.J., Hu, Y., Chen, J.: Improving vision-language-action model with online reinforcement learning. arXiv preprint arXiv:2501.16664 (2025)

  23. Gupta, A., Murali, A., Gandhi, D.P., Pinto, L.: Robot learning in homes: Improving generalization and reducing dataset bias. Proceedings of Advances in Neural Information Processing Systems (NeurIPS) 31 (2018)

  24. Gupta, A., Kumar, V., Lynch, C., Levine, S., Hausman, K.: Relay policy learning: Solving long-horizon tasks via imitation and reinforcement learning. Conference on Robot Learning (CoRL) (2019)

  25. Hester, T., Vecerik, M., Pietquin, O., Lanctot, M., Schaul, T., Piot, B., Horgan, D., Quan, J., Sendonaris, A., Osband, I., et al.: Deep Q-learning from demonstrations. In: Proceedings of AAAI Conference on Artificial Intelligence (AAAI) (2018)

  26. Hu, E.J., Shen, Y., Wallis, P., Allen-Zhu, Z., Li, Y., Wang, S., Wang, L., Chen, W., et al.: LoRA: Low-rank adaptation of large language models. Proceedings of International Conference on Learning Representations (ICLR) 1(2), 3 (2022)

  27. Hu, H., Mirchandani, S., Sadigh, D.: Imitation bootstrapped reinforcement learning. arXiv preprint arXiv:2311.02198 (2023)

  28. Hu, J., Hendrix, R., Farhadi, A., Kembhavi, A., Martín-Martín, R., Stone, P., Zeng, K.H., Ehsani, K.: FLaRe: Achieving masterful and adaptive robot policies with large-scale reinforcement learning fine-tuning. arXiv preprint arXiv:2409.16578 (2024)

  29. Hu, J., Stone, P., Martín-Martín, R.: Causal policy gradient for whole-body mobile manipulation. arXiv preprint arXiv:2305.04866 (2023)

  30. Hu, J., Wu, X., Zhu, Z., Wang, W., Zhang, D., Cao, Y., et al.: OpenRLHF: An easy-to-use, scalable and high-performance RLHF framework. arXiv preprint arXiv:2405.11143 (2024)

  31. Hu, J., Zhang, Y., Han, Q., Jiang, D., Zhang, X., Shum, H.Y.: Open-Reasoner-Zero: An open source approach to scaling up reinforcement learning on the base model (2025), https://arxiv.org/abs/2503.24290

  32. Physical Intelligence, Black, K., Brown, N., Darpinian, J., Dhabalia, K., Driess, D., Esmail, A., Equi, M., Finn, C., Fusai, N., et al.: π₀.₅: A vision-language-action model with open-world generalization. arXiv preprint arXiv:2504.16054 (2025)

  33. Jang, E., Irpan, A., Khansari, M., Kappler, D., Ebert, F., Lynch, C., Levine, S., Finn, C.: BC-Z: Zero-shot task generalization with robotic imitation learning. In: Conference on Robot Learning (CoRL). pp. 991–1002. PMLR (2022)

  34. Julian, R., Swanson, B., Sukhatme, G.S., Levine, S., Finn, C., Hausman, K.: Never stop learning: The effectiveness of fine-tuning in robotic reinforcement learning. arXiv preprint arXiv:2004.10190 (2020)

  35. Kalashnikov, D., Irpan, A., Pastor, P., Ibarz, J., Herzog, A., Jang, E., Quillen, D., Holly, E., Kalakrishnan, M., Vanhoucke, V., et al.: QT-Opt: Scalable deep reinforcement learning for vision-based robotic manipulation. arXiv preprint arXiv:1806.10293 (2018)

  36. Kalashnikov, D., Varley, J., Chebotar, Y., Swanson, B., Jonschkowski, R., Finn, C., Levine, S., Hausman, K.: MT-Opt: Continuous multi-task robotic reinforcement learning at scale. arXiv (2021)

  37. Khazatsky, A., Pertsch, K., Nair, S., Balakrishna, A., Dasari, S., Karamcheti, S., Nasiriany, S., Srirama, M.K., Chen, L.Y., Ellis, K., et al.: DROID: A large-scale in-the-wild robot manipulation dataset. arXiv preprint arXiv:2403.12945 (2024)

  38. Khetarpal, K., Riemer, M., Rish, I., Precup, D.: Towards continual reinforcement learning: A review and perspectives. Journal of Artificial Intelligence Research 75, 1401–1476 (2022)

  39. Kim, M.J., Pertsch, K., Karamcheti, S., Xiao, T., Balakrishna, A., Nair, S., Rafailov, R., Foster, E., Lam, G., Sanketi, P., et al.: OpenVLA: An open-source vision-language-action model. arXiv preprint arXiv:2406.09246 (2024)

  40. Kober, J., Mohler, B., Peters, J.: Imitation and reinforcement learning for motor primitives with perceptual coupling. In: From Motor Learning to Interaction Learning in Robots, pp. 209–225. Springer (2010)

  41. Kwon, W., Li, Z., Zhuang, S., Sheng, Y., Zheng, L., Yu, C.H., Gonzalez, J., Zhang, H., Stoica, I.: Efficient memory management for large language model serving with PagedAttention. In: Proceedings of the 29th Symposium on Operating Systems Principles. pp. 611–626 (2023)

  42. Lambert, N., Morrison, J., Pyatkin, V., Huang, S., Ivison, H., Brahman, F., Miranda, L.J.V., Liu, A., Dziri, N., Lyu, S., et al.: Tülu 3: Pushing frontiers in open language model post-training. arXiv preprint arXiv:2411.15124 (2024)

  43. Lightman, H., Kosaraju, V., Burda, Y., Edwards, H., Baker, B., Lee, T., Leike, J., Schulman, J., Sutskever, I., Cobbe, K.: Let's verify step by step. arXiv preprint arXiv:2305.20050 (2023)

  44. Lin, F., Hu, Y., Sheng, P., Wen, C., You, J., Gao, Y.: Data scaling laws in imitation learning for robotic manipulation (2024), https://arxiv.org/abs/2410.18647

  45. Liu, B., Zhu, Y., Gao, C., Feng, Y., Liu, Q., Zhu, Y., Stone, P.: LIBERO: Benchmarking knowledge transfer for lifelong robot learning. Proceedings of Advances in Neural Information Processing Systems (NeurIPS) 36 (2024)

  46. Liu, J., Dai, W., Wang, C., Cheng, Y., Tang, Y., Tong, X.: Plan, posture and go: Towards open-vocabulary text-to-motion generation. In: Proceedings of European Conference on Computer Vision (ECCV). pp. 445–463. Springer (2024)

  47. Liu, Z., Chen, C., Li, W., Qi, P., Pang, T., Du, C., Lee, W.S., Lin, M.: Understanding R1-Zero-like training: A critical perspective (2025), https://arxiv.org/abs/2503.20783

  48. Lu, G., Wang, Z., Liu, C., Lu, J., Tang, Y.: ThinkBot: Embodied instruction following with thought chain reasoning. arXiv preprint arXiv:2312.07062 (2023)

  49. Lu, Y., Hausman, K., Chebotar, Y., Yan, M., Jang, E., Herzog, A., Xiao, T., Irpan, A., Khansari, M., Kalashnikov, D., et al.: AW-Opt: Learning robotic skills with imitation and reinforcement at scale. arXiv preprint arXiv:2111.05424 (2021)

  50. Luo, J., Hu, Z., Xu, C., Tan, Y.L., Berg, J., Sharma, A., Schaal, S., Finn, C., Gupta, A., Levine, S.: SERL: A software suite for sample-efficient robotic reinforcement learning. In: IEEE International Conference on Robotics and Automation (ICRA). pp. 16961–16969. IEEE (2024)

  51. Luo, J., Xu, C., Wu, J., Levine, S.: Precise and dexterous robotic manipulation via human-in-the-loop reinforcement learning. arXiv preprint arXiv:2410.21845 (2024)

  52. Mandlekar, A., Zhu, Y., Garg, A., Booher, J., Spero, M., Tung, A., Gao, J., Emmons, J., Gupta, A., Orbay, E., et al.: RoboTurk: A crowdsourcing platform for robotic skill learning through imitation. In: Conference on Robot Learning (CoRL). pp. 879–893. PMLR (2018)

  53. Moritz, P., Nishihara, R., Wang, S., Tumanov, A., Liaw, R., Liang, E., Elibol, M., Yang, Z., Paul, W., Jordan, M.I., et al.: Ray: A distributed framework for emerging AI applications. In: 13th USENIX Symposium on Operating Systems Design and Implementation (OSDI 18). pp. 561–577 (2018)

  54. Nair, A., Gupta, A., Dalal, M., Levine, S.: AWAC: Accelerating online reinforcement learning with offline datasets. arXiv preprint arXiv:2006.09359 (2020)

  55. Oquab, M., Darcet, T., Moutakanni, T., Vo, H., Szafraniec, M., Khalidov, V., Fernandez, P., Haziza, D., Massa, F., El-Nouby, A., et al.: DINOv2: Learning robust visual features without supervision. arXiv preprint arXiv:2304.07193 (2023)

  56. Ouyang, L., Wu, J., Jiang, X., Almeida, D., Wainwright, C.L., Mishkin, P., Zhang, C., Agarwal, S., Slama, K., Ray, A., Schulman, J., Hilton, J., Kelton, F., Miller, L., Simens, M., Askell, A., Welinder, P., Christiano, P., Leike, J., Lowe, R.: Training language models to follow instructions with human feedback (2022), https://arxiv.org/abs/2203.02155

  57. Pertsch, K., Stachowicz, K., Ichter, B., Driess, D., Nair, S., Vuong, Q., Mees, O., Finn, C., Levine, S.: FAST: Efficient action tokenization for vision-language-action models. arXiv preprint arXiv:2501.09747 (2025)

  58. Pinto, L., Gupta, A.: Supersizing self-supervision: Learning to grasp from 50k tries and 700 robot hours. In: IEEE International Conference on Robotics and Automation (ICRA). pp. 3406–3413. IEEE (2016)

  59. Rafailov, R., Sharma, A., Mitchell, E., Manning, C.D., Ermon, S., Finn, C.: Direct preference optimization: Your language model is secretly a reward model. Proceedings of Advances in Neural Information Processing Systems (NeurIPS) (2023)

  60. Rajeswaran, A., Kumar, V., Gupta, A., Vezzani, G., Schulman, J., Todorov, E., Levine, S.: Learning complex dexterous manipulation with deep reinforcement learning and demonstrations. arXiv preprint arXiv:1709.10087 (2017)

  61. Schulman, J., Moritz, P., Levine, S., Jordan, M., Abbeel, P.: High-dimensional continuous control using generalized advantage estimation. arXiv preprint arXiv:1506.02438 (2015)

  62. Schulman, J., Wolski, F., Dhariwal, P., Radford, A., Klimov, O.: Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347 (2017)

  63. Silver, D., Sutton, R.S.: Welcome to the era of experience. Google AI (2025)

  64. Snell, C., Lee, J., Xu, K., Kumar, A.: Scaling LLM test-time compute optimally can be more effective than scaling model parameters. arXiv preprint arXiv:2408.03314 (2024)

  65. Taiga, A.A., Agarwal, R., Farebrother, J., Courville, A., Bellemare, M.G.: Investigating multi-task pretraining and generalization in reinforcement learning. In: Proceedings of International Conference on Learning Representations (ICLR) (2023)

  66. Tang, C., Abbatematteo, B., Hu, J., Chandra, R., Martín-Martín, R., Stone, P.: Deep reinforcement learning for robotics: A survey of real-world successes. arXiv preprint arXiv:2408.03539 (2024)

  67. Taylor, M.E., Stone, P.: Transfer learning for reinforcement learning domains: A survey. Journal of Machine Learning Research (JMLR) 10(7) (2009)

  68. Octo Model Team, Ghosh, D., Walke, H., Pertsch, K., Black, K., Mees, O., Dasari, S., Hejna, J., Kreiman, T., Xu, C., et al.: Octo: An open-source generalist robot policy. arXiv preprint arXiv:2405.12213 (2024)

  69. Touvron, H., Martin, L., Stone, K., Albert, P., Almahairi, A., Babaei, Y., Bashlykov, N., Batra, S., Bhargava, P., Bhosale, S., et al.: Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288 (2023)

  70. Uchendu, I., Xiao, T., Lu, Y., Zhu, B., Yan, M., Simon, J., Bennice, M., Fu, C., Ma, C., Jiao, J., et al.: Jump-start reinforcement learning. In: Proceedings of International Conference on Machine Learning (ICML). pp. 34556–34583. PMLR (2023)

  71. Vecerik, M., Hester, T., Scholz, J., Wang, F., Pietquin, O., Piot, B., Heess, N., Rothörl, T., Lampe, T., Riedmiller, M.: Leveraging demonstrations for deep reinforcement learning on robotics problems with sparse rewards. arXiv preprint arXiv:1707.08817 (2017)

  72. Walke, H., Black, K., Lee, A., Kim, M.J., Du, M., Zheng, C., Zhao, T., Hansen-Estruch, P., Vuong, Q., He, A., Myers, V., Fang, K., Finn, C., Levine, S.: BridgeData V2: A dataset for robot learning at scale (2023)

  73. Wang, Y., Zhang, H., Tang, Y., Liu, Y., Feng, J., Dai, J., Jin, X.: Hierarchical memory for long video QA. arXiv preprint arXiv:2407.00603 (2024)

  74. Wang, Y., Zhang, H., Tian, J., Tang, Y.: Ponder & Press: Advancing visual GUI agent towards general computer control. arXiv preprint arXiv:2412.01268 (2024)

  75. Wang, Z., Wang, K., Wang, Q., Zhang, P., Li, L., Yang, Z., Yu, K., Nguyen, M.N., Liu, L., Gottlieb, E., et al.: RAGEN: Understanding self-evolution in LLM agents via multi-turn reinforcement learning. arXiv preprint arXiv:2504.20073 (2025)

  76. Wołczyk, M., Cupiał, B., Ostaszewski, M., Bortkiewicz, M., Zając, M., Pascanu, R., Kuciński, Ł., Miłoś, P.: Fine-tuning reinforcement learning models is secretly a forgetting mitigation problem (2024), https://arxiv.org/abs/2402.02868

  77. Xing, J., Romero, A., Bauersfeld, L., Scaramuzza, D.: Bootstrapping reinforcement learning with imitation for vision-based agile flight. arXiv preprint arXiv:2403.12203 (2024)

  78. Xu, C., Li, Q., Luo, J., Levine, S.: RLDG: Robotic generalist policy distillation via reinforcement learning. arXiv preprint arXiv:2412.09858 (2024)

  79. Yang, A., Yang, B., Zhang, B., Hui, B., Zheng, B., Yu, B., Li, C., Liu, D., Huang, F., Wei, H., et al.: Qwen2.5 technical report. arXiv preprint arXiv:2412.15115 (2024)

  80. Ye, X., Gan, Y., Ge, Y., Zhang, X.P., Tang, Y.: ATP-LLaVA: Adaptive token pruning for large vision language models. arXiv preprint arXiv:2412.00447 (2024)
Showing first 80 references.