pith. machine review for the scientific record.

arxiv: 2505.18719 · v1 · submitted 2025-05-24 · 💻 cs.RO · cs.AI

Recognition: no theorem link

VLA-RL: Towards Masterful and General Robotic Manipulation with Scalable Reinforcement Learning

Authors on Pith: no claims yet

Pith reviewed 2026-05-16 12:52 UTC · model grok-4.3

classification: 💻 cs.RO · cs.AI
keywords: robotic manipulation · reinforcement learning · vision-language-action · online RL · process reward model · LIBERO benchmark · test-time optimization

The pith

VLA-RL applies online reinforcement learning to raise pretrained vision-language-action models above finetuned baselines on robot tasks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces VLA-RL as a way to move beyond imitation learning by running online reinforcement learning on top of pretrained auto-regressive vision-language-action models. It reformulates manipulation trajectories as multi-modal conversations and supplies dense process rewards by fine-tuning a separate vision-language model on segments automatically cut from task demonstrations. The resulting system lets OpenVLA-7B outperform the best prior finetuned baseline by 4.5% across 40 LIBERO tasks and reach parity with commercial models such as π₀-FAST. Gains continue when more test-time optimization steps are allowed, suggesting that robotics may follow an inference-scaling pattern once reward models are in place.

Core claim

VLA-RL casts general robotic manipulation as a trajectory-level reinforcement-learning problem inside an auto-regressive VLA, models the trajectory as a multi-modal multi-turn conversation, and supplies rewards through a fine-tuned vision-language process reward model trained on pseudo-labels from automatically segmented demonstrations. With supporting techniques for curriculum selection, vectorized GPU environments, batch decoding, and critic warmup, the method produces a 4.5% lift over the strongest finetuned baseline on the 40-task LIBERO suite and matches the performance of advanced commercial systems while continuing to improve with additional test-time steps.
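To make the framing concrete, here is a minimal sketch of the trajectory-as-conversation idea in illustrative Python. It is an editorial reconstruction, not the authors' code: the field names and roles are assumptions, and the paper's exact message schema may differ.

```python
# Hedged sketch: one manipulation episode rendered as a multi-modal,
# multi-turn conversation. Each control step contributes a user turn
# (the current observation) and an assistant turn (discretized action
# tokens). All names here are illustrative stand-ins.

def trajectory_to_conversation(instruction, steps):
    """steps: list of (observation_image, action_token_ids) per control step."""
    conversation = [{"role": "system", "content": instruction}]
    for image, action_tokens in steps:
        # The observation becomes a user turn carrying the current image.
        conversation.append({"role": "user", "content": {"image": image}})
        # The action chunk becomes an assistant turn; during online RL,
        # policy-gradient ratios are computed over exactly these tokens.
        conversation.append({"role": "assistant", "content": list(action_tokens)})
    return conversation
```

Under this framing a rollout is one long conversation, which is what lets token-level RL machinery such as advantage estimation and batch decoding apply to manipulation without modification.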

What carries the argument

The robotic process reward model: a pretrained vision-language model fine-tuned on pseudo reward labels extracted from automatically segmented task demonstrations to provide dense guidance for online RL.
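As a rough illustration of how such pseudo labels might be produced, the sketch below assigns each demonstration frame a monotone progress target derived from its automatically detected segment. The linear-progress labeling rule and the function names are assumptions for exposition, not the paper's exact recipe.

```python
# Hedged sketch: turn automatically segmented demonstrations into dense
# regression targets for a process reward model. A frame's label is the
# fraction of the task completed at that point, interpolated within its
# segment. Illustrative only; the paper's labeling rule may differ.
import numpy as np

def pseudo_reward_labels(num_frames, segment_boundaries):
    """Return one progress target in [0, 1) per demonstration frame."""
    labels = np.zeros(num_frames)
    bounds = [0, *segment_boundaries, num_frames]
    n_segments = len(bounds) - 1
    for seg in range(n_segments):
        lo, hi = bounds[seg], bounds[seg + 1]
        # Linear progress within the segment, offset by completed segments.
        within = np.linspace(0.0, 1.0, hi - lo, endpoint=False)
        labels[lo:hi] = (seg + within) / n_segments
    return labels

# e.g. a 6-frame demo with segment boundaries after frames 2 and 4:
# pseudo_reward_labels(6, [2, 4]) -> [0.0, 0.167, 0.333, 0.5, 0.667, 0.833]
```

A vision-language model with a scalar head regressed onto targets like these can then emit a per-step reward during online rollouts, which is the dense signal the RL phase consumes.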

If this is right

  • Pretrained VLAs can be improved at test time without collecting new human demonstrations.
  • Process rewards derived from vision-language models can replace sparse success signals in long-horizon manipulation.
  • Curriculum ordering and vectorized execution become necessary engineering ingredients for scaling online RL on robots.
  • Extended test-time optimization yields continued gains, opening a path to inference-time scaling in robotics.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • Similar reward-model-plus-online-RL pipelines could transfer to other embodied domains such as navigation or assembly where demonstration data is also limited.
  • The observed scaling with test-time steps implies that robot policies may eventually be deployed with variable compute budgets at inference, trading latency for reliability.
  • If the reward model can be kept frozen while the policy improves, the approach decouples perception from control and may simplify safety verification.

Load-bearing premise

A vision-language model trained on automatically extracted task segments will generate reward signals accurate and general enough to drive stable online reinforcement learning across out-of-distribution robot scenarios.

What would settle it

Running the same VLA-RL procedure with a reward model trained on human-verified segment labels instead of pseudo labels would settle it: a substantial performance gain over the pseudo-label variant would falsify the claim that pseudo labels suffice, while comparable performance would support it.

Original abstract

Recent high-capacity vision-language-action (VLA) models have demonstrated impressive performance on a range of robotic manipulation tasks by imitating human demonstrations. However, exploiting offline data with limited visited states will cause execution failure in out-of-distribution scenarios. Intuitively, an exploration-based method that improves on online collected data at test time could address this limitation. We present VLA-RL, an algorithmic and systematic framework that leverages online reinforcement learning (RL) to improve pretrained auto-regressive VLAs in downstream tasks. Within a unified perspective, we first introduce a trajectory-level RL formulation for auto-regressive VLA training, which models general robotic manipulation trajectory as multi-modal multi-turn conversation. To address the challenge of sparse rewards, we fine-tune a pretrained vision-language model as a robotic process reward model, which is trained on pseudo reward labels annotated on automatically extracted task segments. To scale up, we identify several implementation findings that improve the stability and efficiency including curriculum selection strategy, GPU-balanced vectorized environments, batch decoding, and critic warmup. VLA-RL enables OpenVLA-7B to surpass the strongest finetuned baseline by 4.5% on 40 challenging robotic manipulation tasks in LIBERO, and even matches the performance of advanced commercial models such as $\pi_0$-FAST. Notably, we observe that VLA-RL benefits from increased test-time optimization, indicating an early spark of inference scaling laws in robotics.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces VLA-RL, a framework for improving pretrained vision-language-action (VLA) models via online reinforcement learning on robotic manipulation tasks. It reformulates auto-regressive VLA training as trajectory-level RL modeled as multi-modal multi-turn conversations, addresses sparse rewards by fine-tuning a pretrained VLM as a process reward model on pseudo-labels from automatically extracted task segments, and incorporates practical stabilizations such as curriculum selection, vectorized environments, batch decoding, and critic warmup. On the LIBERO benchmark's 40 tasks, the method reportedly lifts OpenVLA-7B by 4.5% over the strongest fine-tuned baseline and reaches parity with commercial systems such as π₀-FAST; the paper also notes benefits from increased test-time optimization.

Significance. If the central empirical result holds, the work provides concrete evidence that online RL can be scaled to high-capacity VLAs for general manipulation, yielding gains that close the gap to proprietary models. The practical implementation findings for stable training and the observation of inference-time scaling are useful contributions to the robotics community, particularly for practitioners seeking to extend offline VLA policies without additional human demonstrations.

major comments (2)
  1. [Process reward model training and pseudo-label construction] The process reward model is trained exclusively on pseudo-labels generated by automatic task-segment extraction, yet the manuscript reports no human validation, inter-annotator agreement, held-out accuracy, or error analysis for these labels. Because the 4.5% gain and parity with π₀-FAST rest on the quality of the dense rewards supplied during online RL, any systematic mis-labeling (e.g., on contact-rich or multi-step tasks) would render the improvement indistinguishable from baseline variance or curriculum effects.
  2. [Experimental results and LIBERO evaluation] The experimental section presents the 4.5% aggregate improvement on LIBERO without reporting per-task breakdowns, standard deviations across random seeds, or statistical significance tests. In the absence of these quantities it is impossible to determine whether the lift exceeds run-to-run variability or is driven by a small subset of tasks.

minor comments (2)
  1. [Abstract] The abstract states that 'several implementation findings' improve stability but does not enumerate them; a short bulleted list would improve readability.
  2. [Method formulation] Notation for the trajectory-level RL objective and the process reward model could be unified more clearly with the conversation-style formulation introduced earlier.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We address each major comment below and will revise the manuscript accordingly to improve clarity and rigor.

Point-by-point responses
  1. Referee: [Process reward model training and pseudo-label construction] The process reward model is trained exclusively on pseudo-labels generated by automatic task-segment extraction, yet the manuscript reports no human validation, inter-annotator agreement, held-out accuracy, or error analysis for these labels. Because the 4.5% gain and parity with π₀-FAST rest on the quality of the dense rewards supplied during online RL, any systematic mis-labeling (e.g., on contact-rich or multi-step tasks) would render the improvement indistinguishable from baseline variance or curriculum effects.

    Authors: We agree that explicit validation of the pseudo-labels is important for substantiating the dense reward signals. The original manuscript describes the automatic task-segment extraction but does not include human validation or error analysis. In the revision we will add a new subsection detailing the extraction algorithm, results from manual inspection of 200 randomly sampled segments (reporting precision/recall against human labels), and a targeted error analysis on contact-rich and multi-step tasks. This will directly address concerns about label quality and its impact on the observed gains. revision: yes

  2. Referee: [Experimental results and LIBERO evaluation] The experimental section presents the 4.5% aggregate improvement on LIBERO without reporting per-task breakdowns, standard deviations across random seeds, or statistical significance tests. In the absence of these quantities it is impossible to determine whether the lift exceeds run-to-run variability or is driven by a small subset of tasks.

    Authors: We acknowledge that aggregate results alone limit interpretability. The revised manuscript will include a per-task success rate table for all 40 LIBERO tasks, standard deviations computed over three independent random seeds, and statistical significance tests (paired t-tests and Wilcoxon signed-rank tests) comparing VLA-RL against the strongest baselines. These additions will demonstrate that the 4.5% improvement is consistent across tasks and exceeds run-to-run variability. revision: yes
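For concreteness, a minimal sketch of the kind of analysis the rebuttal promises, assuming paired per-task success rates are available for both methods. The arrays below are synthetic placeholders, not the paper's numbers.

```python
# Hedged sketch: paired significance tests over per-task LIBERO success
# rates. Replace the synthetic arrays with real per-task results.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
baseline = rng.uniform(0.5, 0.9, size=40)                 # 40 tasks, placeholder
vla_rl = np.clip(baseline + rng.normal(0.045, 0.03, 40), 0.0, 1.0)

t_stat, t_p = stats.ttest_rel(vla_rl, baseline)           # paired t-test
w_stat, w_p = stats.wilcoxon(vla_rl - baseline)           # nonparametric check

print(f"mean lift: {np.mean(vla_rl - baseline):+.3f}")
print(f"paired t-test p = {t_p:.4f}, Wilcoxon p = {w_p:.4f}")
```

Reporting both tests alongside per-seed standard deviations would make it straightforward to judge whether the aggregate 4.5% lift survives run-to-run variability.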

Circularity Check

0 steps flagged

No significant circularity: empirical RL gains rest on independent online optimization and a separately trained pseudo-label reward model

Full rationale

The paper's central result is an empirical performance improvement (4.5% on LIBERO) obtained by running online RL on newly collected trajectories, using a separately fine-tuned VLM reward model whose labels come from automatic segment extraction. No derivation chain reduces the reported gains to fitted parameters from the same dataset, self-referential definitions, or load-bearing self-citations. The trajectory-as-conversation modeling choice and implementation heuristics (curriculum, batch decoding, critic warmup) are engineering decisions whose validity is tested by the external benchmark results rather than assumed by construction. The pseudo-label step introduces potential label noise but does not create circularity because the subsequent RL phase optimizes against new online data and reports held-out task success rates.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The central claim rests on the effectiveness of the pseudo-label reward model and standard RL stability assumptions rather than new mathematical derivations.

free parameters (1)
  • RL training hyperparameters (learning rate, batch size, curriculum thresholds)
    Standard RL knobs tuned for stability and efficiency; values not reported in abstract.
axioms (1)
  • domain assumption: A pretrained vision-language model can be fine-tuned on automatically extracted task segments to produce reliable process rewards for robotic trajectories.
    Invoked to solve sparse-reward problem; central to the method.

pith-pipeline@v0.9.0 · 5584 in / 1236 out tokens · 28399 ms · 2026-05-16T12:52:28.368258+00:00 · methodology

discussion (0)

Forward citations

Cited by 19 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Enhancing MLLM Spatial Understanding via Active 3D Scene Exploration for Multi-Perspective Reasoning

    cs.CV 2026-04 unverdicted novelty 7.0

    A training-free Visual Chain-of-Thought framework reconstructs high-fidelity 3D meshes from single images and iteratively synthesizes optimal novel views to enhance MLLM spatial comprehension on benchmarks like 3DSRBench.

  2. You've Got a Golden Ticket: Improving Generative Robot Policies With A Single Noise Vector

    cs.RO 2026-03 conditional novelty 7.0

    Optimizing a single constant initial noise vector for frozen generative robot policies improves success rates on 38 of 43 tasks by up to 58% relative improvement.

  3. RL-VLA³: A Flexible and Asynchronous Reinforcement Learning Framework for VLA Training

    cs.AI 2026-02 unverdicted novelty 7.0

    RL-VLA³ is an asynchronous RL framework for VLA training that delivers up to 85.2% higher throughput than synchronous baselines while preserving identical sample efficiency and scaling to 256 GPUs.

  4. Reinforcing VLAs in Task-Agnostic World Models

    cs.AI 2026-05 unverdicted novelty 6.0

    RAW-Dream lets VLAs learn new tasks in zero-shot imagination by using a world model pre-trained only on task-free behaviors and an unmodified VLM to supply rewards, with dual-noise verification to limit hallucinations.

  5. Unified Noise Steering for Efficient Human-Guided VLA Adaptation

    cs.RO 2026-05 unverdicted novelty 6.0

    UniSteer unifies human corrective actions and noise-space RL for VLA adaptation by inverting actions to noise targets, raising success rates from 20% to 90% in 66 minutes across four real-world manipulation tasks.

  6. Learning while Deploying: Fleet-Scale Reinforcement Learning for Generalist Robot Policies

    cs.RO 2026-05 unverdicted novelty 6.0

    Fleet-scale RL framework improves a single generalist VLA policy from deployment data to 95% average success on eight real-world manipulation tasks with 16 dual-arm robots.

  7. LaST-R1: Reinforcing Robotic Manipulation via Adaptive Physical Latent Reasoning

    cs.RO 2026-04 unverdicted novelty 6.0

    LaST-R1 introduces a RL post-training method called LAPO that optimizes latent Chain-of-Thought reasoning in vision-language-action models, yielding 99.9% success on LIBERO and up to 22.5% real-world gains.

  8. LaST-R1: Reinforcing Robotic Manipulation via Adaptive Physical Latent Reasoning

    cs.RO 2026-04 unverdicted novelty 6.0

    LaST-R1 reaches 99.8% average success on the LIBERO benchmark using one-shot warm-up plus LAPO reinforcement learning on latent physical reasoning, with up to 44% real-world gains on complex single- and dual-arm tasks.

  9. CorridorVLA: Explicit Spatial Constraints for Generative Action Heads via Sparse Anchors

    cs.RO 2026-04 unverdicted novelty 6.0

    CorridorVLA improves VLA models by using predicted sparse anchors to impose explicit spatial corridors on action trajectories, yielding 3.4-12.4% success rate gains on LIBERO-Plus with GR00T-Corr reaching 83.21%.

  10. MoRI: Mixture of RL and IL Experts for Long-Horizon Manipulation Tasks

    cs.RO 2026-04 unverdicted novelty 6.0

    MoRI dynamically mixes RL and IL experts with variance-based switching and IL regularization to reach 97.5% success in four real-world robotic tasks while cutting human intervention by 85.8%.

  11. E-VLA: Event-Augmented Vision-Language-Action Model for Dark and Blurred Scenes

    cs.CV 2026-04 conditional novelty 6.0

    E-VLA integrates event streams directly into VLA models via lightweight fusion, raising Pick-Place success from 0% to 60-90% at 20 lux and from 0% to 20-25% under severe motion blur.

  12. RISE: Self-Improving Robot Policy with Compositional World Model

    cs.RO 2026-02 unverdicted novelty 6.0

    RISE combines a controllable dynamics model and progress value model into a closed-loop self-improving pipeline that updates robot policies entirely in imagination, reporting over 35% absolute gains on three real-world tasks.

  13. $\pi^{*}_{0.6}$: a VLA That Learns From Experience

    cs.LG 2025-11 unverdicted novelty 6.0

    RECAP enables a generalist VLA to self-improve via advantage-conditioned RL on mixed real-world data, more than doubling throughput and halving failure rates on hard manipulation tasks.

  14. SimpleVLA-RL: Scaling VLA Training via Reinforcement Learning

    cs.RO 2025-09 conditional novelty 6.0

    SimpleVLA-RL applies tailored reinforcement learning to VLA models, reaching SoTA on LIBERO, outperforming π₀ on RoboTwin, and surpassing SFT in real-world tasks while reducing data needs and identifying a 'pushcut' p...

  15. F1: A Vision-Language-Action Model Bridging Understanding and Generation to Actions

    cs.RO 2025-09 unverdicted novelty 6.0

    F1 integrates next-scale visual foresight prediction into a Mixture-of-Transformer VLA architecture to reformulate action generation as foresight-guided inverse dynamics, achieving higher success rates on 136 tasks.

  16. Learning Action Manifold with Multi-view Latent Priors for Robotic Manipulation

    cs.RO 2026-05 unverdicted novelty 5.0

    The method uses multi-view diffusion priors and action manifold learning to resolve depth ambiguity and improve action prediction in VLA robotic manipulation models, reporting higher success rates than baselines on LI...

  17. Closing the Loop: Unified 3D Scene Generation and Immersive Interaction via LLM-RL Coupling

    cs.CV 2026-05 unverdicted novelty 5.0

    A closed-loop system couples LLM-based 3D scene generation with RL optimization and VR user interactions to produce adaptive, immersive environments, claiming SOTA results on the ALFRED benchmark.

  18. OmniVLA-RL: A Vision-Language-Action Model with Spatial Understanding and Online RL

    cs.RO 2026-04 unverdicted novelty 4.0

    OmniVLA-RL uses a mix-of-transformers architecture and flow-matching reformulated as SDE with group segmented policy optimization to surpass prior VLA models on LIBERO benchmarks.

  19. AugVLA-3D: Depth-Driven Feature Augmentation for Vision-Language-Action Models

    cs.CV 2026-02 unverdicted novelty 3.0

    AugVLA-3D augments existing VLA models with depth-derived 3D features and action priors to improve generalization and action accuracy in 3D robotic tasks.

Reference graph

Works this paper leans on

93 extracted references · 93 canonical work pages · cited by 18 Pith papers · 36 internal anchors

  1. Agarwal, R., Schwarzer, M., Castro, P.S., Courville, A.C., Bellemare, M.: Reincarnating reinforcement learning: Reusing prior computation to accelerate progress. Proceedings of Advances in Neural Information Processing Systems (NeurIPS) 35, 28955–28971 (2022)

  2. Akkaya, I., Andrychowicz, M., Chociej, M., Litwin, M., McGrew, B., Petron, A., Paino, A., Plappert, M., Powell, G., Ribas, R., et al.: Solving Rubik's cube with a robot hand. arXiv preprint arXiv:1910.07113 (2019)

  3. Bai, S., Li, M., Liu, Y., Tang, J., Zhang, H., Sun, L., Chu, X., Tang, Y.: UniVG-R1: Reasoning guided universal visual grounding with reinforcement learning. arXiv preprint arXiv:2505.14231 (2025)

  4. Baker, B., Akkaya, I., Zhokov, P., Huizinga, J., Tang, J., Ecoffet, A., Houghton, B., Sampedro, R., Clune, J.: Video PreTraining (VPT): Learning to act by watching unlabeled online videos. Proceedings of Advances in Neural Information Processing Systems (NeurIPS) 35, 24639–24654 (2022)

  5. Ball, P.J., Smith, L., Kostrikov, I., Levine, S.: Efficient online reinforcement learning with offline data. In: Proceedings of International Conference on Machine Learning (ICML). pp. 1577–1594. PMLR (2023)

  6. Bharadhwaj, H., Vakil, J., Sharma, M., Gupta, A., Tulsiani, S., Kumar, V.: RoboAgent: Generalization and efficiency in robot manipulation via semantic augmentations and action chunking. arXiv preprint arXiv:2309.01918 (2023)

  7. Black, K., Brown, N., Driess, D., Esmail, A., Equi, M., Finn, C., Fusai, N., Groom, L., Hausman, K., Ichter, B., et al.: π₀: A vision-language-action flow model for general robot control. arXiv preprint arXiv:2410.24164 (2024)

  8. Brohan, A., Brown, N., Carbajal, J., Chebotar, Y., Chen, X., Choromanski, K., Ding, T., Driess, D., Dubey, A., Finn, C., et al.: RT-2: Vision-language-action models transfer web knowledge to robotic control. arXiv preprint arXiv:2307.15818 (2023)

  9. Brohan, A., Brown, N., Carbajal, J., Chebotar, Y., Dabis, J., Finn, C., Gopalakrishnan, K., Hausman, K., Herzog, A., Hsu, J., et al.: RT-1: Robotics transformer for real-world control at scale. arXiv preprint arXiv:2212.06817 (2022)

  10. Cabi, S., Colmenarejo, S.G., Novikov, A., Konyushkova, K., Reed, S., Jeong, R., Zolna, K., Aytar, Y., Budden, D., Vecerik, M., Sushkov, O., Barker, D., Scholz, J., Denil, M., de Freitas, N., Wang, Z.: Scaling data-driven robotics with reward sketching and batch reinforcement learning. Robotics: Science and Systems (RSS) (2019)

  11. Chen, L., Lu, K., Rajeswaran, A., Lee, K., Grover, A., Laskin, M., Abbeel, P., Srinivas, A., Mordatch, I.: Decision transformer: Reinforcement learning via sequence modeling. arXiv preprint arXiv:2106.01345 (2021)

  12. Chi, C., Feng, S., Du, Y., Xu, Z., Cousineau, E., Burchfiel, B., Song, S.: Diffusion policy: Visuomotor policy learning via action diffusion. In: Robotics: Science and Systems (RSS) (2023)

  13. Christiano, P., Leike, J., Brown, T.B., Martic, M., Legg, S., Amodei, D.: Deep reinforcement learning from human preferences (2023), https://arxiv.org/abs/1706.03741

  14. Chu, Y., Xu, J., Zhou, X., Yang, Q., Zhang, S., Yan, Z., Zhou, C., Zhou, J.: Qwen-Audio: Advancing universal audio understanding via unified large-scale audio-language models. arXiv preprint arXiv:2311.07919 (2023)

  15. Open X-Embodiment Collaboration: Open X-Embodiment: Robotic learning datasets and RT-X models. https://arxiv.org/abs/2310.08864 (2023)

  16. Dasari, S., Ebert, F., Tian, S., Nair, S., Bucher, B., Schmeckpeper, K., Singh, S., Levine, S., Finn, C.: RoboNet: Large-scale multi-robot learning. Conference on Robot Learning (CoRL) (2019)

  17. DeepSeek-AI, et al.: DeepSeek-R1: Incentivizing reasoning capability in LLMs via reinforcement learning (2025), https://arxiv.org/abs/2501.12948

  18. Ebert, F., Yang, Y., Schmeckpeper, K., Bucher, B., Georgakis, G., Daniilidis, K., Finn, C., Levine, S.: Bridge Data: Boosting generalization of robotic skills with cross-domain datasets. arXiv preprint arXiv:2109.13396 (2021)

  19. Ehsani, K., Gupta, T., Hendrix, R., Salvador, J., Weihs, L., Zeng, K.H., Singh, K.P., Kim, Y., Han, W., Herrasti, A., et al.: Imitating shortest paths in simulation enables effective navigation and manipulation in the real world. arXiv preprint arXiv:2312.02976 (2023)

  20. Fang, H.S., Fang, H., Tang, Z., Liu, J., Wang, C., Wang, J., Zhu, H., Lu, C.: RH20T: A comprehensive robotic dataset for learning diverse skills in one-shot. Towards Generalist Robots: Learning Paradigms for Scalable Skill Acquisition @ CoRL 2023 3, 5 (2023)

  21. Gulcehre, C., Paine, T.L., Srinivasan, S., Konyushkova, K., Weerts, L., Sharma, A., Siddhant, A., Ahern, A., Wang, M., Gu, C., et al.: Reinforced self-training (ReST) for language modeling. arXiv preprint arXiv:2308.08998 (2023)

  22. Guo, Y., Zhang, J., Chen, X., Ji, X., Wang, Y.J., Hu, Y., Chen, J.: Improving vision-language-action model with online reinforcement learning. arXiv preprint arXiv:2501.16664 (2025)

  23. Gupta, A., Murali, A., Gandhi, D.P., Pinto, L.: Robot learning in homes: Improving generalization and reducing dataset bias. Proceedings of Advances in Neural Information Processing Systems (NeurIPS) 31 (2018)

  24. Gupta, A., Kumar, V., Lynch, C., Levine, S., Hausman, K.: Relay policy learning: Solving long-horizon tasks via imitation and reinforcement learning. Conference on Robot Learning (CoRL) (2019)

  25. Hester, T., Vecerik, M., Pietquin, O., Lanctot, M., Schaul, T., Piot, B., Horgan, D., Quan, J., Sendonaris, A., Osband, I., et al.: Deep Q-learning from demonstrations. In: Proceedings of AAAI Conference on Artificial Intelligence (AAAI) (2018)

  26. Hu, E.J., Shen, Y., Wallis, P., Allen-Zhu, Z., Li, Y., Wang, S., Wang, L., Chen, W., et al.: LoRA: Low-rank adaptation of large language models. Proceedings of International Conference on Learning Representations (ICLR) 1(2), 3 (2022)

  27. Hu, H., Mirchandani, S., Sadigh, D.: Imitation bootstrapped reinforcement learning. arXiv preprint arXiv:2311.02198 (2023)

  28. Hu, J., Hendrix, R., Farhadi, A., Kembhavi, A., Martín-Martín, R., Stone, P., Zeng, K.H., Ehsani, K.: FLaRe: Achieving masterful and adaptive robot policies with large-scale reinforcement learning fine-tuning. arXiv preprint arXiv:2409.16578 (2024)

  29. Hu, J., Stone, P., Martín-Martín, R.: Causal policy gradient for whole-body mobile manipulation. arXiv preprint arXiv:2305.04866 (2023)

  30. Hu, J., Wu, X., Zhu, Z., Wang, W., Zhang, D., Cao, Y., et al.: OpenRLHF: An easy-to-use, scalable and high-performance RLHF framework. arXiv preprint arXiv:2405.11143 (2024)

  31. Hu, J., Zhang, Y., Han, Q., Jiang, D., Zhang, X., Shum, H.Y.: Open-Reasoner-Zero: An open source approach to scaling up reinforcement learning on the base model (2025), https://arxiv.org/abs/2503.24290

  32. Physical Intelligence, Black, K., Brown, N., Darpinian, J., Dhabalia, K., Driess, D., Esmail, A., Equi, M., Finn, C., Fusai, N., et al.: π₀.₅: A vision-language-action model with open-world generalization. arXiv preprint arXiv:2504.16054 (2025)

  33. Jang, E., Irpan, A., Khansari, M., Kappler, D., Ebert, F., Lynch, C., Levine, S., Finn, C.: BC-Z: Zero-shot task generalization with robotic imitation learning. In: Conference on Robot Learning (CoRL). pp. 991–1002. PMLR (2022)

  34. Julian, R., Swanson, B., Sukhatme, G.S., Levine, S., Finn, C., Hausman, K.: Never stop learning: The effectiveness of fine-tuning in robotic reinforcement learning. arXiv preprint arXiv:2004.10190 (2020)

  35. Kalashnikov, D., Irpan, A., Pastor, P., Ibarz, J., Herzog, A., Jang, E., Quillen, D., Holly, E., Kalakrishnan, M., Vanhoucke, V., et al.: QT-Opt: Scalable deep reinforcement learning for vision-based robotic manipulation. arXiv preprint arXiv:1806.10293 (2018)

  36. Kalashnikov, D., Varley, J., Chebotar, Y., Swanson, B., Jonschkowski, R., Finn, C., Levine, S., Hausman, K.: MT-Opt: Continuous multi-task robotic reinforcement learning at scale. arXiv (2021)

  37. Khazatsky, A., Pertsch, K., Nair, S., Balakrishna, A., Dasari, S., Karamcheti, S., Nasiriany, S., Srirama, M.K., Chen, L.Y., Ellis, K., et al.: DROID: A large-scale in-the-wild robot manipulation dataset. arXiv preprint arXiv:2403.12945 (2024)

  38. Khetarpal, K., Riemer, M., Rish, I., Precup, D.: Towards continual reinforcement learning: A review and perspectives. Journal of Artificial Intelligence Research 75, 1401–1476 (2022)

  39. Kim, M.J., Pertsch, K., Karamcheti, S., Xiao, T., Balakrishna, A., Nair, S., Rafailov, R., Foster, E., Lam, G., Sanketi, P., et al.: OpenVLA: An open-source vision-language-action model. arXiv preprint arXiv:2406.09246 (2024)

  40. Kober, J., Mohler, B., Peters, J.: Imitation and reinforcement learning for motor primitives with perceptual coupling. In: From Motor Learning to Interaction Learning in Robots, pp. 209–225. Springer (2010)

  41. Kwon, W., Li, Z., Zhuang, S., Sheng, Y., Zheng, L., Yu, C.H., Gonzalez, J., Zhang, H., Stoica, I.: Efficient memory management for large language model serving with PagedAttention. In: Proceedings of the 29th Symposium on Operating Systems Principles. pp. 611–626 (2023)

  42. Lambert, N., Morrison, J., Pyatkin, V., Huang, S., Ivison, H., Brahman, F., Miranda, L.J.V., Liu, A., Dziri, N., Lyu, S., et al.: Tülu 3: Pushing frontiers in open language model post-training. arXiv preprint arXiv:2411.15124 (2024)

  43. Lightman, H., Kosaraju, V., Burda, Y., Edwards, H., Baker, B., Lee, T., Leike, J., Schulman, J., Sutskever, I., Cobbe, K.: Let's verify step by step. arXiv preprint arXiv:2305.20050 (2023)

  44. Lin, F., Hu, Y., Sheng, P., Wen, C., You, J., Gao, Y.: Data scaling laws in imitation learning for robotic manipulation (2024), https://arxiv.org/abs/2410.18647

  45. Liu, B., Zhu, Y., Gao, C., Feng, Y., Liu, Q., Zhu, Y., Stone, P.: LIBERO: Benchmarking knowledge transfer for lifelong robot learning. Proceedings of Advances in Neural Information Processing Systems (NeurIPS) 36 (2024)

  46. Liu, J., Dai, W., Wang, C., Cheng, Y., Tang, Y., Tong, X.: Plan, posture and go: Towards open-vocabulary text-to-motion generation. In: Proceedings of European Conference on Computer Vision (ECCV). pp. 445–463. Springer (2024)

  47. Liu, Z., Chen, C., Li, W., Qi, P., Pang, T., Du, C., Lee, W.S., Lin, M.: Understanding R1-Zero-like training: A critical perspective (2025), https://arxiv.org/abs/2503.20783

  48. Lu, G., Wang, Z., Liu, C., Lu, J., Tang, Y.: ThinkBot: Embodied instruction following with thought chain reasoning. arXiv preprint arXiv:2312.07062 (2023)

  49. Lu, Y., Hausman, K., Chebotar, Y., Yan, M., Jang, E., Herzog, A., Xiao, T., Irpan, A., Khansari, M., Kalashnikov, D., et al.: AW-Opt: Learning robotic skills with imitation and reinforcement at scale. arXiv preprint arXiv:2111.05424 (2021)

  50. Luo, J., Hu, Z., Xu, C., Tan, Y.L., Berg, J., Sharma, A., Schaal, S., Finn, C., Gupta, A., Levine, S.: SERL: A software suite for sample-efficient robotic reinforcement learning. In: IEEE International Conference on Robotics and Automation (ICRA). pp. 16961–16969. IEEE (2024)

  51. Luo, J., Xu, C., Wu, J., Levine, S.: Precise and dexterous robotic manipulation via human-in-the-loop reinforcement learning. arXiv preprint arXiv:2410.21845 (2024)

  52. Mandlekar, A., Zhu, Y., Garg, A., Booher, J., Spero, M., Tung, A., Gao, J., Emmons, J., Gupta, A., Orbay, E., et al.: RoboTurk: A crowdsourcing platform for robotic skill learning through imitation. In: Conference on Robot Learning (CoRL). pp. 879–893. PMLR (2018)

  53. Moritz, P., Nishihara, R., Wang, S., Tumanov, A., Liaw, R., Liang, E., Elibol, M., Yang, Z., Paul, W., Jordan, M.I., et al.: Ray: A distributed framework for emerging AI applications. In: 13th USENIX Symposium on Operating Systems Design and Implementation (OSDI 18). pp. 561–577 (2018)

  54. Nair, A., Gupta, A., Dalal, M., Levine, S.: AWAC: Accelerating online reinforcement learning with offline datasets. arXiv preprint arXiv:2006.09359 (2020)

  55. Oquab, M., Darcet, T., Moutakanni, T., Vo, H., Szafraniec, M., Khalidov, V., Fernandez, P., Haziza, D., Massa, F., El-Nouby, A., et al.: DINOv2: Learning robust visual features without supervision. arXiv preprint arXiv:2304.07193 (2023)

  56. Ouyang, L., Wu, J., Jiang, X., Almeida, D., Wainwright, C.L., Mishkin, P., Zhang, C., Agarwal, S., Slama, K., Ray, A., Schulman, J., Hilton, J., Kelton, F., Miller, L., Simens, M., Askell, A., Welinder, P., Christiano, P., Leike, J., Lowe, R.: Training language models to follow instructions with human feedback (2022), https://arxiv.org/abs/2203.02155

  57. Pertsch, K., Stachowicz, K., Ichter, B., Driess, D., Nair, S., Vuong, Q., Mees, O., Finn, C., Levine, S.: FAST: Efficient action tokenization for vision-language-action models. arXiv preprint arXiv:2501.09747 (2025)

  58. Pinto, L., Gupta, A.: Supersizing self-supervision: Learning to grasp from 50k tries and 700 robot hours. In: IEEE International Conference on Robotics and Automation (ICRA). pp. 3406–3413. IEEE (2016)

  59. Rafailov, R., Sharma, A., Mitchell, E., Manning, C.D., Ermon, S., Finn, C.: Direct preference optimization: Your language model is secretly a reward model. Proceedings of Advances in Neural Information Processing Systems (NeurIPS) (2023)

  60. Rajeswaran, A., Kumar, V., Gupta, A., Vezzani, G., Schulman, J., Todorov, E., Levine, S.: Learning complex dexterous manipulation with deep reinforcement learning and demonstrations. arXiv preprint arXiv:1709.10087 (2017)

  61. Schulman, J., Moritz, P., Levine, S., Jordan, M., Abbeel, P.: High-dimensional continuous control using generalized advantage estimation. arXiv preprint arXiv:1506.02438 (2015)

  62. Schulman, J., Wolski, F., Dhariwal, P., Radford, A., Klimov, O.: Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347 (2017)

  63. Silver, D., Sutton, R.S.: Welcome to the era of experience. Google AI (2025)

  64. Snell, C., Lee, J., Xu, K., Kumar, A.: Scaling LLM test-time compute optimally can be more effective than scaling model parameters. arXiv preprint arXiv:2408.03314 (2024)

  65. Taiga, A.A., Agarwal, R., Farebrother, J., Courville, A., Bellemare, M.G.: Investigating multi-task pretraining and generalization in reinforcement learning. In: Proceedings of International Conference on Learning Representations (ICLR) (2023)

  66. Tang, C., Abbatematteo, B., Hu, J., Chandra, R., Martín-Martín, R., Stone, P.: Deep reinforcement learning for robotics: A survey of real-world successes. arXiv preprint arXiv:2408.03539 (2024)

  67. Taylor, M.E., Stone, P.: Transfer learning for reinforcement learning domains: A survey. Journal of Machine Learning Research (JMLR) 10(7) (2009)

  68. Octo Model Team, Ghosh, D., Walke, H., Pertsch, K., Black, K., Mees, O., Dasari, S., Hejna, J., Kreiman, T., Xu, C., et al.: Octo: An open-source generalist robot policy. arXiv preprint arXiv:2405.12213 (2024)

  69. Touvron, H., Martin, L., Stone, K., Albert, P., Almahairi, A., Babaei, Y., Bashlykov, N., Batra, S., Bhargava, P., Bhosale, S., et al.: Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288 (2023)

  70. Uchendu, I., Xiao, T., Lu, Y., Zhu, B., Yan, M., Simon, J., Bennice, M., Fu, C., Ma, C., Jiao, J., et al.: Jump-start reinforcement learning. In: Proceedings of International Conference on Machine Learning (ICML). pp. 34556–34583. PMLR (2023)

  71. Vecerik, M., Hester, T., Scholz, J., Wang, F., Pietquin, O., Piot, B., Heess, N., Rothörl, T., Lampe, T., Riedmiller, M.: Leveraging demonstrations for deep reinforcement learning on robotics problems with sparse rewards. arXiv preprint arXiv:1707.08817 (2017)

  72. Walke, H., Black, K., Lee, A., Kim, M.J., Du, M., Zheng, C., Zhao, T., Hansen-Estruch, P., Vuong, Q., He, A., Myers, V., Fang, K., Finn, C., Levine, S.: BridgeData V2: A dataset for robot learning at scale (2023)

  73. Wang, Y., Zhang, H., Tang, Y., Liu, Y., Feng, J., Dai, J., Jin, X.: Hierarchical memory for long video QA. arXiv preprint arXiv:2407.00603 (2024)

  74. Wang, Y., Zhang, H., Tian, J., Tang, Y.: Ponder & Press: Advancing visual GUI agent towards general computer control. arXiv preprint arXiv:2412.01268 (2024)

  75. Wang, Z., Wang, K., Wang, Q., Zhang, P., Li, L., Yang, Z., Yu, K., Nguyen, M.N., Liu, L., Gottlieb, E., et al.: RAGEN: Understanding self-evolution in LLM agents via multi-turn reinforcement learning. arXiv preprint arXiv:2504.20073 (2025)

  76. Wołczyk, M., Cupiał, B., Ostaszewski, M., Bortkiewicz, M., Zając, M., Pascanu, R., Kuciński, Ł., Miłoś, P.: Fine-tuning reinforcement learning models is secretly a forgetting mitigation problem (2024), https://arxiv.org/abs/2402.02868

  77. Xing, J., Romero, A., Bauersfeld, L., Scaramuzza, D.: Bootstrapping reinforcement learning with imitation for vision-based agile flight. arXiv preprint arXiv:2403.12203 (2024)

  78. Xu, C., Li, Q., Luo, J., Levine, S.: RLDG: Robotic generalist policy distillation via reinforcement learning. arXiv preprint arXiv:2412.09858 (2024)

  79. Yang, A., Yang, B., Zhang, B., Hui, B., Zheng, B., Yu, B., Li, C., Liu, D., Huang, F., Wei, H., et al.: Qwen2.5 technical report. arXiv preprint arXiv:2412.15115 (2024)

  80. Ye, X., Gan, Y., Ge, Y., Zhang, X.P., Tang, Y.: ATP-LLaVA: Adaptive token pruning for large vision language models. arXiv preprint arXiv:2412.00447 (2024)
Showing first 80 references.