Recognition: no theorem link
VLA-RL: Towards Masterful and General Robotic Manipulation with Scalable Reinforcement Learning
Pith reviewed 2026-05-16 12:52 UTC · model grok-4.3
The pith
VLA-RL applies online reinforcement learning to raise pretrained vision-language-action models above finetuned baselines on robot tasks.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
VLA-RL casts general robotic manipulation as a trajectory-level reinforcement-learning problem inside an auto-regressive VLA, models each trajectory as a multi-modal multi-turn conversation, and supplies dense rewards through a vision-language process reward model fine-tuned on pseudo-labels from automatically segmented demonstrations. With supporting techniques for curriculum selection, vectorized GPU environments, batch decoding, and critic warmup, the method yields a 4.5% lift over the strongest finetuned baseline on the 40-task LIBERO suite, matches advanced commercial systems, and continues to improve with additional test-time optimization steps.
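The loop implied by this claim — decode an action from the conversation so far, step the environment, score the step with a process reward model — can be sketched in miniature. Everything below (the toy environment, the scalar progress reward, the trivial policy) is an illustrative stand-in, not the paper's implementation:

```python
# Illustrative sketch only: a toy trajectory-level RL rollout in which each
# turn of a multi-turn "conversation" receives a dense reward from a process
# reward model instead of a single sparse end-of-episode success signal.
# StubEnv, prm_score, and the lambda policy are hypothetical stand-ins.
from dataclasses import dataclass
from typing import List

@dataclass
class Turn:
    obs: float        # observation at the start of the turn
    action: int       # decoded action (a token chunk in the real system)
    reward: float     # dense process reward assigned to this turn

class StubEnv:
    """Toy manipulation env: the state is a scalar 'distance to goal'."""
    def reset(self):
        self.dist = 3
        return self.dist
    def step(self, action):
        self.dist = max(0, self.dist - action)
        return self.dist, self.dist == 0  # (next_obs, done)

def prm_score(obs_before, obs_after):
    """Stand-in for the VLM reward model: credit measurable progress."""
    return 1.0 if obs_after < obs_before else 0.0

def rollout(policy, env, max_turns=10):
    """One episode as a multi-turn conversation with per-turn rewards."""
    obs = env.reset()
    conversation: List[Turn] = []
    for _ in range(max_turns):
        action = policy(conversation, obs)
        nxt, done = env.step(action)
        conversation.append(Turn(obs, action, prm_score(obs, nxt)))
        obs = nxt
        if done:
            break
    return conversation

# A trivial always-act policy; the real system decodes action tokens
# auto-regressively from the multi-modal conversation so far.
traj = rollout(lambda conv, obs: 1, StubEnv())
dense_return = sum(t.reward for t in traj)
```

The point of the sketch is structural: the per-turn reward is what turns a sparse-reward episode into a densely supervised one that a PPO-style update can exploit.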
What carries the argument
The robotic process reward model: a pretrained vision-language model fine-tuned on pseudo reward labels extracted from automatically segmented task demonstrations to provide dense guidance for online RL.
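The review does not specify the exact form of the pseudo reward labels; one plausible reading — an assumption, labeled as such — is that automatically detected segment boundaries are converted into per-frame progress fractions:

```python
def pseudo_labels(demo_len, boundaries):
    """Hypothetical pseudo-label scheme: given a demonstration of demo_len
    frames and automatically detected segment boundaries (frame indices at
    which a sub-task completes), assign each frame the fraction of sub-tasks
    completed so far. These fractions would serve as dense reward targets
    for fine-tuning the vision-language reward model."""
    labels = []
    total = len(boundaries)
    done = 0
    for t in range(demo_len):
        while done < total and t >= boundaries[done]:
            done += 1
        labels.append(done / total)
    return labels

# A 10-frame demo with sub-task completions detected at frames 4 and 8:
labels = pseudo_labels(10, [4, 8])
# frames 0-3 -> 0.0, frames 4-7 -> 0.5, frames 8-9 -> 1.0
```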
If this is right
- Pretrained VLAs can be improved at test time without collecting new human demonstrations.
- Process rewards derived from vision-language models can replace sparse success signals in long-horizon manipulation.
- Curriculum ordering and vectorized execution become necessary engineering ingredients for scaling online RL on robots.
- Extended test-time optimization yields continued gains, opening a path to inference-time scaling in robotics.
Where Pith is reading between the lines
- Similar reward-model-plus-online-RL pipelines could transfer to other embodied domains such as navigation or assembly where demonstration data is also limited.
- The observed scaling with test-time steps implies that robot policies may eventually be deployed with variable compute budgets at inference, trading latency for reliability.
- If the reward model can be kept frozen while the policy improves, the approach decouples perception from control and may simplify safety verification.
Load-bearing premise
A vision-language model trained on automatically extracted task segments will generate reward signals accurate and general enough to drive stable online reinforcement learning across out-of-distribution robot scenarios.
What would settle it
Running the same VLA-RL procedure with a reward model trained on human-verified segment labels instead of pseudo-labels would settle it: a substantial performance gain from human verification would show the pseudo-labels were a bottleneck and falsify the claim that they suffice, while comparable performance would support it.
Original abstract
Recent high-capacity vision-language-action (VLA) models have demonstrated impressive performance on a range of robotic manipulation tasks by imitating human demonstrations. However, exploiting offline data with limited visited states will cause execution failure in out-of-distribution scenarios. Intuitively, an exploration-based method that improves on online collected data at test time could address this limitation. We present VLA-RL, an algorithmic and systematic framework that leverages online reinforcement learning (RL) to improve pretrained auto-regressive VLAs in downstream tasks. Within a unified perspective, we first introduce a trajectory-level RL formulation for auto-regressive VLA training, which models general robotic manipulation trajectory as multi-modal multi-turn conversation. To address the challenge of sparse rewards, we fine-tune a pretrained vision-language model as a robotic process reward model, which is trained on pseudo reward labels annotated on automatically extracted task segments. To scale up, we identify several implementation findings that improve the stability and efficiency including curriculum selection strategy, GPU-balanced vectorized environments, batch decoding, and critic warmup. VLA-RL enables OpenVLA-7B to surpass the strongest finetuned baseline by 4.5% on 40 challenging robotic manipulation tasks in LIBERO, and even matches the performance of advanced commercial models such as $\pi_0$-FAST. Notably, we observe that VLA-RL benefits from increased test-time optimization, indicating an early spark of inference scaling laws in robotics.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces VLA-RL, a framework for improving pretrained vision-language-action (VLA) models via online reinforcement learning on robotic manipulation tasks. It reformulates auto-regressive VLA training as trajectory-level RL modeled as multi-modal multi-turn conversations, addresses sparse rewards by fine-tuning a pretrained VLM as a process reward model on pseudo-labels from automatically extracted task segments, and incorporates practical stabilizations such as curriculum selection, vectorized environments, batch decoding, and critic warmup. On the LIBERO benchmark's 40 tasks, the method reportedly lifts OpenVLA-7B by 4.5% over the strongest fine-tuned baseline and reaches parity with commercial systems such as π₀-FAST, while also noting benefits from increased test-time optimization.
Significance. If the central empirical result holds, the work provides concrete evidence that online RL can be scaled to high-capacity VLAs for general manipulation, yielding gains that close the gap to proprietary models. The practical implementation findings for stable training and the observation of inference-time scaling are useful contributions to the robotics community, particularly for practitioners seeking to extend offline VLA policies without additional human demonstrations.
Major comments (2)
- [Process reward model training and pseudo-label construction] The process reward model is trained exclusively on pseudo-labels generated by automatic task-segment extraction, yet the manuscript reports no human validation, inter-annotator agreement, held-out accuracy, or error analysis for these labels. Because the 4.5% gain and parity with π₀-FAST rest on the quality of the dense rewards supplied during online RL, any systematic mis-labeling (e.g., on contact-rich or multi-step tasks) would render the improvement indistinguishable from baseline variance or curriculum effects.
- [Experimental results and LIBERO evaluation] The experimental section presents the 4.5% aggregate improvement on LIBERO without reporting per-task breakdowns, standard deviations across random seeds, or statistical significance tests. In the absence of these quantities it is impossible to determine whether the lift exceeds run-to-run variability or is driven by a small subset of tasks.
Minor comments (2)
- [Abstract] The abstract states that 'several implementation findings' improve stability but does not enumerate them; a short bulleted list would improve readability.
- [Method formulation] Notation for the trajectory-level RL objective and the process reward model could be unified more clearly with the conversation-style formulation introduced earlier.
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed feedback. We address each major comment below and will revise the manuscript accordingly to improve clarity and rigor.
Point-by-point responses
Referee: [Process reward model training and pseudo-label construction] The process reward model is trained exclusively on pseudo-labels generated by automatic task-segment extraction, yet the manuscript reports no human validation, inter-annotator agreement, held-out accuracy, or error analysis for these labels. Because the 4.5% gain and parity with π₀-FAST rest on the quality of the dense rewards supplied during online RL, any systematic mis-labeling (e.g., on contact-rich or multi-step tasks) would render the improvement indistinguishable from baseline variance or curriculum effects.
Authors: We agree that explicit validation of the pseudo-labels is important for substantiating the dense reward signals. The original manuscript describes the automatic task-segment extraction but does not include human validation or error analysis. In the revision we will add a new subsection detailing the extraction algorithm, results from manual inspection of 200 randomly sampled segments (reporting precision/recall against human labels), and a targeted error analysis on contact-rich and multi-step tasks. This will directly address concerns about label quality and its impact on the observed gains.
Revision: yes
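The promised manual inspection reduces to a standard precision/recall computation of pipeline labels against human labels. A minimal sketch with hypothetical numbers (none of these figures come from the paper):

```python
def precision_recall(pseudo, human):
    """Binary agreement between pseudo segment labels and human labels:
    precision = fraction of pipeline-positive frames a human also marked,
    recall = fraction of human-positive frames the pipeline recovered."""
    tp = sum(p and h for p, h in zip(pseudo, human))
    pred = sum(pseudo)
    actual = sum(human)
    precision = tp / pred if pred else 0.0
    recall = tp / actual if actual else 0.0
    return precision, recall

# Hypothetical audit of 8 frames: the pipeline marks 4 as 'sub-task
# complete'; the human annotator agrees on 3 and marks one extra frame.
p, r = precision_recall([1, 1, 1, 1, 0, 0, 0, 0],
                        [1, 1, 1, 0, 1, 0, 0, 0])
# p = 3/4, r = 3/4
```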
Referee: [Experimental results and LIBERO evaluation] The experimental section presents the 4.5% aggregate improvement on LIBERO without reporting per-task breakdowns, standard deviations across random seeds, or statistical significance tests. In the absence of these quantities it is impossible to determine whether the lift exceeds run-to-run variability or is driven by a small subset of tasks.
Authors: We acknowledge that aggregate results alone limit interpretability. The revised manuscript will include a per-task success rate table for all 40 LIBERO tasks, standard deviations computed over three independent random seeds, and statistical significance tests (paired t-tests and Wilcoxon signed-rank tests) comparing VLA-RL against the strongest baselines. These additions will demonstrate that the 4.5% improvement is consistent across tasks and exceeds run-to-run variability.
Revision: yes
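To illustrate the underlying logic of such a paired test without assuming any particular statistics library or the paper's actual numbers, a stdlib-only sign-flip permutation test over hypothetical per-task success rates:

```python
import random

def paired_permutation_test(a, b, n_perm=10000, seed=0):
    """Monte Carlo sign-flip test for paired per-task success rates.
    Under the null hypothesis (no difference between methods), each
    per-task difference is equally likely to carry either sign; the
    p-value is the fraction of sign-flipped resamples whose mean
    difference is at least as large as the observed one."""
    rng = random.Random(seed)
    diffs = [x - y for x, y in zip(a, b)]
    observed = sum(diffs) / len(diffs)
    hits = 0
    for _ in range(n_perm):
        flipped = [d if rng.random() < 0.5 else -d for d in diffs]
        if sum(flipped) / len(flipped) >= observed:
            hits += 1
    return hits / n_perm

# Hypothetical success rates for 8 tasks (not the paper's data):
vla_rl   = [0.92, 0.88, 0.75, 0.81, 0.95, 0.70, 0.85, 0.90]
baseline = [0.85, 0.86, 0.70, 0.78, 0.90, 0.69, 0.80, 0.84]
p_value = paired_permutation_test(vla_rl, baseline)
```

A paired t-test or Wilcoxon signed-rank test formalizes the same question; the permutation version makes the null hypothesis explicit with no distributional assumptions.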
Circularity Check
No significant circularity: empirical RL gains rest on independent online optimization and pseudo-label reward model
Full rationale
The paper's central result is an empirical performance improvement (4.5% on LIBERO) obtained by running online RL on newly collected trajectories, using a separately fine-tuned VLM reward model whose labels come from automatic segment extraction. No derivation chain reduces the reported gains to fitted parameters from the same dataset, self-referential definitions, or load-bearing self-citations. The trajectory-as-conversation modeling choice and implementation heuristics (curriculum, batch decoding, critic warmup) are engineering decisions whose validity is tested by the external benchmark results rather than assumed by construction. The pseudo-label step introduces potential label noise but does not create circularity because the subsequent RL phase optimizes against new online data and reports held-out task success rates.
Axiom & Free-Parameter Ledger
Free parameters (1)
- RL training hyperparameters (learning rate, batch size, curriculum thresholds)
Axioms (1)
- Domain assumption: A pretrained vision-language model can be fine-tuned on automatically extracted task segments to produce reliable process rewards for robotic trajectories.
Forward citations
Cited by 19 Pith papers
- Enhancing MLLM Spatial Understanding via Active 3D Scene Exploration for Multi-Perspective Reasoning
  A training-free Visual Chain-of-Thought framework reconstructs high-fidelity 3D meshes from single images and iteratively synthesizes optimal novel views to enhance MLLM spatial comprehension on benchmarks like 3DSRBench.
- You've Got a Golden Ticket: Improving Generative Robot Policies With A Single Noise Vector
  Optimizing a single constant initial noise vector for frozen generative robot policies improves success rates on 38 of 43 tasks by up to 58% relative improvement.
- RL-VLA$^3$: A Flexible and Asynchronous Reinforcement Learning Framework for VLA Training
  RL-VLA³ is an asynchronous RL framework for VLA training that delivers up to 85.2% higher throughput than synchronous baselines while preserving identical sample efficiency and scaling to 256 GPUs.
- Reinforcing VLAs in Task-Agnostic World Models
  RAW-Dream lets VLAs learn new tasks in zero-shot imagination by using a world model pre-trained only on task-free behaviors and an unmodified VLM to supply rewards, with dual-noise verification to limit hallucinations.
- Unified Noise Steering for Efficient Human-Guided VLA Adaptation
  UniSteer unifies human corrective actions and noise-space RL for VLA adaptation by inverting actions to noise targets, raising success rates from 20% to 90% in 66 minutes across four real-world manipulation tasks.
- Learning while Deploying: Fleet-Scale Reinforcement Learning for Generalist Robot Policies
  Fleet-scale RL framework improves a single generalist VLA policy from deployment data to 95% average success on eight real-world manipulation tasks with 16 dual-arm robots.
- LaST-R1: Reinforcing Robotic Manipulation via Adaptive Physical Latent Reasoning
  LaST-R1 introduces a RL post-training method called LAPO that optimizes latent Chain-of-Thought reasoning in vision-language-action models, yielding 99.9% success on LIBERO and up to 22.5% real-world gains.
- LaST-R1: Reinforcing Robotic Manipulation via Adaptive Physical Latent Reasoning
  LaST-R1 reaches 99.8% average success on the LIBERO benchmark using one-shot warm-up plus LAPO reinforcement learning on latent physical reasoning, with up to 44% real-world gains on complex single- and dual-arm tasks.
- CorridorVLA: Explicit Spatial Constraints for Generative Action Heads via Sparse Anchors
  CorridorVLA improves VLA models by using predicted sparse anchors to impose explicit spatial corridors on action trajectories, yielding 3.4-12.4% success rate gains on LIBERO-Plus with GR00T-Corr reaching 83.21%.
- MoRI: Mixture of RL and IL Experts for Long-Horizon Manipulation Tasks
  MoRI dynamically mixes RL and IL experts with variance-based switching and IL regularization to reach 97.5% success in four real-world robotic tasks while cutting human intervention by 85.8%.
- E-VLA: Event-Augmented Vision-Language-Action Model for Dark and Blurred Scenes
  E-VLA integrates event streams directly into VLA models via lightweight fusion, raising Pick-Place success from 0% to 60-90% at 20 lux and from 0% to 20-25% under severe motion blur.
- RISE: Self-Improving Robot Policy with Compositional World Model
  RISE combines a controllable dynamics model and progress value model into a closed-loop self-improving pipeline that updates robot policies entirely in imagination, reporting over 35% absolute gains on three real-world tasks.
- $\pi^{*}_{0.6}$: a VLA That Learns From Experience
  RECAP enables a generalist VLA to self-improve via advantage-conditioned RL on mixed real-world data, more than doubling throughput and halving failure rates on hard manipulation tasks.
- SimpleVLA-RL: Scaling VLA Training via Reinforcement Learning
  SimpleVLA-RL applies tailored reinforcement learning to VLA models, reaching SoTA on LIBERO, outperforming π₀ on RoboTwin, and surpassing SFT in real-world tasks while reducing data needs and identifying a 'pushcut' p...
- F1: A Vision-Language-Action Model Bridging Understanding and Generation to Actions
  F1 integrates next-scale visual foresight prediction into a Mixture-of-Transformer VLA architecture to reformulate action generation as foresight-guided inverse dynamics, achieving higher success rates on 136 tasks.
- Learning Action Manifold with Multi-view Latent Priors for Robotic Manipulation
  The method uses multi-view diffusion priors and action manifold learning to resolve depth ambiguity and improve action prediction in VLA robotic manipulation models, reporting higher success rates than baselines on LI...
- Closing the Loop: Unified 3D Scene Generation and Immersive Interaction via LLM-RL Coupling
  A closed-loop system couples LLM-based 3D scene generation with RL optimization and VR user interactions to produce adaptive, immersive environments, claiming SOTA results on the ALFRED benchmark.
- OmniVLA-RL: A Vision-Language-Action Model with Spatial Understanding and Online RL
  OmniVLA-RL uses a mix-of-transformers architecture and flow-matching reformulated as SDE with group segmented policy optimization to surpass prior VLA models on LIBERO benchmarks.
- AugVLA-3D: Depth-Driven Feature Augmentation for Vision-Language-Action Models
  AugVLA-3D augments existing VLA models with depth-derived 3D features and action priors to improve generalization and action accuracy in 3D robotic tasks.
Reference graph
Works this paper leans on
-
[1]
Agarwal, R., Schwarzer, M., Castro, P.S., Courville, A.C., Bellemare, M.: Reincarnating reinforcement learning: Reusing prior computation to accelerate progress. Proceedings of Advances in Neural Information Processing Systems (NeurIPS) 35, 28955–28971 (2022) 2, 3
work page 2022
-
[2]
Solving Rubik's Cube with a Robot Hand
Akkaya, I., Andrychowicz, M., Chociej, M., Litwin, M., McGrew, B., Petron, A., Paino, A., Plappert, M., Powell, G., Ribas, R., et al.: Solving rubik’s cube with a robot hand. arXiv preprint arXiv:1910.07113 (2019) 2
work page internal anchor Pith review Pith/arXiv arXiv 1910
-
[3]
arXiv preprint arXiv:2505.14231 (2025) 3
Bai, S., Li, M., Liu, Y ., Tang, J., Zhang, H., Sun, L., Chu, X., Tang, Y .: Univg-r1: Reasoning guided universal visual grounding with reinforcement learning. arXiv preprint arXiv:2505.14231 (2025) 3
-
[4]
Proceedings of Advances in Neural Information Processing Systems (NeurIPS) 35, 24639– 24654 (2022) 2
Baker, B., Akkaya, I., Zhokov, P., Huizinga, J., Tang, J., Ecoffet, A., Houghton, B., Sampedro, R., Clune, J.: Video pretraining (vpt): Learning to act by watching unlabeled online videos. Proceedings of Advances in Neural Information Processing Systems (NeurIPS) 35, 24639– 24654 (2022) 2
work page 2022
-
[5]
In: Proceedings of International Conference on Machine Learning (ICML)
Ball, P.J., Smith, L., Kostrikov, I., Levine, S.: Efficient online reinforcement learning with offline data. In: Proceedings of International Conference on Machine Learning (ICML). pp. 1577–1594. PMLR (2023) 2
work page 2023
-
[6]
arXiv preprint arXiv:2309.01918 (2023) 1, 2
Bharadhwaj, H., Vakil, J., Sharma, M., Gupta, A., Tulsiani, S., Kumar, V .: Roboagent: General- ization and efficiency in robot manipulation via semantic augmentations and action chunking. arXiv preprint arXiv:2309.01918 (2023) 1, 2
-
[7]
$\pi_0$: A Vision-Language-Action Flow Model for General Robot Control
Black, K., Brown, N., Driess, D., Esmail, A., Equi, M., Finn, C., Fusai, N., Groom, L., Hausman, K., Ichter, B., et al.: zpi_0: A vision-language-action flow model for general robot control. arXiv preprint arXiv:2410.24164 (2024) 1, 2, 9
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[8]
RT-2: Vision-Language-Action Models Transfer Web Knowledge to Robotic Control
Brohan, A., Brown, N., Carbajal, J., Chebotar, Y ., Chen, X., Choromanski, K., Ding, T., Driess, D., Dubey, A., Finn, C., et al.: Rt-2: Vision-language-action models transfer web knowledge to robotic control. arXiv preprint arXiv:2307.15818 (2023) 1, 2
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[9]
RT-1: Robotics Transformer for Real-World Control at Scale
Brohan, A., Brown, N., Carbajal, J., Chebotar, Y ., Dabis, J., Finn, C., Gopalakrishnan, K., Hausman, K., Herzog, A., Hsu, J., et al.: Rt-1: Robotics transformer for real-world control at scale. arXiv preprint arXiv:2212.06817 (2022) 1, 2
work page internal anchor Pith review Pith/arXiv arXiv 2022
-
[10]
Robotics: Science and Systems (RSS) (2019) 1
Cabi, S., Colmenarejo, S.G., Novikov, A., Konyushkova, K., Reed, S., Jeong, R., Zolna, K., Aytar, Y ., Budden, D., Vecerik, M., Sushkov, O., Barker, D., Scholz, J., Denil, M., de Freitas, N., Wang, Z.: Scaling data-driven robotics with reward sketching and batch reinforcement learning. Robotics: Science and Systems (RSS) (2019) 1
work page 2019
-
[11]
Chen, L., Lu, K., Rajeswaran, A., Lee, K., Grover, A., Laskin, M., Abbeel, P., Srinivas, A., Mordatch, I.: Decision transformer: Reinforcement learning via sequence modeling. arXiv preprint arXiv:2106.01345 (2021) 3
-
[12]
In: Robotics: Science and Systems (RSS) (2023) 6, 8
Chi, C., Feng, S., Du, Y ., Xu, Z., Cousineau, E., Burchfiel, B., Song, S.: Diffusion policy: Visuomotor policy learning via action diffusion. In: Robotics: Science and Systems (RSS) (2023) 6, 8
work page 2023
-
[13]
Christiano, P., Leike, J., Brown, T.B., Martic, M., Legg, S., Amodei, D.: Deep reinforcement learning from human preferences (2023), https://arxiv.org/abs/1706.03741 1
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[14]
Qwen-Audio: Advancing Universal Audio Understanding via Unified Large-Scale Audio-Language Models
Chu, Y ., Xu, J., Zhou, X., Yang, Q., Zhang, S., Yan, Z., Zhou, C., Zhou, J.: Qwen-audio: Advancing universal audio understanding via unified large-scale audio-language models. arXiv preprint arXiv:2311.07919 (2023) 1
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[15]
Open X-Embodiment: Robotic Learning Datasets and RT-X Models
Collaboration, O.X.E.: Open X-Embodiment: Robotic learning datasets and RT-X models. https://arxiv.org/abs/2310.08864 (2023) 1, 2
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[16]
Conference on Robot Learning (CoRL) (2019) 1, 2 10
Dasari, S., Ebert, F., Tian, S., Nair, S., Bucher, B., Schmeckpeper, K., Singh, S., Levine, S., Finn, C.: Robonet: Large-scale multi-robot learning. Conference on Robot Learning (CoRL) (2019) 1, 2 10
work page 2019
-
[17]
DeepSeek-AI, et al.: Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning (2025), https://arxiv.org/abs/2501.12948 1, 2, 3
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[18]
Bridge Data: Boosting Generalization of Robotic Skills with Cross-Domain Datasets
Ebert, F., Yang, Y ., Schmeckpeper, K., Bucher, B., Georgakis, G., Daniilidis, K., Finn, C., Levine, S.: Bridge data: Boosting generalization of robotic skills with cross-domain datasets. arXiv preprint arXiv:2109.13396 (2021) 1, 2
work page internal anchor Pith review Pith/arXiv arXiv 2021
-
[19]
arXiv preprint arXiv:2312.02976 (2023) 1, 2
Ehsani, K., Gupta, T., Hendrix, R., Salvador, J., Weihs, L., Zeng, K.H., Singh, K.P., Kim, Y ., Han, W., Herrasti, A., et al.: Imitating shortest paths in simulation enables effective navigation and manipulation in the real world. arXiv preprint arXiv:2312.02976 (2023) 1, 2
-
[20]
Fang, H.S., Fang, H., Tang, Z., Liu, J., Wang, C., Wang, J., Zhu, H., Lu, C.: Rh20t: A comprehensive robotic dataset for learning diverse skills in one-shot. Towards Generalist Robots: Learning Paradigms for Scalable Skill Acquisition@ CoRL2023 3, 5 (2023) 1, 2
work page 2023
-
[21]
Reinforced Self-Training (ReST) for Language Modeling
Gulcehre, C., Paine, T.L., Srinivasan, S., Konyushkova, K., Weerts, L., Sharma, A., Siddhant, A., Ahern, A., Wang, M., Gu, C., et al.: Reinforced self-training (rest) for language modeling. arXiv preprint arXiv:2308.08998 (2023) 3
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[22]
arXiv preprint arXiv:2501.16664 (2025) 2
Guo, Y ., Zhang, J., Chen, X., Ji, X., Wang, Y .J., Hu, Y ., Chen, J.: Improving vision-language- action model with online reinforcement learning. arXiv preprint arXiv:2501.16664 (2025) 2
-
[23]
Proceedings of Advances in Neural Information Processing Systems (NeurIPS) 31 (2018) 1
Gupta, A., Murali, A., Gandhi, D.P., Pinto, L.: Robot learning in homes: Improving general- ization and reducing dataset bias. Proceedings of Advances in Neural Information Processing Systems (NeurIPS) 31 (2018) 1
work page 2018
-
[24]
Conference on Robot Learning (CoRL) (2019) 2, 3
Gupta, A., Kumar, V ., Lynch, C., Levine, S., Hausman, K.: Relay policy learning: Solving long-horizon tasks via imitation and reinforcement learning. Conference on Robot Learning (CoRL) (2019) 2, 3
work page 2019
-
[25]
In: Proceedings of AAAI Conference on Artificial Intelligence (AAAI) (2018) 2
Hester, T., Vecerik, M., Pietquin, O., Lanctot, M., Schaul, T., Piot, B., Horgan, D., Quan, J., Sendonaris, A., Osband, I., et al.: Deep q-learning from demonstrations. In: Proceedings of AAAI Conference on Artificial Intelligence (AAAI) (2018) 2
work page 2018
-
[26]
Proceedings of International Conference on Learning Representations (ICLR) 1(2), 3 (2022) 5
Hu, E.J., Shen, Y ., Wallis, P., Allen-Zhu, Z., Li, Y ., Wang, S., Wang, L., Chen, W., et al.: Lora: Low-rank adaptation of large language models. Proceedings of International Conference on Learning Representations (ICLR) 1(2), 3 (2022) 5
work page 2022
-
[27]
arXiv preprint arXiv:2311.02198 (2023) 2
Hu, H., Mirchandani, S., Sadigh, D.: Imitation bootstrapped reinforcement learning. arXiv preprint arXiv:2311.02198 (2023) 2
-
[28]
arXiv preprint arXiv:2409.16578 (2024) 2, 9
Hu, J., Hendrix, R., Farhadi, A., Kembhavi, A., Martin-Martin, R., Stone, P., Zeng, K.H., Ehsan, K.: Flare: Achieving masterful and adaptive robot policies with large-scale reinforcement learning fine-tuning. arXiv preprint arXiv:2409.16578 (2024) 2, 9
-
[29]
arXiv preprint arXiv:2305.04866 (2023) 2
Hu, J., Stone, P., Martín-Martín, R.: Causal policy gradient for whole-body mobile manipulation. arXiv preprint arXiv:2305.04866 (2023) 2
-
[30]
OpenRLHF: An Easy-to-use, Scalable and High-performance RLHF Framework
Hu, J., Wu, X., Zhu, Z., Wang, W., Zhang, D., Cao, Y ., et al.: Openrlhf: An easy-to-use, scalable and high-performance rlhf framework. arXiv preprint arXiv:2405.11143 (2024) 6
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[31]
Hu, J., Zhang, Y ., Han, Q., Jiang, D., Zhang, X., Shum, H.Y .: Open-reasoner-zero: An open source approach to scaling up reinforcement learning on the base model (2025), https: //arxiv.org/abs/2503.24290 2, 3
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[32]
$\pi_{0.5}$: a Vision-Language-Action Model with Open-World Generalization
Intelligence, P., Black, K., Brown, N., Darpinian, J., Dhabalia, K., Driess, D., Esmail, A., Equi, M., Finn, C., Fusai, N., et al.: zpi_t0.5u: a vision-language-action model with open-world generalization. arXiv preprint arXiv:2504.16054 (2025) 1, 2
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[33]
In: Conference on Robot Learning (CoRL)
Jang, E., Irpan, A., Khansari, M., Kappler, D., Ebert, F., Lynch, C., Levine, S., Finn, C.: Bc-z: Zero-shot task generalization with robotic imitation learning. In: Conference on Robot Learning (CoRL). pp. 991–1002. PMLR (2022) 1 11
work page 2022
-
[34]
arXiv preprint arXiv:2004.10190 (2020) 2
Julian, R., Swanson, B., Sukhatme, G.S., Levine, S., Finn, C., Hausman, K.: Never stop learning: The effectiveness of fine-tuning in robotic reinforcement learning. arXiv preprint arXiv:2004.10190 (2020) 2
-
[35]
QT-Opt: Scalable Deep Reinforcement Learning for Vision-Based Robotic Manipulation
Kalashnikov, D., Irpan, A., Pastor, P., Ibarz, J., Herzog, A., Jang, E., Quillen, D., Holly, E., Kalakrishnan, M., Vanhoucke, V ., et al.: QT-Opt: Scalable deep reinforcement learning for vision-based robotic manipulation. arXiv preprint arXiv:1806.10293 (2018) 1, 2
work page internal anchor Pith review Pith/arXiv arXiv 2018
-
[36]
Kalashnkov, D., Varley, J., Chebotar, Y ., Swanson, B., Jonschkowski, R., Finn, C., Levine, S., Hausman, K.: Mt-opt: Continuous multi-task robotic reinforcement learning at scale. arXiv (2021) 1, 2
work page 2021
-
[37]
DROID: A Large-Scale In-The-Wild Robot Manipulation Dataset
Khazatsky, A., Pertsch, K., Nair, S., Balakrishna, A., Dasari, S., Karamcheti, S., Nasiriany, S., Srirama, M.K., Chen, L.Y ., Ellis, K., et al.: Droid: A large-scale in-the-wild robot manipulation dataset. arXiv preprint arXiv:2403.12945 (2024) 1, 2
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[38]
Journal of Artificial Intelligence Research75, 1401–1476 (2022) 2
Khetarpal, K., Riemer, M., Rish, I., Precup, D.: Towards continual reinforcement learning: A review and perspectives. Journal of Artificial Intelligence Research75, 1401–1476 (2022) 2
work page 2022
-
[39]
OpenVLA: An Open-Source Vision-Language-Action Model
Kim, M.J., Pertsch, K., Karamcheti, S., Xiao, T., Balakrishna, A., Nair, S., Rafailov, R., Foster, E., Lam, G., Sanketi, P., et al.: Openvla: An open-source vision-language-action model. arXiv preprint arXiv:2406.09246 (2024) 1, 2, 3, 6, 7, 8
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[40]
In: From motor learning to interaction learning in robots, pp
Kober, J., Mohler, B., Peters, J.: Imitation and reinforcement learning for motor primitives with perceptual coupling. In: From motor learning to interaction learning in robots, pp. 209–225. Springer (2010) 2, 3
work page 2010
-
[41]
In: Proceedings of the 29th Symposium on Operating Systems Principles
Kwon, W., Li, Z., Zhuang, S., Sheng, Y ., Zheng, L., Yu, C.H., Gonzalez, J., Zhang, H., Stoica, I.: Efficient memory management for large language model serving with pagedattention. In: Proceedings of the 29th Symposium on Operating Systems Principles. pp. 611–626 (2023) 6
work page 2023
-
[42]
Tulu 3: Pushing Frontiers in Open Language Model Post-Training
Lambert, N., Morrison, J., Pyatkin, V ., Huang, S., Ivison, H., Brahman, F., Miranda, L.J.V ., Liu, A., Dziri, N., Lyu, S., et al.: Tz" ulu 3: Pushing frontiers in open language model post-training. arXiv preprint arXiv:2411.15124 (2024) 6, 7
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[43]
Lightman, H., Kosaraju, V ., Burda, Y ., Edwards, H., Baker, B., Lee, T., Leike, J., Schulman, J., Sutskever, I., Cobbe, K.: Let’s verify step by step. arXiv preprint arXiv:2305.20050 (2023) 3
work page internal anchor Pith review Pith/arXiv arXiv 2023
- [44]
-
[45]
Proceedings of Advances in Neural Information Processing Systems (NeurIPS) 36 (2024) 2, 6
Liu, B., Zhu, Y ., Gao, C., Feng, Y ., Liu, Q., Zhu, Y ., Stone, P.: Libero: Benchmarking knowledge transfer for lifelong robot learning. Proceedings of Advances in Neural Information Processing Systems (NeurIPS) 36 (2024) 2, 6
work page 2024
-
[46]
In: Proceedings of European Conference on Computer Vision (ECCV)
Liu, J., Dai, W., Wang, C., Cheng, Y ., Tang, Y ., Tong, X.: Plan, posture and go: Towards open- vocabulary text-to-motion generation. In: Proceedings of European Conference on Computer Vision (ECCV). pp. 445–463. Springer (2024) 2
work page 2024
-
[47]
Liu, Z., Chen, C., Li, W., Qi, P., Tianyu Pang, C.D., Lee, W.S., Lin, M.: Understanding r1-zero- like training: A critical perspective (2025), https://arxiv.org/abs/2503.20783 2, 3
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[48]
arXiv preprint arXiv:2312.07062 (2023) 2
Lu, G., Wang, Z., Liu, C., Lu, J., Tang, Y .: Thinkbot: Embodied instruction following with thought chain reasoning. arXiv preprint arXiv:2312.07062 (2023) 2
-
[49]
arXiv preprint arXiv:2111.05424 (2021) 2
Lu, Y ., Hausman, K., Chebotar, Y ., Yan, M., Jang, E., Herzog, A., Xiao, T., Irpan, A., Khansari, M., Kalashnikov, D., et al.: Aw-opt: Learning robotic skills with imitation and reinforcement at scale. arXiv preprint arXiv:2111.05424 (2021) 2
-
[50]
In: IEEE International Conference on Robotics and Automation (ICRA)
Luo, J., Hu, Z., Xu, C., Tan, Y .L., Berg, J., Sharma, A., Schaal, S., Finn, C., Gupta, A., Levine, S.: Serl: A software suite for sample-efficient robotic reinforcement learning. In: IEEE International Conference on Robotics and Automation (ICRA). pp. 16961–16969. IEEE (2024) 2, 3 12
work page 2024
-
[51]
arXiv preprint arXiv:2410.21845 (2024) 2, 3
Luo, J., Xu, C., Wu, J., Levine, S.: Precise and dexterous robotic manipulation via human-in- the-loop reinforcement learning. arXiv preprint arXiv:2410.21845 (2024) 2, 3
-
[52]
In: Conference on Robot Learning (CoRL)
Mandlekar, A., Zhu, Y ., Garg, A., Booher, J., Spero, M., Tung, A., Gao, J., Emmons, J., Gupta, A., Orbay, E., et al.: Roboturk: A crowdsourcing platform for robotic skill learning through imitation. In: Conference on Robot Learning (CoRL). pp. 879–893. PMLR (2018) 1
work page 2018
-
[53] Moritz, P., Nishihara, R., Wang, S., Tumanov, A., Liaw, R., Liang, E., Elibol, M., Yang, Z., Paul, W., Jordan, M.I., et al.: Ray: A distributed framework for emerging AI applications. In: 13th USENIX Symposium on Operating Systems Design and Implementation (OSDI 18). pp. 561–577 (2018)
-
[54] Nair, A., Gupta, A., Dalal, M., Levine, S.: AWAC: Accelerating online reinforcement learning with offline datasets. arXiv preprint arXiv:2006.09359 (2020)
-
[55] Oquab, M., Darcet, T., Moutakanni, T., Vo, H., Szafraniec, M., Khalidov, V., Fernandez, P., Haziza, D., Massa, F., El-Nouby, A., et al.: DINOv2: Learning robust visual features without supervision. arXiv preprint arXiv:2304.07193 (2023)
-
[56] Ouyang, L., Wu, J., Jiang, X., Almeida, D., Wainwright, C.L., Mishkin, P., Zhang, C., Agarwal, S., Slama, K., Ray, A., Schulman, J., Hilton, J., Kelton, F., Miller, L., Simens, M., Askell, A., Welinder, P., Christiano, P., Leike, J., Lowe, R.: Training language models to follow instructions with human feedback. arXiv preprint arXiv:2203.02155 (2022)
-
[57] Pertsch, K., Stachowicz, K., Ichter, B., Driess, D., Nair, S., Vuong, Q., Mees, O., Finn, C., Levine, S.: FAST: Efficient action tokenization for vision-language-action models. arXiv preprint arXiv:2501.09747 (2025)
-
[58] Pinto, L., Gupta, A.: Supersizing self-supervision: Learning to grasp from 50k tries and 700 robot hours. In: IEEE International Conference on Robotics and Automation (ICRA). pp. 3406–3413. IEEE (2016)
-
[59] Rafailov, R., Sharma, A., Mitchell, E., Manning, C.D., Ermon, S., Finn, C.: Direct preference optimization: Your language model is secretly a reward model. In: Proceedings of Advances in Neural Information Processing Systems (NeurIPS) (2023)
-
[60] Rajeswaran, A., Kumar, V., Gupta, A., Vezzani, G., Schulman, J., Todorov, E., Levine, S.: Learning complex dexterous manipulation with deep reinforcement learning and demonstrations. arXiv preprint arXiv:1709.10087 (2017)
-
[61] Schulman, J., Moritz, P., Levine, S., Jordan, M., Abbeel, P.: High-dimensional continuous control using generalized advantage estimation. arXiv preprint arXiv:1506.02438 (2015)
-
[62] Schulman, J., Wolski, F., Dhariwal, P., Radford, A., Klimov, O.: Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347 (2017)
-
[63] Silver, D., Sutton, R.S.: Welcome to the era of experience. Google AI (2025)
-
[64] Snell, C., Lee, J., Xu, K., Kumar, A.: Scaling LLM test-time compute optimally can be more effective than scaling model parameters. arXiv preprint arXiv:2408.03314 (2024)
-
[65] Taiga, A.A., Agarwal, R., Farebrother, J., Courville, A., Bellemare, M.G.: Investigating multi-task pretraining and generalization in reinforcement learning. In: Proceedings of International Conference on Learning Representations (ICLR) (2023)
-
[66] Tang, C., Abbatematteo, B., Hu, J., Chandra, R., Martín-Martín, R., Stone, P.: Deep reinforcement learning for robotics: A survey of real-world successes. arXiv preprint arXiv:2408.03539 (2024)
-
[67] Taylor, M.E., Stone, P.: Transfer learning for reinforcement learning domains: A survey. Journal of Machine Learning Research (JMLR) 10(7) (2009)
-
[68] Team, O.M., Ghosh, D., Walke, H., Pertsch, K., Black, K., Mees, O., Dasari, S., Hejna, J., Kreiman, T., Xu, C., et al.: Octo: An open-source generalist robot policy. arXiv preprint arXiv:2405.12213 (2024)
-
[69] Touvron, H., Martin, L., Stone, K., Albert, P., Almahairi, A., Babaei, Y., Bashlykov, N., Batra, S., Bhargava, P., Bhosale, S., et al.: Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288 (2023)
-
[70] Uchendu, I., Xiao, T., Lu, Y., Zhu, B., Yan, M., Simon, J., Bennice, M., Fu, C., Ma, C., Jiao, J., et al.: Jump-start reinforcement learning. In: Proceedings of International Conference on Machine Learning (ICML). pp. 34556–34583. PMLR (2023)
-
[71] Vecerik, M., Hester, T., Scholz, J., Wang, F., Pietquin, O., Piot, B., Heess, N., Rothörl, T., Lampe, T., Riedmiller, M.: Leveraging demonstrations for deep reinforcement learning on robotics problems with sparse rewards. arXiv preprint arXiv:1707.08817 (2017)
-
[72] Walke, H., Black, K., Lee, A., Kim, M.J., Du, M., Zheng, C., Zhao, T., Hansen-Estruch, P., Vuong, Q., He, A., Myers, V., Fang, K., Finn, C., Levine, S.: BridgeData V2: A dataset for robot learning at scale (2023)
-
[73] Wang, Y., Zhang, H., Tang, Y., Liu, Y., Feng, J., Dai, J., Jin, X.: Hierarchical memory for long video QA. arXiv preprint arXiv:2407.00603 (2024)
-
[74] Wang, Y., Zhang, H., Tian, J., Tang, Y.: Ponder & Press: Advancing visual GUI agent towards general computer control. arXiv preprint arXiv:2412.01268 (2024)
-
[75] Wang, Z., Wang, K., Wang, Q., Zhang, P., Li, L., Yang, Z., Yu, K., Nguyen, M.N., Liu, L., Gottlieb, E., et al.: RAGEN: Understanding self-evolution in LLM agents via multi-turn reinforcement learning. arXiv preprint arXiv:2504.20073 (2025)
-
[76]
-
[77] Xing, J., Romero, A., Bauersfeld, L., Scaramuzza, D.: Bootstrapping reinforcement learning with imitation for vision-based agile flight. arXiv preprint arXiv:2403.12203 (2024)
-
[78] Xu, C., Li, Q., Luo, J., Levine, S.: RLDG: Robotic generalist policy distillation via reinforcement learning. arXiv preprint arXiv:2412.09858 (2024)
-
[79] Yang, A., Yang, B., Zhang, B., Hui, B., Zheng, B., Yu, B., Li, C., Liu, D., Huang, F., Wei, H., et al.: Qwen2.5 technical report. arXiv preprint arXiv:2412.15115 (2024)
-
[80] Ye, X., Gan, Y., Ge, Y., Zhang, X.P., Tang, Y.: ATP-LLaVA: Adaptive token pruning for large vision language models. arXiv preprint arXiv:2412.00447 (2024)