SimpleVLA-RL: Scaling VLA Training via Reinforcement Learning
Pith reviewed 2026-05-15 07:55 UTC · model grok-4.3
The pith
Reinforcement learning scales vision-language-action model training beyond supervised fine-tuning
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
SimpleVLA-RL builds an RL framework on top of existing VLA training by adding VLA-specific trajectory sampling, scalable parallelization, multi-environment rendering, and optimized loss computation. Applied to a base policy (OpenVLA-OFT in the paper), it reaches state-of-the-art results on manipulation benchmarks, outperforms prior models on further task suites through exploration-enhancing strategies, surpasses supervised fine-tuning on real-world tasks, and reduces dependence on large human-demonstration datasets, while revealing a "pushcut" phenomenon in which the policy discovers previously unseen action patterns.
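The machinery behind this claim is, at heart, an on-policy loop: roll the policy out in many rendered environments in parallel, score each trajectory with a sparse task-success signal, normalize within the group, and update toward the higher-scoring rollouts. The sketch below is a minimal toy illustration of that loop under those assumptions, using a group-normalized advantage as one common choice; the environment, policy parameterization, and reward are stand-ins, not the paper's veRL-based implementation.

```python
# Minimal toy sketch of the on-policy loop the claim rests on: sample
# trajectories in parallel rendered environments, score them with a sparse
# task-success reward, normalize within the group, and take a policy-gradient
# step. Everything here (environment, policy, reward) is an illustrative
# stand-in, not the paper's veRL-based implementation.
import numpy as np

rng = np.random.default_rng(0)
NUM_ACTIONS, HORIZON = 4, 8


def softmax(x: np.ndarray) -> np.ndarray:
    z = np.exp(x - x.max())
    return z / z.sum()


class ToyEnv:
    """Stand-in for one rendered manipulation environment."""

    def rollout(self, logits: np.ndarray) -> tuple[list[int], float]:
        probs = softmax(logits)
        actions = [int(rng.choice(NUM_ACTIONS, p=probs)) for _ in range(HORIZON)]
        # Sparse outcome reward: success iff the "useful" action 0 dominates.
        return actions, float(actions.count(0) >= HORIZON // 2)


def train(num_iters: int = 60, group_size: int = 8, lr: float = 0.5) -> np.ndarray:
    logits = np.zeros(NUM_ACTIONS)                 # toy "policy parameters"
    envs = [ToyEnv() for _ in range(group_size)]   # parallel environments
    for _ in range(num_iters):
        rollouts = [env.rollout(logits) for env in envs]   # trajectory sampling
        rewards = np.array([r for _, r in rollouts])
        adv = rewards - rewards.mean()                     # group-relative advantage
        if rewards.std() > 0:
            adv = adv / rewards.std()
        grad = np.zeros_like(logits)
        for (actions, _), a in zip(rollouts, adv):
            for act in actions:                            # REINFORCE-style credit
                grad += a * (np.eye(NUM_ACTIONS)[act] - softmax(logits))
        logits += lr * grad / (group_size * HORIZON)
    return logits


if __name__ == "__main__":
    print("learned action preferences:", np.round(softmax(train()), 3))
```

The toy only fixes the shape of the loop; the VLA-specific trajectory sampling, rendering parallelism, and loss details named in the claim live inside the parts stubbed out here.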
What carries the argument
VLA-specific trajectory sampling paired with multi-environment rendering and exploration-enhancing strategies inside the reinforcement learning loop
If this is right
- Achieves state-of-the-art performance on standard robotic manipulation benchmarks
- Outperforms other VLA models on multiple task evaluation suites
- Surpasses supervised fine-tuning results in real-world robotic tasks
- Reduces reliance on large-scale human-operated demonstration trajectories
- Enables the policy to discover new action patterns beyond the initial training data
Where Pith is reading between the lines
- The same RL adaptations could extend to other long-horizon action models that currently rely on supervised fine-tuning
- The pushcut phenomenon points to RL's potential for uncovering novel strategies that pure imitation learning would miss
- Further scaling may allow robots to learn complex sequences with minimal new human data collection
- This could shift emphasis in robotics from data gathering toward efficient exploration during training
Load-bearing premise
The introduced trajectory sampling, multi-environment rendering, and exploration strategies remain stable and effective across different base VLA models and real-world distribution shifts without extensive additional tuning.
What would settle it
Applying the framework to a new base VLA model in a previously unseen real-world environment and finding no improvement, or outright degradation, relative to supervised fine-tuning would disprove the claim of reliable scaling and generalization.
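A minimal sketch of how that settling experiment could be scored, assuming only per-episode binary success logs from the SFT baseline and the RL-trained policy in the unseen environment; the outcome lists are hypothetical and the two-proportion z-test is one conventional choice, not a protocol from the paper.

```python
# Sketch of scoring the settling experiment: compare per-episode task success
# of an RL-finetuned policy against its SFT baseline in a held-out environment.
# The outcome lists are hypothetical; the z-test is one conventional choice.
import math


def success_rate(outcomes: list[int]) -> float:
    return sum(outcomes) / len(outcomes)


def two_proportion_z(outcomes_a: list[int], outcomes_b: list[int]) -> float:
    """Z statistic for the difference in success rates between two policies."""
    n_a, n_b = len(outcomes_a), len(outcomes_b)
    p_a, p_b = success_rate(outcomes_a), success_rate(outcomes_b)
    p_pool = (sum(outcomes_a) + sum(outcomes_b)) / (n_a + n_b)
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    return (p_a - p_b) / se if se > 0 else 0.0


# Hypothetical per-episode outcomes (1 = success) in the unseen environment.
sft_runs = [1, 0, 0, 1, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 1]
rl_runs = [1, 1, 0, 1, 1, 0, 1, 1, 1, 0, 1, 1, 0, 1, 1, 1, 0, 1, 1, 1]

z = two_proportion_z(rl_runs, sft_runs)
print(f"SFT {success_rate(sft_runs):.2f} vs RL {success_rate(rl_runs):.2f}, z = {z:.2f}")
# No significant, repeatable gain across environments and base models would
# undercut the scaling and generalization claim.
```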
original abstract
Vision-Language-Action (VLA) models have recently emerged as a powerful paradigm for robotic manipulation. Despite substantial progress enabled by large-scale pretraining and supervised fine-tuning (SFT), these models face two fundamental challenges: (i) the scarcity and high cost of large-scale human-operated robotic trajectories required for SFT scaling, and (ii) limited generalization to tasks involving distribution shift. Recent breakthroughs in Large Reasoning Models (LRMs) demonstrate that reinforcement learning (RL) can dramatically enhance step-by-step reasoning capabilities, raising a natural question: Can RL similarly improve the long-horizon step-by-step action planning of VLA? In this work, we introduce SimpleVLA-RL, an efficient RL framework tailored for VLA models. Building upon veRL, we introduce VLA-specific trajectory sampling, scalable parallelization, multi-environment rendering, and optimized loss computation. When applied to OpenVLA-OFT, SimpleVLA-RL achieves SoTA performance on LIBERO and even outperforms π₀ on RoboTwin 1.0 & 2.0 with the exploration-enhancing strategies we introduce. SimpleVLA-RL not only reduces dependence on large-scale data and enables robust generalization, but also remarkably surpasses SFT in real-world tasks. Moreover, we identify a novel phenomenon "pushcut" during RL training, wherein the policy discovers previously unseen patterns beyond those seen in the previous training process. Github: https://github.com/PRIME-RL/SimpleVLA-RL
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces SimpleVLA-RL, an RL framework for Vision-Language-Action models built on veRL. It adds VLA-specific trajectory sampling, scalable parallelization, multi-environment rendering, and optimized loss computation, along with exploration-enhancing strategies. Applied to OpenVLA-OFT, the method claims state-of-the-art results on LIBERO, outperformance of π₀ on RoboTwin 1.0/2.0, and superior real-world performance over supervised fine-tuning, while reducing reliance on large-scale data and improving generalization. A novel 'pushcut' phenomenon during training is also identified.
Significance. If the empirical claims hold after additional validation, the work would be significant for the VLA and robotics communities. It provides evidence that RL can scale VLA training more efficiently than SFT alone, addressing data scarcity and generalization challenges. The open-sourced code and identification of the 'pushcut' phenomenon offer concrete starting points for further research on policy discovery in long-horizon robotic tasks.
major comments (4)
- [§4] §4 (Experimental Evaluation): Benchmark results on LIBERO and RoboTwin are reported without error bars, standard deviations, or multi-seed statistics. This makes it impossible to assess whether the claimed SoTA margins and outperformance of π₀ are statistically reliable or sensitive to training stochasticity.
- [§4.3] §4.3 (Ablations): No ablation studies quantify the individual or combined contributions of the VLA-specific trajectory sampling, multi-environment rendering, and exploration-enhancing strategies. These components are central to the claimed data-efficiency and generalization benefits, yet their necessity is not demonstrated.
- [§5] §5 (Real-World Experiments): Real-world results are shown on a narrow task set without quantified distribution-shift metrics, failure-case breakdowns, or comparison to the scale of SFT data used. This limits support for the claims of robust generalization and reduced data dependence.
- [§3] §3 (Method): The framework is evaluated exclusively on OpenVLA-OFT. No transfer experiments apply the identical RL stack (sampling, rendering, exploration) to other base VLA models such as π₀, leaving the generality of SimpleVLA-RL unsubstantiated.
minor comments (2)
- [Abstract / §1] The abstract and §1 should explicitly state the exact number of training trajectories or environments used in the RL phase versus the SFT baseline to make the data-efficiency claim concrete.
- [Figures in §4] Figure captions in the experimental section would benefit from including the precise hyperparameter values for the exploration strategies to aid reproducibility.
Simulated Author's Rebuttal
We thank the referee for their insightful and constructive comments. We address each major point below and describe the revisions we will incorporate to improve the manuscript.
point-by-point responses
-
Referee: [§4] §4 (Experimental Evaluation): Benchmark results on LIBERO and RoboTwin are reported without error bars, standard deviations, or multi-seed statistics. This makes it impossible to assess whether the claimed SoTA margins and outperformance of π₀ are statistically reliable or sensitive to training stochasticity.
Authors: We acknowledge this limitation. Due to the substantial computational cost of RL training for large VLA models, the current experiments were conducted with a single seed. In the revised manuscript, we will perform additional runs with multiple seeds for the primary LIBERO and RoboTwin results and report standard deviations along with error bars to establish the statistical reliability of the performance claims. revision: yes
-
Referee: [§4.3] §4.3 (Ablations): No ablation studies quantify the individual or combined contributions of the VLA-specific trajectory sampling, multi-environment rendering, and exploration-enhancing strategies. These components are central to the claimed data-efficiency and generalization benefits, yet their necessity is not demonstrated.
Authors: We agree that explicit ablations are required to substantiate the contribution of each proposed component. We will add a dedicated ablation section in the revision that isolates and quantifies the effects of VLA-specific trajectory sampling, multi-environment rendering, and the exploration-enhancing strategies on both data efficiency and generalization performance (one way to lay out such a grid is sketched after these responses). revision: yes
-
Referee: [§5] §5 (Real-World Experiments): Real-world results are shown on a narrow task set without quantified distribution-shift metrics, failure-case breakdowns, or comparison to the scale of SFT data used. This limits support for the claims of robust generalization and reduced data dependence.
Authors: We concur that additional analysis would strengthen the real-world claims. In the revised version, we will expand §5 to include failure-case breakdowns, quantified distribution-shift metrics where measurable, and explicit comparisons of the data volume used in our RL approach versus the SFT baselines to better support the reported generalization and data-efficiency benefits. revision: yes
-
Referee: [§3] §3 (Method): The framework is evaluated exclusively on OpenVLA-OFT. No transfer experiments apply the identical RL stack (sampling, rendering, exploration) to other base VLA models such as π₀, leaving the generality of SimpleVLA-RL unsubstantiated.
Authors: The referee correctly identifies that transfer experiments would further demonstrate generality. While the framework is constructed modularly on veRL with VLA-specific adaptations intended to be model-agnostic, we only report results on OpenVLA-OFT. In the revision, we will add a discussion section detailing how the sampling, rendering, and exploration components can be applied to other VLA models such as π₀ and will note this as a key direction for future work. We will also include preliminary transfer results if additional compute becomes available. revision: partial
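A hypothetical layout for the ablation grid promised above, assuming the three components can be toggled independently and scored by a shared evaluation routine; the component names and the evaluate stub are placeholders, not the paper's API.

```python
# Hypothetical layout for the promised ablation: toggle each of the three
# components independently and score every combination with a shared
# evaluation routine. Component names and the `evaluate` stub are placeholders,
# not the paper's API; a real study would train and benchmark each setting.
from itertools import product

COMPONENTS = ("vla_trajectory_sampling", "multi_env_rendering", "exploration_strategies")


def evaluate(config: dict[str, bool]) -> float:
    """Placeholder for a LIBERO/RoboTwin evaluation run under `config`."""
    return 0.5 + 0.1 * sum(config.values())   # toy score so the sketch runs


for flags in product([False, True], repeat=len(COMPONENTS)):
    config = dict(zip(COMPONENTS, flags))
    enabled = [name for name, on in config.items() if on] or ["none"]
    print(f"{', '.join(enabled):<70s} success ≈ {evaluate(config):.2f}")
```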
Circularity Check
No circularity: empirical results on external benchmarks
full rationale
The paper introduces an RL framework (SimpleVLA-RL) with VLA-specific components and reports performance gains on independent public benchmarks (LIBERO, RoboTwin) plus real-world tasks. No derivation chain, equation, or prediction reduces to a fitted parameter, self-definition, or self-citation loop by construction. Claims rest on external evaluation protocols rather than internal equivalence.
Axiom & Free-Parameter Ledger
free parameters (1)
- exploration strategy hyperparameters (an illustrative configuration is sketched after this ledger)
axioms (1)
- domain assumption: Reinforcement learning can enhance step-by-step action planning in VLA models analogously to its effect on reasoning in large language models
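For concreteness, here is an illustrative container for the kind of knobs the free-parameter entry refers to; every field name and default below is an assumption for illustration, not a value reported in the paper.

```python
# Illustrative container for the exploration-strategy knobs the ledger treats
# as free parameters. Field names and defaults are assumptions for
# illustration, not settings reported in the paper.
from dataclasses import dataclass


@dataclass
class ExplorationConfig:
    sampling_temperature: float = 1.2   # wider action sampling during rollouts
    rollouts_per_task: int = 8          # group size for trajectory sampling
    clip_ratio_low: float = 0.2         # PPO-style lower clipping bound
    clip_ratio_high: float = 0.3        # looser upper bound keeps rare actions alive
    max_episode_steps: int = 200        # rollout horizon in the simulator


print(ExplorationConfig())
```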
Forward citations
Cited by 21 Pith papers
-
NoiseGate: Learning Per-Latent Timestep Schedules as Information Gating in World Action Models
NoiseGate learns per-latent timestep schedules as an information-gating policy in diffusion-based world action models, yielding consistent gains on RoboTwin manipulation tasks.
-
D-VLA: A High-Concurrency Distributed Asynchronous Reinforcement Learning Framework for Vision-Language-Action Models
D-VLA uses plane decoupling and a swimlane pipeline to deliver higher throughput and linear speedup than prior RL frameworks when training billion- and trillion-parameter VLA models on benchmarks like LIBERO.
-
D-VLA: A High-Concurrency Distributed Asynchronous Reinforcement Learning Framework for Vision-Language-Action Models
D-VLA introduces plane decoupling and a swimlane asynchronous pipeline to achieve high-concurrency RL training and linear scalability for billion- to trillion-parameter vision-language-action models.
-
Learning to See What You Need: Gaze Attention for Multimodal Large Language Models
Gaze Attention groups visual embeddings into selectable regions and dynamically restricts attention to task-relevant ones, matching dense baselines with up to 90% fewer visual KV entries via added context tokens.
-
Reinforcing VLAs in Task-Agnostic World Models
RAW-Dream lets VLAs learn new tasks in zero-shot imagination by using a world model pre-trained only on task-free behaviors and an unmodified VLM to supply rewards, with dual-noise verification to limit hallucinations.
-
RankQ: Offline-to-Online Reinforcement Learning via Self-Supervised Action Ranking
RankQ adds a self-supervised ranking loss to Q-learning to learn structured action orderings, yielding competitive or better performance than prior methods on D4RL benchmarks and large gains in vision-based robot fine-tuning.
-
Unified Noise Steering for Efficient Human-Guided VLA Adaptation
UniSteer unifies human corrective actions and noise-space RL for VLA adaptation by inverting actions to noise targets, raising success rates from 20% to 90% in 66 minutes across four real-world manipulation tasks.
-
Learning while Deploying: Fleet-Scale Reinforcement Learning for Generalist Robot Policies
Fleet-scale RL framework improves a single generalist VLA policy from deployment data to 95% average success on eight real-world manipulation tasks with 16 dual-arm robots.
-
LaST-R1: Reinforcing Robotic Manipulation via Adaptive Physical Latent Reasoning
LaST-R1 reaches 99.8% average success on the LIBERO benchmark using one-shot warm-up plus LAPO reinforcement learning on latent physical reasoning, with up to 44% real-world gains on complex single- and dual-arm tasks.
-
LaST-R1: Reinforcing Robotic Manipulation via Adaptive Physical Latent Reasoning
LaST-R1 introduces a RL post-training method called LAPO that optimizes latent Chain-of-Thought reasoning in vision-language-action models, yielding 99.9% success on LIBERO and up to 22.5% real-world gains.
-
GS-Playground: A High-Throughput Photorealistic Simulator for Vision-Informed Robot Learning
GS-Playground delivers a high-throughput photorealistic simulator for vision-informed robot learning via parallel physics integrated with batch 3D Gaussian Splatting at 10^4 FPS and an automated Real2Sim workflow for ...
-
AsyncShield: A Plug-and-Play Edge Adapter for Asynchronous Cloud-based VLA Navigation
AsyncShield restores VLA geometric intent from latency via kinematic pose mapping and uses PPO-Lagrangian to balance tracking with LiDAR safety constraints in a plug-and-play module.
-
Breaking Lock-In: Preserving Steerability under Low-Data VLA Post-Training
DeLock mitigates lock-in in low-data VLA post-training via visual grounding preservation and test-time contrastive prompt guidance, outperforming baselines across eight evaluations while matching data-heavy generalist...
-
RL Token: Bootstrapping Online RL with Vision-Language-Action Models
RL Token enables sample-efficient online RL fine-tuning of large VLAs, delivering up to 3x speed gains and higher success rates on real-robot manipulation tasks within minutes to hours.
-
CorridorVLA: Explicit Spatial Constraints for Generative Action Heads via Sparse Anchors
CorridorVLA improves VLA models by using predicted sparse anchors to impose explicit spatial corridors on action trajectories, yielding 3.4-12.4% success rate gains on LIBERO-Plus with GR00T-Corr reaching 83.21%.
-
Test-Time Perturbation Learning with Delayed Feedback for Vision-Language-Action Models
PDF improves VLA success rates on LIBERO and Atari by applying test-time perturbation learning with delayed feedback to correct trajectory overfitting and overconfidence.
-
ST-$\pi$: Structured SpatioTemporal VLA for Robotic Manipulation
ST-π structures VLA models by having a spatiotemporal VLM produce causally ordered chunk-level prompts that guide a dual-generator action expert to jointly handle spatial and temporal control in robotic manipulation.
-
$\pi^{*}_{0.6}$: a VLA That Learns From Experience
RECAP enables a generalist VLA to self-improve via advantage-conditioned RL on mixed real-world data, more than doubling throughput and halving failure rates on hard manipulation tasks.
-
TimeRFT: Stimulating Generalizable Time Series Forecasting for TSFMs via Reinforcement Finetuning
TimeRFT applies reinforcement learning with multi-faceted step-wise rewards and informative sample selection to improve generalization and accuracy in TSFM adaptation beyond supervised fine-tuning.
-
Jump-Start Reinforcement Learning with Vision-Language-Action Regularization
VLAJS augments PPO with sparse annealed VLA guidance through directional regularization to cut required interactions by over 50% on manipulation tasks and enable zero-shot sim-to-real transfer.
-
Causal World Modeling for Robot Control
LingBot-VA combines video world modeling with policy learning via Mixture-of-Transformers, closed-loop rollouts, and asynchronous inference to improve robot manipulation in simulation and real settings.
Reference graph
Works this paper leans on
-
[1]
OpenVLA: An Open-Source Vision-Language-Action Model
Moo Jin Kim, Karl Pertsch, Siddharth Karamcheti, Ted Xiao, Ashwin Balakrishna, Suraj Nair, Rafael Rafailov, Ethan Foster, Grace Lam, Pannag Sanketi, et al. Openvla: An open-source vision-language-action model. arXiv preprint arXiv:2406.09246,
-
[2]
Yifan Zhong, Fengshuo Bai, Shaofei Cai, Xuchuan Huang, Zhang Chen, Xiaowei Zhang, Yuanfei Wang, Shaoyang Guo, Tianrui Guan, Ka Nam Lui, et al. A survey on vision-language-action models: An action tokenization perspective.arXiv preprint arXiv:2507.01925,
-
[3]
Vision-Language-Action Models: Concepts, Progress, Applications and Challenges
Ranjan Sapkota, Yang Cao, Konstantinos I Roumeliotis, and Manoj Karkee. Vision-language-action models: Concepts, progress, applications and challenges.arXiv preprint arXiv:2505.04769,
-
[4]
What Matters in Learning from Offline Human Demonstrations for Robot Manipulation
Ajay Mandlekar, Danfei Xu, Josiah Wong, Soroush Nasiriany, Chen Wang, Rohun Kulkarni, Li Fei-Fei, Silvio Savarese, Yuke Zhu, and Roberto Martín-Martín. What matters in learning from offline human demonstrations for robot manipulation.arXiv preprint arXiv:2108.03298,
-
[5]
Qingwen Bu, Jisong Cai, Li Chen, Xiuqi Cui, Yan Ding, Siyuan Feng, Shenyuan Gao, Xindong He, Xu Huang, Shu Jiang, et al. Agibot world colosseo: A large-scale manipulation platform for scalable and intelligent embodied systems.arXiv preprint arXiv:2503.06669, 2025a. Jensen Gao, Annie Xie, Ted Xiao, Chelsea Finn, and Dorsa Sadigh. Efficient data collection ...
-
[6]
Gemini Robotics: Bringing AI into the Physical World
Gemini Robotics Team, Saminda Abeyruwan, Joshua Ainslie, Jean-Baptiste Alayrac, Montserrat Gon- zalez Arenas, Travis Armstrong, Ashwin Balakrishna, Robert Baruch, Maria Bauza, Michiel Blokzijl, et al. Gemini robotics: Bringing ai into the physical world.arXiv preprint arXiv:2503.20020, 2025a. Jijia Liu, Feng Gao, Bingwen Wei, Xinlei Chen, Qingmin Liao, Yi...
-
[7]
An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, et al. Qwen3 technical report.arXiv preprint arXiv:2505.09388,
-
[8]
Eureka: Human-Level Reward Design via Coding Large Language Models
Yecheng Jason Ma, William Liang, Guanzhi Wang, De-An Huang, Osbert Bastani, Dinesh Jayaraman, Yuke Zhu, Linxi Fan, and Anima Anandkumar. Eureka: Human-level reward design via coding large language models.arXiv preprint arXiv:2310.12931,
-
[9]
HybridFlow: A Flexible and Efficient RLHF Framework
Guangming Sheng, Chi Zhang, Zilingfeng Ye, Xibin Wu, Wang Zhang, Ru Zhang, Yanghua Peng, Haibin Lin, and Chuan Wu. Hybridflow: A flexible and efficient rlhf framework.arXiv preprint arXiv: 2409.19256,
-
[10]
DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models
Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, YK Li, Y Wu, et al. Deepseekmath: Pushing the limits of mathematical reasoning in open language models.arXiv preprint arXiv:2402.03300,
-
[11]
The effect of sampling temperature on problem solving in large language models
Matthew Renze. The effect of sampling temperature on problem solving in large language models. In Findings of the Association for Computational Linguistics: EMNLP 2024, pages 7346–7356, 2024.
-
[12]
$\pi_0$: A Vision-Language-Action Flow Model for General Robot Control
Kevin Black, Noah Brown, Danny Driess, Adnan Esmail, Michael Equi, Chelsea Finn, Niccolo Fusai, Lachy Groom, Karol Hausman, Brian Ichter, et al. pi_0: A vision-language-action flow model for general robot control.arXiv preprint arXiv:2410.24164,
-
[13]
RDT-1B: a Diffusion Foundation Model for Bimanual Manipulation
Songming Liu, Lingxuan Wu, Bangguo Li, Hengkai Tan, Huayu Chen, Zhengyi Wang, Ke Xu, Hang Su, and Jun Zhu. Rdt-1b: a diffusion foundation model for bimanual manipulation.arXiv preprint arXiv:2410.07864,
-
[14]
GR-3 Technical Report
Chilam Cheang, Sijin Chen, Zhongren Cui, Yingdong Hu, Liqun Huang, Tao Kong, Hang Li, Yifeng Li, Yuxiao Liu, Xiao Ma, et al. Gr-3 technical report.arXiv preprint arXiv:2507.15493,
-
[15]
Fine-Tuning Vision-Language-Action Models: Optimizing Speed and Success
Moo Jin Kim, Chelsea Finn, and Percy Liang. Fine-tuning vision-language-action models: Optimizing speed and success.arXiv preprint arXiv:2502.19645,
-
[16]
DAPO: An Open-Source LLM Reinforcement Learning System at Scale
Qiying Yu, Zheng Zhang, Ruofei Zhu, Yufeng Yuan, Xiaochen Zuo, Yu Yue, Tiantian Fan, Gaohong Liu, Lingjun Liu, Xin Liu, et al. Dapo: An open-source llm reinforcement learning system at scale. arXiv preprint arXiv:2503.14476,
-
[17]
Mingjie Liu, Shizhe Diao, Ximing Lu, Jian Hu, Xin Dong, Yejin Choi, Jan Kautz, and Yi Dong. Prorl: Prolonged reinforcement learning expands reasoning boundaries in large language models.arXiv preprint arXiv:2505.24864, 2025b. Zihan Liu, Zhuolin Yang, Yang Chen, Chankyu Lee, Mohammad Shoeybi, Bryan Catanzaro, and Wei Ping. Acereason-nemotron 1.1: Advancing...
-
[18]
Process Reinforcement through Implicit Rewards
URL https://hkunlp.github.io/blog/2025/Polaris. Ganqu Cui, Lifan Yuan, Zefan Wang, Hanbin Wang, Wendi Li, Bingxiang He, Yuchen Fan, Tianyu Yu, Qixin Xu, Weize Chen, et al. Process reinforcement through implicit rewards. arXiv preprint arXiv:2502.01456, 2025a. Taiwei Shi, Yiyang Wu, Linxin Song, Tianyi Zhou, and Jieyu Zhao. Efficient reinforcement finetunin...
-
[19]
Proximal Policy Optimization Algorithms
John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms.arXiv preprint arXiv:1707.06347,
-
[20]
Mengqi Liao, Xiangyu Xi, Ruinian Chen, Jia Leng, Yangen Hu, Ke Zeng, Shuai Liu, and Huaiyu Wan. Enhancing efficiency and exploration in reinforcement learning for llms.arXiv preprint arXiv:2505.18573,
-
[21]
Tianxing Chen, Zanxin Chen, Baijun Chen, Zijian Cai, Yibin Liu, Qiwei Liang, Zixuan Li, Xianliang Lin, Yiheng Ge, Zhenyu Gu, et al. Robotwin 2.0: A scalable data generator and benchmark with strong domain randomization for robust bimanual robotic manipulation.arXiv preprint arXiv:2506.18088, 2025a. Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Am...
-
[22]
UniVLA: Learning to Act Anywhere with Task-centric Latent Actions
Qingwen Bu, Yanting Yang, Jisong Cai, Shenyuan Gao, Guanghui Ren, Maoqing Yao, Ping Luo, and Hongyang Li. Univla: Learning to act anywhere with task-centric latent actions. arXiv preprint arXiv:2505.06111, 2025b. Karl Pertsch, Kyle Stachowicz, Brian Ichter, Danny Driess, Suraj Nair, Quan Vuon...
-
[23]
Chia-Yu Hung, Qi Sun, Pengfei Hong, Amir Zadeh, Chuan Li, U Tan, Navonil Majumder, Soujanya Poria, et al. Nora: A small open-sourced generalist vision language action model for embodied tasks.arXiv preprint arXiv:2504.19854,
-
[24]
Octo: An Open-Source Generalist Robot Policy
Octo Model Team, Dibya Ghosh, Homer Walke, Karl Pertsch, Kevin Black, Oier Mees, Sudeep Dasari, Joey Hejna, Tobias Kreiman, Charles Xu, et al. Octo: An open-source generalist robot policy.arXiv preprint arXiv:2405.12213,
-
[25]
3D Diffusion Policy: Generalizable Visuomotor Policy Learning via Simple 3D Representations
Yanjie Ze, Gu Zhang, Kangning Zhang, Chenyuan Hu, Muhan Wang, and Huazhe Xu. 3d diffusion policy: Generalizable visuomotor policy learning via simple 3d representations.arXiv preprint arXiv:2403.03954,
-
[26]
$\pi_{0.5}$: a Vision-Language-Action Model with Open-World Generalization
Physical Intelligence, Kevin Black, Noah Brown, James Darpinian, Karan Dhabalia, Danny Driess, Adnan Esmail, Michael Equi, Chelsea Finn, Niccolo Fusai, et al. pi_{0.5}: a vision-language-action model with open-world generalization.arXiv preprint arXiv:2504.16054,
-
[27]
Training Compute-Optimal Large Language Models
Jordan Hoffmann, Sebastian Borgeaud, Arthur Mensch, Elena Buchatskaya, Trevor Cai, Eliza Ruther- ford, Diego de Las Casas, Lisa Anne Hendricks, Johannes Welbl, Aidan Clark, et al. Training compute-optimal large language models.arXiv preprint arXiv:2203.15556,
-
[28]
Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. Gpt-4 technical report. arXiv preprint arXiv:2303.08774,
-
[29]
H-RDT: Human Manipulation Enhanced Bimanual Robotic Manipulation
Hongzhe Bi, Lingxuan Wu, Tianwei Lin, Hengkai Tan, Zhizhong Su, Hang Su, and Jun Zhu. H-rdt: Human manipulation enhanced bimanual robotic manipulation.arXiv preprint arXiv:2507.23523,
-
[30]
Aaron Jaech, Adam Kalai, Adam Lerer, Adam Richardson, Ahmed El-Kishky, Aiden Low, Alec Hel- yar, Aleksander Madry, Alex Beutel, Alex Carney, et al. Openai o1 system card.arXiv preprint arXiv:2412.16720,
-
[31]
Understanding R1-Zero-Like Training: A Critical Perspective
Zichen Liu, Changyu Chen, Wenjun Li, Penghui Qi, Tianyu Pang, Chao Du, Wee Sun Lee, and Min Lin. Understanding r1-zero-like training: A critical perspective. arXiv preprint arXiv:2503.20783, 2025d. Yuxin Zuo, Kaiyan Zhang, Li Sheng, Shang Qu, Ganqu Cui, Xuekai Zhu, Haozhan Li, Yuchen Zhang, Xinwei Long, Ermo Hua, et al. Ttrl: Test-time reinforcement learning. arXiv prepr...
-
[32]
The Entropy Mechanism of Reinforcement Learning for Reasoning Language Models
Ganqu Cui, Yuchen Zhang, Jiacheng Chen, Lifan Yuan, Zhi Wang, Yuxin Zuo, Haozhan Li, Yuchen Fan, Huayu Chen, Weize Chen, et al. The entropy mechanism of reinforcement learning for reasoning language models. arXiv preprint arXiv:2505.22617, 2025b. Yecheng Jason Ma, Shagun Sodhani, Dinesh Jayar...
-
[33]
Zhecheng Yuan, Tianming Wei, Shuiqi Cheng, Gu Zhang, Yuanpei Chen, and Huazhe Xu. Learning to manipulate anywhere: A visual generalizable framework for reinforcement learning.arXiv preprint arXiv:2407.15815,
-
[34]
Robotic Control via Embodied Chain-of-Thought Reasoning
Michał Zawalski, William Chen, Karl Pertsch, Oier Mees, Chelsea Finn, and Sergey Levine. Robotic control via embodied chain-of-thought reasoning.arXiv preprint arXiv:2407.08693,
-
[35]
Training Strategies for Efficient Embodied Reasoning
William Chen, Suneel Belkhale, Suvir Mirchandani, Oier Mees, Danny Driess, Karl Pertsch, and Sergey Levine. Training strategies for efficient embodied reasoning.arXiv preprint arXiv:2505.08243, 2025b. Yucheng Hu, Yanjiang Guo, Pengchao Wang, Xiaoyu Chen, Yen-Jen Wang, Jianke Zhang, Koushil Sreenath, Chaochao Lu, and Jianyu Chen. Video prediction policy: A...
-
[36]
Haoran Geng, Feishi Wang, Songlin Wei, Yuyang Li, Bangjun Wang, Boshi An, Charlie Tianyue Cheng, Haozhe Lou, Peihao Li, Yen-Jen Wang, et al. Roboverse: Towards a unified platform, dataset and benchmark for scalable and generalizable robot learning.arXiv preprint arXiv:2504.18904,
-
[37]
Zhenyu Jiang, Yuqi Xie, Kevin Lin, Zhenjia Xu, Weikang Wan, Ajay Mandlekar, Linxi Fan, and Yuke Zhu. Dexmimicgen: Automated data generation for bimanual dexterous manipulation via imitation learning.arXiv preprint arXiv:2410.24185,
-
[38]
GRAPE: Generalizing Robot Policy via Preference Alignment
Zijian Zhang, Kaiyuan Zheng, Zhaorun Chen, Joel Jang, Yi Li, Siwei Han, Chaoqi Wang, Mingyu Ding, Dieter Fox, and Huaxiu Yao. Grape: Generalizing robot policy via preference alignment.arXiv preprint arXiv:2411.19309,
-
[39]
Yuhui Chen, Shuai Tian, Shugao Liu, Yingting Zhou, Haoran Li, and Dongbin Zhao. Conrft: A reinforced fine-tuning method for VLA models via consistency policy. arXiv preprint arXiv:2502.05450, 2025c. Luong Trung, Xinbo Zhang, Zhanming Jie, Peng Sun, Xiaoran Jin, and Hang Li. Reft: Reasoning with reinforced fine-tuning. In Proceedings of the 62nd Annual Meeting of the Associati...
-
[40]
Hongyin Zhang, Zifeng Zhuang, Han Zhao, Pengxiang Ding, Hongchao Lu, and Donglin Wang. Reinbot: Amplifying robot visual-language manipulation with reinforcement learning.arXiv preprint arXiv:2505.07395,
-
[41]
Yanjiang Guo, Jianke Zhang, Xiaoyu Chen, Xiang Ji, Yen-Jen Wang, Yucheng Hu, and Jianyu Chen. Improving vision-language-action model with online reinforcement learning. arXiv preprint arXiv:2501.16664, 2025b. Shuhan Tan, Kairan Dou, Yue Zhao, and Philipp Krähenbühl. Interactive post-training ...
-
[42]
Back to Basics: Revisiting REINFORCE Style Optimization for Learning from Human Feedback in LLMs
Arash Ahmadian, Chris Cremer, Matthias Gallé, Marzieh Fadaee, Julia Kreutzer, Olivier Pietquin, Ahmet Üstün, and Sara Hooker. Back to basics: Revisiting reinforce style optimization for learning from human feedback in llms.arXiv preprint arXiv:2402.14740,
-
[43]
GitHub repository. Guanxing Lu, Wenkai Guo, Chubin Zhang, Yuheng Zhou, Haonan Jiang, Zifeng Gao, Yansong Tang, and Ziwei Wang. Vla-rl: Towards masterful and general robotic manipulation with scalable reinforcement learning.arXiv preprint arXiv:2505.18719,
-
[44]
Zengjue Chen, Runliang Niu, He Kong, and Qi Wang. Tgrpo: Fine-tuning vision-language-action model via trajectory-wise group relative policy optimization.arXiv preprint arXiv:2506.08440, 2025d. Junyang Shu, Zhiwei Lin, and Yongtao Wang. Rftf: Reinforcement fine-tuning for embodied agents with temporal feedback.arXiv preprint arXiv:2505.19767,