pith. machine review for the scientific record.

arxiv: 2509.09674 · v1 · submitted 2025-09-11 · 💻 cs.RO · cs.AI · cs.CL · cs.LG

Recognition: 2 theorem links · Lean Theorem

SimpleVLA-RL: Scaling VLA Training via Reinforcement Learning


Pith reviewed 2026-05-15 07:55 UTC · model grok-4.3

classification 💻 cs.RO · cs.AI · cs.CL · cs.LG
keywords vision-language-action models · reinforcement learning · robotic manipulation · policy optimization · generalization · trajectory sampling · exploration strategies · real-world robotics

The pith

Reinforcement learning scales vision-language-action model training beyond supervised fine-tuning

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that an efficient reinforcement learning framework can train vision-language-action models to plan robotic actions more effectively than supervised fine-tuning alone. This approach uses tailored trajectory sampling and multi-environment rendering to let the models explore and improve step-by-step policies with far less human demonstration data. A reader would care because large-scale human trajectories are costly and scarce, while this method promises stronger generalization across task variations and better results on actual robots.
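The mechanism the pith describes is outcome-reward RL: a rollout either completes the task or it does not, and that binary signal alone updates the policy. As an illustration only, not the paper's implementation, the toy REINFORCE loop below shows a softmax policy over two hypothetical "action patterns" drifting away from a neutral "SFT" initialization toward the higher-success pattern; every name and number here is invented.

```python
import math
import random

random.seed(0)

# Illustrative sketch only, not the paper's SimpleVLA-RL implementation:
# a softmax policy over two hypothetical "action patterns", trained with
# REINFORCE from binary task-success rewards, as outcome-reward RL does.
logits = [0.0, 0.0]        # "SFT-initialized" policy: no preference yet
SUCCESS_P = [0.2, 0.8]     # hidden per-pattern task success probabilities
LR = 0.5

def policy_probs():
    z = [math.exp(l) for l in logits]
    s = sum(z)
    return [v / s for v in z]

for step in range(300):
    probs = policy_probs()
    a = 0 if random.random() < probs[0] else 1   # sample an action pattern
    reward = 1.0 if random.random() < SUCCESS_P[a] else 0.0
    adv = reward - 0.5                           # crude constant baseline
    for i in range(2):                           # REINFORCE logit gradient
        indicator = 1.0 if i == a else 0.0
        logits[i] += LR * adv * (indicator - probs[i])

final = policy_probs()
print(final[1])   # probability of the higher-success pattern after RL
```

The constant baseline of 0.5 is a crude variance-reduction choice for the sketch; any estimate of expected reward would serve the same role.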

Core claim

SimpleVLA-RL builds an RL framework on top of existing VLA training by adding VLA-specific trajectory sampling, scalable parallelization, multi-environment rendering, and optimized loss computation. When applied, it reaches state-of-the-art results on manipulation benchmarks, outperforms prior models on additional task suites through exploration-enhancing strategies, surpasses supervised fine-tuning on real-world tasks, and reduces dependence on large human datasets, while revealing a "pushcut" phenomenon in which the policy discovers action patterns absent from its initial training data.

What carries the argument

VLA-specific trajectory sampling paired with multi-environment rendering and exploration-enhancing strategies inside the reinforcement learning loop
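How per-rollout binary outcomes become per-trajectory learning signals is not spelled out here; one plausible sketch, assuming a GRPO-style group baseline in the spirit of the cited DeepSeekMath line of work rather than the paper's confirmed objective, normalizes rewards within the group of parallel rollouts for a single task:

```python
from statistics import mean, pstdev

def group_relative_advantages(rewards, eps=1e-8):
    """Normalize binary success rewards within one sampling group.

    Hypothetical sketch: N parallel rollouts of the same task share a
    baseline (the group mean), so successes get positive advantage and
    failures negative, GRPO-style. Not the paper's confirmed loss.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

# Example: 8 parallel-environment rollouts of one task, 3 successes.
advs = group_relative_advantages([1, 0, 1, 0, 0, 1, 0, 0])
```

If every rollout in a group succeeds (or fails), the group variance is zero and all advantages collapse to zero, so such groups contribute no gradient; the eps term only guards the division.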

If this is right

  • Achieves state-of-the-art performance on standard robotic manipulation benchmarks
  • Outperforms other VLA models on multiple task evaluation suites
  • Surpasses supervised fine-tuning results in real-world robotic tasks
  • Reduces reliance on large-scale human-operated demonstration trajectories
  • Enables the policy to discover new action patterns beyond the initial training data

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same RL adaptations could extend to other long-horizon action models that currently rely on supervised fine-tuning
  • The pushcut phenomenon points to RL's potential for uncovering novel strategies that pure imitation learning would miss
  • Further scaling may allow robots to learn complex sequences with minimal new human data collection
  • This could shift emphasis in robotics from data gathering toward efficient exploration during training

Load-bearing premise

The introduced trajectory sampling, multi-environment rendering, and exploration strategies remain stable and effective across different base VLA models and real-world distribution shifts without extensive additional tuning.

What would settle it

Applying the framework to a new base VLA model in a previously unseen real-world environment, and observing no improvement over, or a degradation relative to, supervised fine-tuning, would disprove the claim of reliable scaling and generalization.

read the original abstract

Vision-Language-Action (VLA) models have recently emerged as a powerful paradigm for robotic manipulation. Despite substantial progress enabled by large-scale pretraining and supervised fine-tuning (SFT), these models face two fundamental challenges: (i) the scarcity and high cost of large-scale human-operated robotic trajectories required for SFT scaling, and (ii) limited generalization to tasks involving distribution shift. Recent breakthroughs in Large Reasoning Models (LRMs) demonstrate that reinforcement learning (RL) can dramatically enhance step-by-step reasoning capabilities, raising a natural question: Can RL similarly improve the long-horizon step-by-step action planning of VLA? In this work, we introduce SimpleVLA-RL, an efficient RL framework tailored for VLA models. Building upon veRL, we introduce VLA-specific trajectory sampling, scalable parallelization, multi-environment rendering, and optimized loss computation. When applied to OpenVLA-OFT, SimpleVLA-RL achieves SoTA performance on LIBERO and even outperforms π₀ on RoboTwin 1.0 & 2.0 with the exploration-enhancing strategies we introduce. SimpleVLA-RL not only reduces dependence on large-scale data and enables robust generalization, but also remarkably surpasses SFT in real-world tasks. Moreover, we identify a novel phenomenon "pushcut" during RL training, wherein the policy discovers previously unseen patterns beyond those seen in the previous training process. Github: https://github.com/PRIME-RL/SimpleVLA-RL

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

4 major / 2 minor

Summary. The paper introduces SimpleVLA-RL, an RL framework for Vision-Language-Action models built on veRL. It adds VLA-specific trajectory sampling, scalable parallelization, multi-environment rendering, and optimized loss computation, along with exploration-enhancing strategies. Applied to OpenVLA-OFT, the method claims state-of-the-art results on LIBERO, outperformance of π₀ on RoboTwin 1.0/2.0, and superior real-world performance over supervised fine-tuning, while reducing reliance on large-scale data and improving generalization. A novel 'pushcut' phenomenon during training is also identified.

Significance. If the empirical claims hold after additional validation, the work would be significant for the VLA and robotics communities. It provides evidence that RL can scale VLA training more efficiently than SFT alone, addressing data scarcity and generalization challenges. The open-sourced code and identification of the 'pushcut' phenomenon offer concrete starting points for further research on policy discovery in long-horizon robotic tasks.

major comments (4)
  1. [§4] §4 (Experimental Evaluation): Benchmark results on LIBERO and RoboTwin are reported without error bars, standard deviations, or multi-seed statistics. This makes it impossible to assess whether the claimed SoTA margins and outperformance of π₀ are statistically reliable or sensitive to training stochasticity.
  2. [§4.3] §4.3 (Ablations): No ablation studies quantify the individual or combined contributions of the VLA-specific trajectory sampling, multi-environment rendering, and exploration-enhancing strategies. These components are central to the claimed data-efficiency and generalization benefits, yet their necessity is not demonstrated.
  3. [§5] §5 (Real-World Experiments): Real-world results are shown on a narrow task set without quantified distribution-shift metrics, failure-case breakdowns, or comparison to the scale of SFT data used. This limits support for the claims of robust generalization and reduced data dependence.
  4. [§3] §3 (Method): The framework is evaluated exclusively on OpenVLA-OFT. No transfer experiments apply the identical RL stack (sampling, rendering, exploration) to other base VLA models such as π₀, leaving the generality of SimpleVLA-RL unsubstantiated.
minor comments (2)
  1. [Abstract / §1] The abstract and §1 should explicitly state the exact number of training trajectories or environments used in the RL phase versus the SFT baseline to make the data-efficiency claim concrete.
  2. [Figures in §4] Figure captions in the experimental section would benefit from including the precise hyperparameter values for the exploration strategies to aid reproducibility.
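For the first major comment, the requested statistics are standard; the sketch below computes mean, sample standard deviation, and standard error across seeds, using invented placeholder success rates rather than numbers from the paper.

```python
from math import sqrt
from statistics import mean, stdev

def summarize_seeds(success_rates):
    """Mean, sample standard deviation, and standard error over seeds."""
    m = mean(success_rates)
    s = stdev(success_rates)                 # Bessel-corrected (n - 1)
    return m, s, s / sqrt(len(success_rates))

# Hypothetical per-seed benchmark success rates; illustrative values only.
m, s, se = summarize_seeds([0.91, 0.88, 0.93, 0.90])
```

Reporting mean ± standard error per benchmark would make the claimed SoTA margins directly comparable against run-to-run stochasticity.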

Simulated Author's Rebuttal

4 responses · 0 unresolved

We thank the referee for their insightful and constructive comments. We address each major point below and describe the revisions we will incorporate to improve the manuscript.

read point-by-point responses
  1. Referee: [§4] §4 (Experimental Evaluation): Benchmark results on LIBERO and RoboTwin are reported without error bars, standard deviations, or multi-seed statistics. This makes it impossible to assess whether the claimed SoTA margins and outperformance of π₀ are statistically reliable or sensitive to training stochasticity.

    Authors: We acknowledge this limitation. Due to the substantial computational cost of RL training for large VLA models, the current experiments were conducted with a single seed. In the revised manuscript, we will perform additional runs with multiple seeds for the primary LIBERO and RoboTwin results and report standard deviations along with error bars to establish the statistical reliability of the performance claims. revision: yes

  2. Referee: [§4.3] §4.3 (Ablations): No ablation studies quantify the individual or combined contributions of the VLA-specific trajectory sampling, multi-environment rendering, and exploration-enhancing strategies. These components are central to the claimed data-efficiency and generalization benefits, yet their necessity is not demonstrated.

    Authors: We agree that explicit ablations are required to substantiate the contribution of each proposed component. We will add a dedicated ablation section in the revision that isolates and quantifies the effects of VLA-specific trajectory sampling, multi-environment rendering, and the exploration-enhancing strategies on both data efficiency and generalization performance. revision: yes

  3. Referee: [§5] §5 (Real-World Experiments): Real-world results are shown on a narrow task set without quantified distribution-shift metrics, failure-case breakdowns, or comparison to the scale of SFT data used. This limits support for the claims of robust generalization and reduced data dependence.

    Authors: We concur that additional analysis would strengthen the real-world claims. In the revised version, we will expand §5 to include failure-case breakdowns, quantified distribution-shift metrics where measurable, and explicit comparisons of the data volume used in our RL approach versus the SFT baselines to better support the reported generalization and data-efficiency benefits. revision: yes

  4. Referee: [§3] §3 (Method): The framework is evaluated exclusively on OpenVLA-OFT. No transfer experiments apply the identical RL stack (sampling, rendering, exploration) to other base VLA models such as π₀, leaving the generality of SimpleVLA-RL unsubstantiated.

    Authors: The referee correctly identifies that transfer experiments would further demonstrate generality. While the framework is constructed modularly on veRL with VLA-specific adaptations intended to be model-agnostic, we only report results on OpenVLA-OFT. In the revision, we will add a discussion section detailing how the sampling, rendering, and exploration components can be applied to other VLA models such as π₀ and will note this as a key direction for future work. We will also include preliminary transfer results if additional compute becomes available. revision: partial

Circularity Check

0 steps flagged

No circularity: empirical results on external benchmarks

full rationale

The paper introduces an RL framework (SimpleVLA-RL) with VLA-specific components and reports performance gains on independent public benchmarks (LIBERO, RoboTwin) plus real-world tasks. No derivation chain, equation, or prediction reduces to a fitted parameter, self-definition, or self-citation loop by construction. Claims rest on external evaluation protocols rather than internal equivalence.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The framework rests on the standard RL assumption that reward signals can guide long-horizon action policies in VLA models, plus several implementation hyperparameters whose values are not detailed in the abstract.

free parameters (1)
  • exploration strategy hyperparameters
    Tuned parameters controlling the exploration-enhancing strategies introduced for stable RL training of VLA policies.
axioms (1)
  • domain assumption: Reinforcement learning can enhance step-by-step action planning in VLA models analogously to its effect on reasoning in large language models
    Invoked in the introduction as the motivating parallel to recent LRM breakthroughs.
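The exploration hyperparameters in the ledger are not enumerated; given the paper's citation on sampling temperature, one common knob of this kind is softmax temperature. The sketch below is a hypothetical illustration of that single knob, not the paper's actual exploration strategy.

```python
import math
import random

def sample_with_temperature(logits, temperature, rng):
    """Softmax sampling with a temperature knob (illustrative only).

    Higher temperature flattens the action distribution, trading
    exploitation for exploration; a hypothetical stand-in for the
    undisclosed exploration hyperparameters noted above.
    """
    scaled = [l / temperature for l in logits]
    mx = max(scaled)                       # subtract max for stability
    z = [math.exp(v - mx) for v in scaled]
    total = sum(z)
    probs = [v / total for v in z]
    r = rng.random()
    acc = 0.0
    for i, p in enumerate(probs):
        acc += p
        if r < acc:
            return i, probs
    return len(probs) - 1, probs

rng = random.Random(0)
_, cold = sample_with_temperature([2.0, 0.0, -1.0], 0.5, rng)
_, hot = sample_with_temperature([2.0, 0.0, -1.0], 5.0, rng)
```

At low temperature the distribution concentrates on the top-scoring action; at high temperature it approaches uniform, which is what makes the value a tunable exploration parameter.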

pith-pipeline@v0.9.0 · 5648 in / 1335 out tokens · 32416 ms · 2026-05-15T07:55:51.457256+00:00 · methodology

discussion (0)


Forward citations

Cited by 21 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. NoiseGate: Learning Per-Latent Timestep Schedules as Information Gating in World Action Models

    cs.RO 2026-05 unverdicted novelty 7.0

    NoiseGate learns per-latent timestep schedules as an information-gating policy in diffusion-based world action models, yielding consistent gains on RoboTwin manipulation tasks.

  2. D-VLA: A High-Concurrency Distributed Asynchronous Reinforcement Learning Framework for Vision-Language-Action Models

    cs.AI 2026-05 unverdicted novelty 6.0

    D-VLA uses plane decoupling and a swimlane pipeline to deliver higher throughput and linear speedup than prior RL frameworks when training billion- and trillion-parameter VLA models on benchmarks like LIBERO.

  3. D-VLA: A High-Concurrency Distributed Asynchronous Reinforcement Learning Framework for Vision-Language-Action Models

    cs.AI 2026-05 unverdicted novelty 6.0

    D-VLA introduces plane decoupling and a swimlane asynchronous pipeline to achieve high-concurrency RL training and linear scalability for billion- to trillion-parameter vision-language-action models.

  4. Learning to See What You Need: Gaze Attention for Multimodal Large Language Models

    cs.CV 2026-05 unverdicted novelty 6.0

    Gaze Attention groups visual embeddings into selectable regions and dynamically restricts attention to task-relevant ones, matching dense baselines with up to 90% fewer visual KV entries via added context tokens.

  5. Reinforcing VLAs in Task-Agnostic World Models

    cs.AI 2026-05 unverdicted novelty 6.0

    RAW-Dream lets VLAs learn new tasks in zero-shot imagination by using a world model pre-trained only on task-free behaviors and an unmodified VLM to supply rewards, with dual-noise verification to limit hallucinations.

  6. RankQ: Offline-to-Online Reinforcement Learning via Self-Supervised Action Ranking

    cs.AI 2026-05 unverdicted novelty 6.0

    RankQ adds a self-supervised ranking loss to Q-learning to learn structured action orderings, yielding competitive or better performance than prior methods on D4RL benchmarks and large gains in vision-based robot fine-tuning.

  7. Unified Noise Steering for Efficient Human-Guided VLA Adaptation

    cs.RO 2026-05 unverdicted novelty 6.0

    UniSteer unifies human corrective actions and noise-space RL for VLA adaptation by inverting actions to noise targets, raising success rates from 20% to 90% in 66 minutes across four real-world manipulation tasks.

  8. Learning while Deploying: Fleet-Scale Reinforcement Learning for Generalist Robot Policies

    cs.RO 2026-05 unverdicted novelty 6.0

    Fleet-scale RL framework improves a single generalist VLA policy from deployment data to 95% average success on eight real-world manipulation tasks with 16 dual-arm robots.

  9. LaST-R1: Reinforcing Robotic Manipulation via Adaptive Physical Latent Reasoning

    cs.RO 2026-04 unverdicted novelty 6.0

    LaST-R1 reaches 99.8% average success on the LIBERO benchmark using one-shot warm-up plus LAPO reinforcement learning on latent physical reasoning, with up to 44% real-world gains on complex single- and dual-arm tasks.

  10. LaST-R1: Reinforcing Robotic Manipulation via Adaptive Physical Latent Reasoning

    cs.RO 2026-04 unverdicted novelty 6.0

    LaST-R1 introduces a RL post-training method called LAPO that optimizes latent Chain-of-Thought reasoning in vision-language-action models, yielding 99.9% success on LIBERO and up to 22.5% real-world gains.

  11. GS-Playground: A High-Throughput Photorealistic Simulator for Vision-Informed Robot Learning

    cs.RO 2026-04 unverdicted novelty 6.0

    GS-Playground delivers a high-throughput photorealistic simulator for vision-informed robot learning via parallel physics integrated with batch 3D Gaussian Splatting at 10^4 FPS and an automated Real2Sim workflow for ...

  12. AsyncShield: A Plug-and-Play Edge Adapter for Asynchronous Cloud-based VLA Navigation

    cs.RO 2026-04 unverdicted novelty 6.0

    AsyncShield restores VLA geometric intent from latency via kinematic pose mapping and uses PPO-Lagrangian to balance tracking with LiDAR safety constraints in a plug-and-play module.

  13. Breaking Lock-In: Preserving Steerability under Low-Data VLA Post-Training

    cs.RO 2026-04 unverdicted novelty 6.0

    DeLock mitigates lock-in in low-data VLA post-training via visual grounding preservation and test-time contrastive prompt guidance, outperforming baselines across eight evaluations while matching data-heavy generalist...

  14. RL Token: Bootstrapping Online RL with Vision-Language-Action Models

    cs.LG 2026-04 unverdicted novelty 6.0

    RL Token enables sample-efficient online RL fine-tuning of large VLAs, delivering up to 3x speed gains and higher success rates on real-robot manipulation tasks within minutes to hours.

  15. CorridorVLA: Explicit Spatial Constraints for Generative Action Heads via Sparse Anchors

    cs.RO 2026-04 unverdicted novelty 6.0

    CorridorVLA improves VLA models by using predicted sparse anchors to impose explicit spatial corridors on action trajectories, yielding 3.4-12.4% success rate gains on LIBERO-Plus with GR00T-Corr reaching 83.21%.

  16. Test-Time Perturbation Learning with Delayed Feedback for Vision-Language-Action Models

    cs.CV 2026-04 unverdicted novelty 6.0

    PDF improves VLA success rates on LIBERO and Atari by applying test-time perturbation learning with delayed feedback to correct trajectory overfitting and overconfidence.

  17. ST-$\pi$: Structured SpatioTemporal VLA for Robotic Manipulation

    cs.RO 2026-04 unverdicted novelty 6.0

    ST-π structures VLA models by having a spatiotemporal VLM produce causally ordered chunk-level prompts that guide a dual-generator action expert to jointly handle spatial and temporal control in robotic manipulation.

  18. $\pi^{*}_{0.6}$: a VLA That Learns From Experience

    cs.LG 2025-11 unverdicted novelty 6.0

    RECAP enables a generalist VLA to self-improve via advantage-conditioned RL on mixed real-world data, more than doubling throughput and halving failure rates on hard manipulation tasks.

  19. TimeRFT: Stimulating Generalizable Time Series Forecasting for TSFMs via Reinforcement Finetuning

    eess.SP 2026-04 unverdicted novelty 5.0

    TimeRFT applies reinforcement learning with multi-faceted step-wise rewards and informative sample selection to improve generalization and accuracy in TSFM adaptation beyond supervised fine-tuning.

  20. Jump-Start Reinforcement Learning with Vision-Language-Action Regularization

    cs.LG 2026-04 unverdicted novelty 5.0

    VLAJS augments PPO with sparse annealed VLA guidance through directional regularization to cut required interactions by over 50% on manipulation tasks and enable zero-shot sim-to-real transfer.

  21. Causal World Modeling for Robot Control

    cs.CV 2026-01 unverdicted novelty 5.0

    LingBot-VA combines video world modeling with policy learning via Mixture-of-Transformers, closed-loop rollouts, and asynchronous inference to improve robot manipulation in simulation and real settings.

Reference graph

Works this paper leans on

44 extracted references · 44 canonical work pages · cited by 19 Pith papers · 26 internal anchors

  1. [1]

    OpenVLA: An Open-Source Vision-Language-Action Model

    Moo Jin Kim, Karl Pertsch, Siddharth Karamcheti, Ted Xiao, Ashwin Balakrishna, Suraj Nair, Rafael Rafailov, Ethan Foster, Grace Lam, Pannag Sanketi, et al. OpenVLA: An open-source vision-language-action model. arXiv preprint arXiv:2406.09246.

  2. [2]

    A Survey on Vision-Language-Action Models: An Action Tokenization Perspective

    Yifan Zhong, Fengshuo Bai, Shaofei Cai, Xuchuan Huang, Zhang Chen, Xiaowei Zhang, Yuanfei Wang, Shaoyang Guo, Tianrui Guan, Ka Nam Lui, et al. A survey on vision-language-action models: An action tokenization perspective. arXiv preprint arXiv:2507.01925.

  3. [3]

    Vision-Language-Action Models: Concepts, Progress, Applications and Challenges

    Ranjan Sapkota, Yang Cao, Konstantinos I Roumeliotis, and Manoj Karkee. Vision-language-action models: Concepts, progress, applications and challenges. arXiv preprint arXiv:2505.04769.

  4. [4]

    What Matters in Learning from Offline Human Demonstrations for Robot Manipulation

    Ajay Mandlekar, Danfei Xu, Josiah Wong, Soroush Nasiriany, Chen Wang, Rohun Kulkarni, Li Fei-Fei, Silvio Savarese, Yuke Zhu, and Roberto Martín-Martín. What matters in learning from offline human demonstrations for robot manipulation. arXiv preprint arXiv:2108.03298.

  5. [5]

    AgiBot World Colosseo: A Large-scale Manipulation Platform for Scalable and Intelligent Embodied Systems

    Qingwen Bu, Jisong Cai, Li Chen, Xiuqi Cui, Yan Ding, Siyuan Feng, Shenyuan Gao, Xindong He, Xu Huang, Shu Jiang, et al. AgiBot World Colosseo: A large-scale manipulation platform for scalable and intelligent embodied systems. arXiv preprint arXiv:2503.06669, 2025a.

  6. [6]

    Gemini Robotics: Bringing AI into the Physical World

    Gemini Robotics Team, Saminda Abeyruwan, Joshua Ainslie, Jean-Baptiste Alayrac, Montserrat Gonzalez Arenas, Travis Armstrong, Ashwin Balakrishna, Robert Baruch, Maria Bauza, Michiel Blokzijl, et al. Gemini Robotics: Bringing AI into the physical world. arXiv preprint arXiv:2503.20020, 2025a.

  7. [7]

    Qwen3 Technical Report

    An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, et al. Qwen3 technical report. arXiv preprint arXiv:2505.09388.

  8. [8]

    Eureka: Human-Level Reward Design via Coding Large Language Models

    Yecheng Jason Ma, William Liang, Guanzhi Wang, De-An Huang, Osbert Bastani, Dinesh Jayaraman, Yuke Zhu, Linxi Fan, and Anima Anandkumar. Eureka: Human-level reward design via coding large language models. arXiv preprint arXiv:2310.12931.

  9. [9]

    HybridFlow: A Flexible and Efficient RLHF Framework

    Guangming Sheng, Chi Zhang, Zilingfeng Ye, Xibin Wu, Wang Zhang, Ru Zhang, Yanghua Peng, Haibin Lin, and Chuan Wu. HybridFlow: A flexible and efficient RLHF framework. arXiv preprint arXiv:2409.19256.

  10. [10]

    DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models

    Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, YK Li, Y Wu, et al. DeepSeekMath: Pushing the limits of mathematical reasoning in open language models. arXiv preprint arXiv:2402.03300.

  11. [11]

    The effect of sampling temperature on problem solving in large language models

    Matthew Renze. The effect of sampling temperature on problem solving in large language models. In Findings of the Association for Computational Linguistics: EMNLP 2024, pages 7346–7356.

  12. [12]

    $\pi_0$: A Vision-Language-Action Flow Model for General Robot Control

    Kevin Black, Noah Brown, Danny Driess, Adnan Esmail, Michael Equi, Chelsea Finn, Niccolo Fusai, Lachy Groom, Karol Hausman, Brian Ichter, et al. π₀: A vision-language-action flow model for general robot control. arXiv preprint arXiv:2410.24164.

  13. [13]

    RDT-1B: a Diffusion Foundation Model for Bimanual Manipulation

    Songming Liu, Lingxuan Wu, Bangguo Li, Hengkai Tan, Huayu Chen, Zhengyi Wang, Ke Xu, Hang Su, and Jun Zhu. RDT-1B: A diffusion foundation model for bimanual manipulation. arXiv preprint arXiv:2410.07864.

  14. [14]

    GR-3 Technical Report

    Chilam Cheang, Sijin Chen, Zhongren Cui, Yingdong Hu, Liqun Huang, Tao Kong, Hang Li, Yifeng Li, Yuxiao Liu, Xiao Ma, et al. GR-3 technical report. arXiv preprint arXiv:2507.15493.

  15. [15]

    Fine-Tuning Vision-Language-Action Models: Optimizing Speed and Success

    Moo Jin Kim, Chelsea Finn, and Percy Liang. Fine-tuning vision-language-action models: Optimizing speed and success. arXiv preprint arXiv:2502.19645.

  16. [16]

    DAPO: An Open-Source LLM Reinforcement Learning System at Scale

    Qiying Yu, Zheng Zhang, Ruofei Zhu, Yufeng Yuan, Xiaochen Zuo, Yu Yue, Tiantian Fan, Gaohong Liu, Lingjun Liu, Xin Liu, et al. DAPO: An open-source LLM reinforcement learning system at scale. arXiv preprint arXiv:2503.14476.

  17. [17]

    ProRL: Prolonged Reinforcement Learning Expands Reasoning Boundaries in Large Language Models

    Mingjie Liu, Shizhe Diao, Ximing Lu, Jian Hu, Xin Dong, Yejin Choi, Jan Kautz, and Yi Dong. ProRL: Prolonged reinforcement learning expands reasoning boundaries in large language models. arXiv preprint arXiv:2505.24864, 2025b.

  18. [18]

    Process Reinforcement through Implicit Rewards

    Ganqu Cui, Lifan Yuan, Zefan Wang, Hanbin Wang, Wendi Li, Bingxiang He, Yuchen Fan, Tianyu Yu, Qixin Xu, Weize Chen, et al. Process reinforcement through implicit rewards. arXiv preprint arXiv:2502.01456, 2025a.

  19. [19]

    Proximal Policy Optimization Algorithms

    John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347.

  20. [20]

    Enhancing Efficiency and Exploration in Reinforcement Learning for LLMs

    Mengqi Liao, Xiangyu Xi, Ruinian Chen, Jia Leng, Yangen Hu, Ke Zeng, Shuai Liu, and Huaiyu Wan. Enhancing efficiency and exploration in reinforcement learning for LLMs. arXiv preprint arXiv:2505.18573.

  21. [21]

    RoboTwin 2.0: A Scalable Data Generator and Benchmark with Strong Domain Randomization for Robust Bimanual Robotic Manipulation

    Tianxing Chen, Zanxin Chen, Baijun Chen, Zijian Cai, Yibin Liu, Qiwei Liang, Zixuan Li, Xianliang Lin, Yiheng Ge, Zhenyu Gu, et al. RoboTwin 2.0: A scalable data generator and benchmark with strong domain randomization for robust bimanual robotic manipulation. arXiv preprint arXiv:2506.18088, 2025a.

  22. [22]

    UniVLA: Learning to Act Anywhere with Task-centric Latent Actions

    Qingwen Bu, Yanting Yang, Jisong Cai, Shenyuan Gao, Guanghui Ren, Maoqing Yao, Ping Luo, and Hongyang Li. UniVLA: Learning to act anywhere with task-centric latent actions. arXiv preprint arXiv:2505.06111, 2025b.

  23. [23]

    NORA: A Small Open-Sourced Generalist Vision Language Action Model for Embodied Tasks

    Chia-Yu Hung, Qi Sun, Pengfei Hong, Amir Zadeh, Chuan Li, U Tan, Navonil Majumder, Soujanya Poria, et al. NORA: A small open-sourced generalist vision language action model for embodied tasks. arXiv preprint arXiv:2504.19854.

  24. [24]

    Octo: An Open-Source Generalist Robot Policy

    Octo Model Team, Dibya Ghosh, Homer Walke, Karl Pertsch, Kevin Black, Oier Mees, Sudeep Dasari, Joey Hejna, Tobias Kreiman, Charles Xu, et al. Octo: An open-source generalist robot policy. arXiv preprint arXiv:2405.12213.

  25. [25]

    3D Diffusion Policy: Generalizable Visuomotor Policy Learning via Simple 3D Representations

    Yanjie Ze, Gu Zhang, Kangning Zhang, Chenyuan Hu, Muhan Wang, and Huazhe Xu. 3D diffusion policy: Generalizable visuomotor policy learning via simple 3D representations. arXiv preprint arXiv:2403.03954.

  26. [26]

    $\pi_{0.5}$: a Vision-Language-Action Model with Open-World Generalization

    Physical Intelligence, Kevin Black, Noah Brown, James Darpinian, Karan Dhabalia, Danny Driess, Adnan Esmail, Michael Equi, Chelsea Finn, Niccolo Fusai, et al. π_{0.5}: A vision-language-action model with open-world generalization. arXiv preprint arXiv:2504.16054.

  27. [27]

    Training Compute-Optimal Large Language Models

    Jordan Hoffmann, Sebastian Borgeaud, Arthur Mensch, Elena Buchatskaya, Trevor Cai, Eliza Rutherford, Diego de Las Casas, Lisa Anne Hendricks, Johannes Welbl, Aidan Clark, et al. Training compute-optimal large language models. arXiv preprint arXiv:2203.15556.

  28. [28]

    GPT-4 Technical Report

    Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. GPT-4 technical report. arXiv preprint arXiv:2303.08774.

  29. [29]

    H-RDT: Human Manipulation Enhanced Bimanual Robotic Manipulation

    Hongzhe Bi, Lingxuan Wu, Tianwei Lin, Hengkai Tan, Zhizhong Su, Hang Su, and Jun Zhu. H-RDT: Human manipulation enhanced bimanual robotic manipulation. arXiv preprint arXiv:2507.23523.

  30. [30]

    OpenAI o1 System Card

    Aaron Jaech, Adam Kalai, Adam Lerer, Adam Richardson, Ahmed El-Kishky, Aiden Low, Alec Helyar, Aleksander Madry, Alex Beutel, Alex Carney, et al. OpenAI o1 system card. arXiv preprint arXiv:2412.16720.

  31. [31]

    Understanding R1-Zero-Like Training: A Critical Perspective

    Zichen Liu, Changyu Chen, Wenjun Li, Penghui Qi, Tianyu Pang, Chao Du, Wee Sun Lee, and Min Lin. Understanding R1-Zero-like training: A critical perspective. arXiv preprint arXiv:2503.20783, 2025d.

  32. [32]

    The Entropy Mechanism of Reinforcement Learning for Reasoning Language Models

    Ganqu Cui, Yuchen Zhang, Jiacheng Chen, Lifan Yuan, Zhi Wang, Yuxin Zuo, Haozhan Li, Yuchen Fan, Huayu Chen, Weize Chen, et al. The entropy mechanism of reinforcement learning for reasoning language models. arXiv preprint arXiv:2505.22617, 2025b.

  33. [33]

    Learning to Manipulate Anywhere: A Visual Generalizable Framework for Reinforcement Learning

    Zhecheng Yuan, Tianming Wei, Shuiqi Cheng, Gu Zhang, Yuanpei Chen, and Huazhe Xu. Learning to manipulate anywhere: A visual generalizable framework for reinforcement learning. arXiv preprint arXiv:2407.15815.

  34. [34]

    Robotic Control via Embodied Chain-of-Thought Reasoning

    Michał Zawalski, William Chen, Karl Pertsch, Oier Mees, Chelsea Finn, and Sergey Levine. Robotic control via embodied chain-of-thought reasoning. arXiv preprint arXiv:2407.08693.

  35. [35]

    Training strategies for efficient embodied reasoning.arXiv preprint arXiv:2505.08243, 2025b

    William Chen, Suneel Belkhale, Suvir Mirchandani, Oier Mees, Danny Driess, Karl Pertsch, and Sergey Levine. Training strategies for efficient embodied reasoning. arXiv preprint arXiv:2505.08243, 2025b.

    Yucheng Hu, Yanjiang Guo, Pengchao Wang, Xiaoyu Chen, Yen-Jen Wang, Jianke Zhang, Koushil Sreenath, Chaochao Lu, and Jianyu Chen. Video prediction policy: A...

  36. [36]

    Roboverse: Towards a Unified Platform, Dataset and Benchmark for Scalable and Generalizable Robot Learning

    Haoran Geng, Feishi Wang, Songlin Wei, Yuyang Li, Bangjun Wang, Boshi An, Charlie Tianyue Cheng, Haozhe Lou, Peihao Li, Yen-Jen Wang, et al. Roboverse: Towards a unified platform, dataset and benchmark for scalable and generalizable robot learning. arXiv preprint arXiv:2504.18904, 2025.

  37. [37]

    Dexmimicgen: Automated Data Generation for Bimanual Dexterous Manipulation via Imitation Learning

    Zhenyu Jiang, Yuqi Xie, Kevin Lin, Zhenjia Xu, Weikang Wan, Ajay Mandlekar, Linxi Fan, and Yuke Zhu. Dexmimicgen: Automated data generation for bimanual dexterous manipulation via imitation learning. arXiv preprint arXiv:2410.24185, 2024.

  38. [38]

    Grape: Generalizing Robot Policy via Preference Alignment

    Zijian Zhang, Kaiyuan Zheng, Zhaorun Chen, Joel Jang, Yi Li, Siwei Han, Chaoqi Wang, Mingyu Ding, Dieter Fox, and Huaxiu Yao. Grape: Generalizing robot policy via preference alignment. arXiv preprint arXiv:2411.19309, 2024.

  39. [39]

    Conrft: A Reinforced Fine-Tuning Method for VLA Models via Consistency Policy

    Yuhui Chen, Shuai Tian, Shugao Liu, Yingting Zhou, Haoran Li, and Dongbin Zhao. Conrft: A reinforced fine-tuning method for VLA models via consistency policy. arXiv preprint arXiv:2502.05450, 2025c.

    Luong Trung, Xinbo Zhang, Zhanming Jie, Peng Sun, Xiaoran Jin, and Hang Li. Reft: Reasoning with reinforced fine-tuning. In Proceedings of the 62nd Annual Meeting of the Associati...

  40. [40]

    Reinbot: Amplifying Robot Visual-Language Manipulation with Reinforcement Learning

    Hongyin Zhang, Zifeng Zhuang, Han Zhao, Pengxiang Ding, Hongchao Lu, and Donglin Wang. Reinbot: Amplifying robot visual-language manipulation with reinforcement learning. arXiv preprint arXiv:2505.07395, 2025.

  41. [41]

    Improving Vision-Language-Action Model with Online Reinforcement Learning

    Yanjiang Guo, Jianke Zhang, Xiaoyu Chen, Xiang Ji, Yen-Jen Wang, Yucheng Hu, and Jianyu Chen. Improving vision-language-action model with online reinforcement learning. arXiv preprint arXiv:2501.16664, 2025b.

    Shuhan Tan, Kairan Dou, Yue Zhao, and Philipp Krähenbühl. Interactive post-training ...

  42. [42]

    Back to Basics: Revisiting REINFORCE Style Optimization for Learning from Human Feedback in LLMs

    Arash Ahmadian, Chris Cremer, Matthias Gallé, Marzieh Fadaee, Julia Kreutzer, Olivier Pietquin, Ahmet Üstün, and Sara Hooker. Back to basics: Revisiting reinforce style optimization for learning from human feedback in llms. arXiv preprint arXiv:2402.14740, 2024.

  43. [43]

    Vla-rl: Towards Masterful and General Robotic Manipulation with Scalable Reinforcement Learning

    Guanxing Lu, Wenkai Guo, Chubin Zhang, Yuheng Zhou, Haonan Jiang, Zifeng Gao, Yansong Tang, and Ziwei Wang. Vla-rl: Towards masterful and general robotic manipulation with scalable reinforcement learning. arXiv preprint arXiv:2505.18719, 2025.

  44. [44]

    Tgrpo: Fine-Tuning Vision-Language-Action Model via Trajectory-Wise Group Relative Policy Optimization

    Zengjue Chen, Runliang Niu, He Kong, and Qi Wang. Tgrpo: Fine-tuning vision-language-action model via trajectory-wise group relative policy optimization. arXiv preprint arXiv:2506.08440, 2025d.

    Junyang Shu, Zhiwei Lin, and Yongtao Wang. Rftf: Reinforcement fine-tuning for embodied agents with temporal feedback. arXiv preprint arXiv:2505.19767, 2025.