pith. machine review for the scientific record. sign in

arxiv: 2605.07794 · v1 · submitted 2026-05-08 · 💻 cs.RO

Recognition: 2 theorem links

· Lean Theorem

NoiseGate: Learning Per-Latent Timestep Schedules as Information Gating in World Action Models

Haoran Li, Haoran Sun, Jing Long, Junwu Xiong, Shuai Di, Wen Huang, Yongjian Guo, Yucheng Guo, Yunxuan Ma, Zhong Guan, Zhouying Mo

Pith reviewed 2026-05-11 02:53 UTC · model grok-4.3

classification 💻 cs.RO
keywords World Action Modelsper-latent timestep schedulesinformation gatingMixture-of-Transformersrobot manipulationtask-reward optimizationdenoising schedulesjoint video-action modeling
0
0 comments X

The pith

NoiseGate learns per-latent timestep schedules to act as information gates in joint video-action world models for robot control.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

World Action Models generate actions by co-producing future observations along a shared denoising trajectory in a Mixture-of-Transformers backbone. Existing versions collapse all frames onto one shared timestep, treating every predicted latent as equally reliable for action decisions. NoiseGate instead samples independent timesteps per latent during training and deploys a small Gating Policy Network to output per-frame time increments at inference time. The network is trained end-to-end by task reward, without any hand-designed schedule shapes. On random-scene manipulation benchmarks the learned schedules produce consistent performance lifts over the shared-t baseline.

Core claim

By viewing each latent frame's noise level as a controllable reliability knob rather than a fixed hyperparameter, a lightweight policy network can discover per-latent timestep schedules that modulate the Key/Value contribution of each predicted observation to the action tokens; when this policy is optimized directly on task reward, the resulting schedules improve action generation quality over any fixed shared schedule.

What carries the argument

Lightweight Gating Policy Network that outputs per-latent time increments, trained with independent per-latent timestep sampling and task-reward optimization inside a joint video-action MoT backbone.

If this is right

  • Action generation can selectively ignore or trust different predicted future frames depending on the current state.
  • The same backbone can be reused across tasks because the schedule policy adapts without manual retuning.
  • Perception-prediction-control coupling becomes finer-grained than a single global noise level allows.
  • Training stability is maintained even though the schedule is no longer a fixed hyperparameter.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Similar per-component gating could be applied to any generative model where different output tokens have unequal downstream value.
  • The learned schedules might reveal which future observations carry the most decision-relevant information in manipulation tasks.
  • Extending the approach to longer horizons or multi-agent settings would test whether the same lightweight policy remains sufficient.
  • If the gating policy can be frozen after training, inference cost stays essentially unchanged from the baseline MoT.

Load-bearing premise

A small policy network optimized only on task reward can reliably discover useful per-latent noise schedules without any hand-crafted shape constraints.

What would settle it

Replace the learned per-latent schedules with either random schedules or the original shared-t schedule and measure whether task success rate on the RoboTwin random-scene suite drops by a statistically significant margin.

Figures

Figures reproduced from arXiv: 2605.07794 by Haoran Li, Haoran Sun, Jing Long, Junwu Xiong, Shuai Di, Wen Huang, Yongjian Guo, Yucheng Guo, Yunxuan Ma, Zhong Guan, Zhouying Mo.

Figure 1
Figure 1. Figure 1: Method preview. NoiseGate learns per￾latent schedules as task-adaptive information gates in a joint video-action denoising backbone. World Action Models (WAMs) [1, 2, 3, 4, 5] generalize classical video-language-action poli￾cies by modeling the joint distribution p(v, a | v0, l) over a chunk of F future latent frames v = (v1, . . . , vF ) and an action chunk a, given the current observation latent v0 and l… view at source ↗
Figure 2
Figure 2. Figure 2: Overview of NoiseGate. The unified framework figure summarizes both the joint-sequence MoT backbone and the Gating Policy Network (GPN). At every denoising step, the GPN reads the current predicted-chunk latents and per-latent times, and emits increments ∆tf for v. The observation v0 is pinned at t0 = 0, and the action follows its own global schedule. 3.3 Learning the Per-Frame Schedule as a Policy We cast… view at source ↗
Figure 3
Figure 3. Figure 3: Noise as masking in the joint self-attention. Mean action→video attention versus each predicted frame’s current noise level tf . The monotone de￾cay confirms that tf empirically attenuates each frame’s K/V con￾tribution to the action tokens. Task A Task B [PITH_FULL_IMAGE:figures/full_fig_p008_3.png] view at source ↗
Figure 5
Figure 5. Figure 5: Visualization of a real test case comparing Stage-1 WAM and NoiseGate. For each method, the top row shows the true frames after executing the predicted actions, and the bottom row shows the predicted frames. Stage-1 WAM’s standard denoising leads to overconfident grasp prediction (second frame) and premature failure, whereas NoiseGate maintains higher uncertainty in the grasp frame through the learnable sc… view at source ↗
Figure 6
Figure 6. Figure 6: Layer-stratified version of [PITH_FULL_IMAGE:figures/full_fig_p017_6.png] view at source ↗
Figure 7
Figure 7. Figure 7: Final-step residual noise by task and future-frame index. For each RoboTwin task, dots show the mean final-step noise level t¯of each predicted future frame over all evaluated chunks, and horizontal bars show ±1 standard deviation. Tasks are sorted by success rate. The non-uniform, task-dependent residuals show that the learned GPN does not collapse to a shared denoising endpoint; instead, it selectively l… view at source ↗
read the original abstract

World Action Models (WAMs) are an emerging family of policies that tie robot action generation to future-observation modeling. In this work, we focus on the joint video--action modeling paradigm, where actions and imagined future observations are co-generated along a shared denoising or flow trajectory, so that perception, prediction, and control are coupled within one generative process. Existing WAMs typically realize this paradigm with a Mixture-of-Transformers (MoT), where video and action tokens interact through shared self-attention. This architecture can in principle assign a separate timestep $t_f$ to each predicted latent frame, yet current systems collapse this degree of freedom onto a single shared scalar $t$. Under the noise-as-masking view of Diffusion Forcing, this shared schedule imposes the unjustified prior that every predicted latent is equally reliable for action generation. We instead view the per-latent schedule as a \emph{learnable information-gating policy}: by changing a latent frame's noise level, the policy modulates the reliability of its Key/Value contribution to the action tokens. We propose \textbf{NoiseGate}, which combines independent per-latent timestep sampling during backbone training, a lightweight Gating Policy Network that emits per-latent time increments during denoising, and task-reward optimization that trains the schedule policy without hand-crafted shape priors. Built on a joint video--action MoT backbone, NoiseGate delivers consistent gains on diverse RoboTwin random-scene manipulation tasks.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces NoiseGate, a method to learn per-latent timestep schedules as an information-gating policy within joint video-action Mixture-of-Transformers (MoT) backbones for World Action Models. It combines independent per-latent timestep sampling during training, a lightweight Gating Policy Network that outputs per-latent time increments during denoising, and task-reward optimization to train the schedule policy without hand-crafted shape priors. The central empirical claim is that this yields consistent performance gains on diverse RoboTwin random-scene manipulation tasks.

Significance. If the claimed gains are robustly demonstrated, the approach could meaningfully advance generative world models for robotics by relaxing the shared-timestep assumption and allowing adaptive modulation of latent-frame reliability in action generation, potentially improving policy performance in unstructured manipulation settings.

major comments (2)
  1. Abstract: The claim of 'consistent gains' on RoboTwin tasks is stated without any quantitative metrics, baseline comparisons, error bars, or statistical tests, making it impossible to assess the magnitude, reliability, or reproducibility of the reported improvements.
  2. Method description (no numbered equations provided): The Gating Policy Network is introduced as emitting per-latent time increments, yet no explicit formulation is given for how these increments are applied to the shared denoising trajectory or how the task-reward objective is defined, leaving open whether the learned schedules provide information gating beyond what a shared-t baseline already achieves.
minor comments (2)
  1. The abstract introduces 'Mixture-of-Transformers (MoT)' and 'Diffusion Forcing' without citing the foundational references for these components.
  2. Several long sentences in the abstract could be split to improve readability and clarity of the technical contributions.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We address each major comment below and agree that targeted revisions will improve clarity and support for the claims. We will update the manuscript accordingly.

read point-by-point responses
  1. Referee: Abstract: The claim of 'consistent gains' on RoboTwin tasks is stated without any quantitative metrics, baseline comparisons, error bars, or statistical tests, making it impossible to assess the magnitude, reliability, or reproducibility of the reported improvements.

    Authors: We agree that the abstract would benefit from quantitative support to substantiate the 'consistent gains' claim. In the revised version, we will expand the abstract to include key metrics from the experimental results, such as average success rate improvements over shared-t baselines on RoboTwin tasks, along with references to error bars and statistical tests presented in the main results section. This will allow readers to evaluate the magnitude and reliability of the improvements directly from the abstract while maintaining its conciseness. revision: yes

  2. Referee: Method description (no numbered equations provided): The Gating Policy Network is introduced as emitting per-latent time increments, yet no explicit formulation is given for how these increments are applied to the shared denoising trajectory or how the task-reward objective is defined, leaving open whether the learned schedules provide information gating beyond what a shared-t baseline already achieves.

    Authors: We acknowledge that the current prose description of the Gating Policy Network and its integration lacks the precision of explicit equations, which could leave ambiguity about its distinction from a shared-t baseline. We will revise the method section to include numbered equations defining: the per-latent timestep computation (t_i = t + Δt_i where Δt_i is output by the lightweight policy network), the application of these timesteps within the joint video-action MoT denoising process, and the task-reward objective used for policy optimization. These additions will explicitly demonstrate how the per-latent modulation enables adaptive information gating that goes beyond uniform timestep assumptions, as validated by our ablation studies. revision: yes

Circularity Check

0 steps flagged

No significant circularity; empirical gains rest on added trainable components rather than definitional reduction

full rationale

The paper introduces a Gating Policy Network and task-reward optimization to learn per-latent timestep schedules on top of an existing joint video-action MoT backbone. The central claim of consistent gains on RoboTwin tasks is presented as an empirical outcome from this new architecture and training procedure. No equations, derivations, or self-citations are shown that reduce the claimed improvement to a quantity defined by the method itself or to a fitted parameter renamed as a prediction. The approach explicitly avoids hand-crafted priors and uses independent per-latent sampling, keeping the derivation self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

1 free parameters · 1 axioms · 1 invented entities

Based on the abstract alone, the claim rests on the noise-as-masking interpretation of Diffusion Forcing, the MoT backbone, and the assumption that reward optimization can discover effective schedules. The Gating Policy Network introduces new trainable parameters whose values are not reported.

free parameters (1)
  • Gating Policy Network parameters
    Lightweight network weights trained via task-reward optimization; no count or initialization details given.
axioms (1)
  • domain assumption Noise-as-masking view of Diffusion Forcing holds for joint video-action modeling
    Invoked to justify treating timestep choice as information gating.
invented entities (1)
  • Gating Policy Network no independent evidence
    purpose: Emits per-latent time increments during denoising to modulate Key/Value reliability
    New component introduced to learn the schedule policy; no independent evidence of its necessity outside the paper's claim.

pith-pipeline@v0.9.0 · 5600 in / 1321 out tokens · 36534 ms · 2026-05-11T02:53:59.276492+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

Reference graph

Works this paper leans on

41 extracted references · 41 canonical work pages · 22 internal anchors

  1. [1]

    Motus: A Unified Latent Action World Model

    Hongzhe Bi, Hengkai Tan, Shenghao Xie, Zeyuan Wang, Shuhe Huang, Haitian Liu, Ruowen Zhao, Yao Feng, Chendong Xiang, Yinze Rong, et al. Motus: A unified latent action world model.arXiv preprint arXiv:2512.13030, 2025

  2. [2]

    Fast-WAM: Do World Action Models Need Test-time Future Imagination?

    Tianyuan Yuan, Zibin Dong, Yicheng Liu, and Hang Zhao. Fast-wam: Do world action models need test-time future imagination?arXiv preprint arXiv:2603.16666, 2026

  3. [3]

    World Action Models are Zero-shot Policies

    Seonghyeon Ye, Yunhao Ge, Kaiyuan Zheng, Shenyuan Gao, Sihyun Yu, George Kurian, Suneel Indupuru, You Liang Tan, Chuning Zhu, Jiannan Xiang, et al. World action models are zero-shot policies.arXiv preprint arXiv:2602.15922, 2026

  4. [4]

    Videovla: Video generators can be generalizable robot manipulators.arXiv preprint arXiv:2512.06963, 2025

    Yichao Shen, Fangyun Wei, Zhiying Du, Yaobo Liang, Yan Lu, Jiaolong Yang, Nanning Zheng, and Baining Guo. Videovla: Video generators can be generalizable robot manipulators.arXiv preprint arXiv:2512.06963, 2025

  5. [5]

    Cosmos Policy: Fine-Tuning Video Models for Visuomotor Control and Planning

    Moo Jin Kim, Yihuai Gao, Tsung-Yi Lin, Yen-Chen Lin, Yunhao Ge, Grace Lam, Percy Liang, Shuran Song, Ming-Yu Liu, Chelsea Finn, et al. Cosmos policy: Fine-tuning video models for visuomotor control and planning.arXiv preprint arXiv:2601.16163, 2026

  6. [6]

    Rynnvla-002: A unified vision-language-action and world model.arXiv preprint arXiv:2511.17502, 2025

    Jun Cen, Siteng Huang, Yuqian Yuan, Kehan Li, Hangjie Yuan, Chaohui Yu, Yuming Jiang, Jiayan Guo, Xin Li, Hao Luo, et al. Rynnvla-002: A unified vision-language-action and world model.arXiv preprint arXiv:2511.17502, 2025

  7. [7]

    Unified Video Action Model

    Shuang Li, Yihuai Gao, Dorsa Sadigh, and Shuran Song. Unified video action model.arXiv preprint arXiv:2503.00200, 2025

  8. [8]

    Wan: Open and Advanced Large-Scale Video Generative Models

    Team Wan, Ang Wang, Baole Ai, Bin Wen, Chaojie Mao, Chen-Wei Xie, Di Chen, Feiwu Yu, Haiming Zhao, Jianxiao Yang, Jianyuan Zeng, Jiayu Wang, Jingfeng Zhang, Jingren Zhou, Jinkai Wang, Jixuan Chen, Kai Zhu, Kang Zhao, Keyu Yan, Lianghua Huang, Mengyang Feng, Ningyi Zhang, Pandeng Li, Pingyu Wu, Ruihang Chu, Ruili Feng, Shiwei Zhang, Siyang Sun, Tao Fang, T...

  9. [9]

    Diffusion forcing: Next-token prediction meets full-sequence diffusion.Advances in Neural Information Processing Systems, 37:24081–24125, 2024

    Boyuan Chen, Diego Martí Monsó, Yilun Du, Max Simchowitz, Russ Tedrake, and Vincent Sitzmann. Diffusion forcing: Next-token prediction meets full-sequence diffusion.Advances in Neural Information Processing Systems, 37:24081–24125, 2024

  10. [10]

    arXiv preprint arXiv:2411.04996 , year =

    Weixin Liang, Lili Yu, Liang Luo, Srinivasan Iyer, Ning Dong, Chunting Zhou, Gargi Ghosh, Mike Lewis, Wen-tau Yih, Luke Zettlemoyer, et al. Mixture-of-transformers: A sparse and scalable architecture for multi-modal foundation models.arXiv preprint arXiv:2411.04996, 2024

  11. [11]

    DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models

    Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, YK Li, Yang Wu, et al. Deepseekmath: Pushing the limits of mathematical reasoning in open language models.arXiv preprint arXiv:2402.03300, 2024

  12. [12]

    Diffusion policy: Visuomotor policy learning via action diffusion

    Cheng Chi, Zhenjia Xu, Siyuan Feng, Eric Cousineau, Yilun Du, Benjamin Burchfiel, Russ Tedrake, and Shuran Song. Diffusion policy: Visuomotor policy learning via action diffusion. The International Journal of Robotics Research, 44(10-11):1684–1704, 2025

  13. [13]

    SmolVLA: A Vision-Language-Action Model for Affordable and Efficient Robotics

    Mustafa Shukor, Dana Aubakirova, Francesco Capuano, Pepijn Kooijmans, Steven Palma, Adil Zouitine, Michel Aractingi, Caroline Pascal, Martino Russi, Andres Marafioti, et al. Smolvla: A vision-language-action model for affordable and efficient robotics.arXiv preprint arXiv:2506.01844, 2025

  14. [14]

    Cogvla: Cognition-aligned vision-language- action model via instruction-driven routing & sparsification.arXiv preprint arXiv:2508.21046, 2025

    Wei Li, Renshan Zhang, Rui Shao, Jie He, and Liqiang Nie. Cogvla: Cognition-aligned vision-language-action model via instruction-driven routing & sparsification.arXiv preprint arXiv:2508.21046, 2025. 10

  15. [15]

    GR00T N1: An Open Foundation Model for Generalist Humanoid Robots

    Johan Bjorck, Fernando Castañeda, Nikita Cherniadev, Xingye Da, Runyu Ding, Linxi Fan, Yu Fang, Dieter Fox, Fengyuan Hu, Spencer Huang, et al. Gr00t n1: An open foundation model for generalist humanoid robots.arXiv preprint arXiv:2503.14734, 2025

  16. [16]

    $\pi_{0.5}$: a Vision-Language-Action Model with Open-World Generalization

    Physical Intelligence, Kevin Black, Noah Brown, James Darpinian, Karan Dhabalia, Danny Driess, Adnan Esmail, Michael Equi, Chelsea Finn, Niccolo Fusai, et al. pi0.5: a vision- language-action model with open-world generalization.arXiv preprint arXiv:2504.16054, 2025

  17. [17]

    RDT-1B: a Diffusion Foundation Model for Bimanual Manipulation

    Songming Liu, Lingxuan Wu, Bangguo Li, Hengkai Tan, Huayu Chen, Zhengyi Wang, Ke Xu, Hang Su, and Jun Zhu. Rdt-1b: a diffusion foundation model for bimanual manipulation.arXiv preprint arXiv:2410.07864, 2024

  18. [18]

    OpenVLA: An Open-Source Vision-Language-Action Model

    Moo Jin Kim, Karl Pertsch, Siddharth Karamcheti, Ted Xiao, Ashwin Balakrishna, Suraj Nair, Rafael Rafailov, Ethan Foster, Grace Lam, Pannag Sanketi, et al. Openvla: An open-source vision-language-action model.arXiv preprint arXiv:2406.09246, 2024

  19. [19]

    $\pi_0$: A Vision-Language-Action Flow Model for General Robot Control

    Kevin Black, Noah Brown, Danny Driess, Adnan Esmail, Michael Equi, Chelsea Finn, Niccolo Fusai, Lachy Groom, Karol Hausman, Brian Ichter, et al. pi0: A vision-language-action flow model for general robot control.arXiv preprint arXiv:2410.24164, 2024

  20. [20]

    Rt-2: Vision-language-action models transfer web knowledge to robotic control

    Brianna Zitkovich, Tianhe Yu, Sichun Xu, Peng Xu, Ted Xiao, Fei Xia, Jialin Wu, Paul Wohlhart, Stefan Welker, Ayzaan Wahid, et al. Rt-2: Vision-language-action models transfer web knowledge to robotic control. InConference on Robot Learning, pages 2165–2183. PMLR, 2023

  21. [21]

    Gemini Robotics: Bringing AI into the Physical World

    Gemini Robotics Team, Saminda Abeyruwan, Joshua Ainslie, Jean-Baptiste Alayrac, Montser- rat Gonzalez Arenas, Travis Armstrong, Ashwin Balakrishna, Robert Baruch, Maria Bauza, Michiel Blokzijl, et al. Gemini robotics: Bringing ai into the physical world.arXiv preprint arXiv:2503.20020, 2025

  22. [22]

    Memoryvla: Perceptual-cognitive memory in vision- language-action models for robotic manipulation.arXiv preprint arXiv:2508.19236, 2025

    Hao Shi, Bin Xie, Yingfei Liu, Lin Sun, Fengrong Liu, Tiancai Wang, Erjin Zhou, Haoqiang Fan, Xiangyu Zhang, and Gao Huang. Memoryvla: Perceptual-cognitive memory in vision- language-action models for robotic manipulation.arXiv preprint arXiv:2508.19236, 2025

  23. [23]

    GR-2: A Generative Video-Language-Action Model with Web-Scale Knowledge for Robot Manipulation

    Chi-Lam Cheang, Guangzeng Chen, Ya Jing, Tao Kong, Hang Li, Yifeng Li, Yuxiao Liu, Hongtao Wu, Jiafeng Xu, Yichu Yang, et al. Gr-2: A generative video-language-action model with web-scale knowledge for robot manipulation.arXiv preprint arXiv:2410.06158, 2024

  24. [24]

    Dreamgen: Unlocking generalization in robot learning through video world models.arXiv preprint arXiv:2505.12705, 2025

    Joel Jang, Seonghyeon Ye, Zongyu Lin, Jiannan Xiang, Johan Bjorck, Yu Fang, Fengyuan Hu, Spencer Huang, Kaushil Kundalia, Yen-Chen Lin, et al. Dreamgen: Unlocking generalization in robot learning through video world models.arXiv preprint arXiv:2505.12705, 2025

  25. [25]

    Cosmos World Foundation Model Platform for Physical AI

    Niket Agarwal, Arslan Ali, Maciej Bala, Yogesh Balaji, Erik Barker, Tiffany Cai, Prithvijit Chattopadhyay, Yongxin Chen, Yin Cui, Yifan Ding, et al. Cosmos world foundation model platform for physical ai.arXiv preprint arXiv:2501.03575, 2025

  26. [26]

    Unleashing Large-Scale Video Generative Pre-training for Visual Robot Manipulation

    Hongtao Wu, Ya Jing, Chilam Cheang, Guangzeng Chen, Jiafeng Xu, Xinghang Li, Minghuan Liu, Hang Li, and Tao Kong. Unleashing large-scale video generative pre-training for visual robot manipulation.arXiv preprint arXiv:2312.13139, 2023

  27. [27]

    Ro- bodreamer: Learning compositional world models for robot imagination.arXiv preprint arXiv:2404.12377, 2024

    Siyuan Zhou, Yilun Du, Jiaben Chen, Yandong Li, Dit-Yan Yeung, and Chuang Gan. Ro- bodreamer: Learning compositional world models for robot imagination.arXiv preprint arXiv:2404.12377, 2024

  28. [28]

    Video Prediction Policy: A Generalist Robot Policy with Predictive Visual Representations

    Yucheng Hu, Yanjiang Guo, Pengchao Wang, Xiaoyu Chen, Yen-Jen Wang, Jianke Zhang, Koushil Sreenath, Chaochao Lu, and Jianyu Chen. Video prediction policy: A generalist robot policy with predictive visual representations.arXiv preprint arXiv:2412.14803, 2024

  29. [29]

    Denoising Diffusion Implicit Models

    Jiaming Song, Chenlin Meng, and Stefano Ermon. Denoising diffusion implicit models.arXiv preprint arXiv:2010.02502, 2020. 11

  30. [30]

    Elucidating the design space of diffusion-based generative models.Advances in neural information processing systems, 35:26565–26577, 2022

    Tero Karras, Miika Aittala, Timo Aila, and Samuli Laine. Elucidating the design space of diffusion-based generative models.Advances in neural information processing systems, 35:26565–26577, 2022

  31. [31]

    Schedule on the fly: Diffusion time prediction for faster and better image generation

    Zilyu Ye, Zhiyang Chen, Tiancheng Li, Zemin Huang, Weijian Luo, and Guo-Jun Qi. Schedule on the fly: Diffusion time prediction for faster and better image generation. InProceedings of the Computer Vision and Pattern Recognition Conference, pages 23412–23422, 2025

  32. [32]

    Rlinf-vla: A unified and efficient framework for vla+ rl training.arXiv preprint arXiv:2510.06710, 2025

    Hongzhi Zang, Mingjie Wei, Si Xu, Yongji Wu, Zhen Guo, Yuanqing Wang, Hao Lin, Liangzhi Shi, Yuqing Xie, Zhexuan Xu, et al. Rlinf-vla: A unified and efficient framework for vla+ rl training.arXiv preprint arXiv:2510.06710, 2025

  33. [33]

    Rlinf: Flexible and efficient large-scale reinforcement learning via macro-to-micro flow transformation.arXiv preprint arXiv:2509.15965, 2025

    Chao Yu, Yuanqing Wang, Zhen Guo, Hao Lin, Si Xu, Hongzhi Zang, Quanlu Zhang, Yongji Wu, Chunyang Zhu, Junhao Hu, et al. Rlinf: Flexible and efficient large-scale reinforcement learning via macro-to-micro flow transformation.arXiv preprint arXiv:2509.15965, 2025

  34. [34]

    RL-VLA$^3$: A Flexible and Asynchronous Reinforcement Learning Framework for VLA Training

    Zhong Guan, Haoran Sun, Yongjian Guo, Shuai Di, Xiaodong Bai, Jing Long, Tianyun Zhao, Mingxi Luo, Chen Zhou, Yucheng Guo, et al. Rl-vla3: Reinforcement learning vla accelerating via full asynchronism.arXiv preprint arXiv:2602.05765, 2026

  35. [35]

    arXiv preprint arXiv:2509.09674 , year=

    Haozhan Li, Yuxin Zuo, Jiale Yu, Yuhao Zhang, Zhaohui Yang, Kaiyan Zhang, Xuekai Zhu, Yuchen Zhang, Tianxing Chen, Ganqu Cui, et al. Simplevla-rl: Scaling vla training via reinforcement learning.arXiv preprint arXiv:2509.09674, 2025

  36. [36]

    What can rl bring to vla generalization? an empirical study.arXiv preprint, arXiv:2505.19789, 2025

    Jijia Liu, Feng Gao, Bingwen Wei, Xinlei Chen, Qingmin Liao, Yi Wu, Chao Yu, and Yu Wang. What can rl bring to vla generalization? an empirical study.arXiv preprint arXiv:2505.19789, 2025

  37. [37]

    Flow-GRPO: Training Flow Matching Models via Online RL

    Jie Liu, Gongye Liu, Jiajun Liang, Yangguang Li, Jiaheng Liu, Xintao Wang, Pengfei Wan, Di Zhang, and Wanli Ouyang. Flow-grpo: Training flow matching models via online rl.arXiv preprint arXiv:2505.05470, 2025

  38. [38]

    2505.22094 , archivePrefix =

    Tonghe Zhang, Chao Yu, Sichang Su, and Yu Wang. Reinflow: Fine-tuning flow matching policy with online reinforcement learning.arXiv preprint arXiv:2505.22094, 2025

  39. [39]

    πrl: Online rl fine-tuning for flow-based vision- language-action models.arXiv preprint arXiv:2510.25889, 2025

    Kang Chen, Zhihao Liu, Tonghe Zhang, Zhen Guo, Si Xu, Hao Lin, Hongzhi Zang, Quanlu Zhang, Zhaofei Yu, Guoliang Fan, et al. πrl: Online rl fine-tuning for flow-based vision- language-action models.arXiv preprint arXiv:2510.25889, 2025

  40. [40]

    Scalable diffusion models with transformers

    William Peebles and Saining Xie. Scalable diffusion models with transformers. InProceedings of the IEEE/CVF international conference on computer vision, pages 4195–4205, 2023

  41. [41]

    RoboTwin 2.0: A Scalable Data Generator and Benchmark with Strong Domain Randomization for Robust Bimanual Robotic Manipulation

    Tianxing Chen, Zanxin Chen, Baijun Chen, Zijian Cai, Yibin Liu, Zixuan Li, Qiwei Liang, Xianliang Lin, Yiheng Ge, Zhenyu Gu, et al. Robotwin 2.0: A scalable data generator and benchmark with strong domain randomization for robust bimanual robotic manipulation.arXiv preprint arXiv:2506.18088, 2025. 12 A Limitations The main limitation of the current schedu...